SolidWorks

The Digital Preservation Coalition recently released their tech watch report on Preserving Geospatial Data. This adds to reports on CAD, Construction, and others. One of the many areas of difficulties in Digital Preservation is understanding these areas of GIS, CAD, and 3D Modeling software and the file formats which belong to the software titles in this space. Not only are the file formats plentiful but the software is extensive and expensive. Documentation is lacking in understanding the different file formats associated with each software title. These tech watch reports are super useful, but more is needed to enhance the tools we use to better identify, validate, and transform these formats in order to preserve them long term.

I was processing some data sets from a recent collection added to our Scholarly repository and came across some models in the SolidWorks part format. I was surprised to find that this format has been around since 1995 and has yet to be added to the PRONOM registry.

SolidWorks is mechanical design software used for making 3D models which can be made to be individual parts, part of larger assemblies and added to drawings giving engineers access to 3D deisgn on their desktops. Bought by Dassault Systèmes in 1997, they are the makers of the CATIA CAD software. Since 1995 a new version was released almost every year, adding new features and improvements to the format. The original versions made use of the Microsoft OLE object container, but in 2015 the format shifted to a proprietary binary format. Let’s take a look at some samples.

There are three types of SolidWorks file formats, the SolidWork part (sldprt), the assembly (sldasm), and drawing (slddrw). The first versions of SolidWorks used prt, asm, and drw, but quickly added “sld” to avoid confusion with other CAD tools.

Path = flatann.sldprt
Type = Compound
Physical Size = 5851648
Extension = compound
Cluster Size = 512
Sector Size = 64

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
1997-08-05 08:34:21 D....                            Contents
                    .....        60844        60928  Header
                    .....        45022        45056  Preview
1997-08-05 08:34:06 D....                            ThirdPty
                    .....          237          256  [5]SummaryInformation
1997-08-05 08:34:18 D....                            _MO_VERSION_629
                    .....          157          192  _MO_VERSION_629/History
                    .....          126          128  [5]DocumentSummaryInformation
                    .....       996343       996352  Contents/Definition
                    .....      1003198      1003520  Contents/Default
                    .....       781536       781824  Contents/DisplayLists
------------------- ----- ------------ ------------  ------------------------
1997-08-05 08:34:21            2887463      2888256  8 files, 3 folders

We can see this file is a compound (OLE) container file. It’s very useful to have a directory within the container with a version number. With this version number we can use the chart on the file format wiki to see this file was last modified by SolidWorks 97 Plus. The problem comes in when we look at an assembly file and compare.

Path = dispenser.sldasm
Type = Compound
Physical Size = 2143232
Extension = compound
Cluster Size = 512
Sector Size = 64

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
1997-03-19 17:29:16 D....                            ThirdPty
                    .....        16812        16896  Preview
                    .....         4655         5120  Header
1997-09-04 15:30:48 D....                            Contents
                    .....      1009461      1009664  Contents/DisplayLists
                    .....        23931        24064  Contents/Definition
                    .....          237          256  [5]SummaryInformation
1997-09-04 15:35:39 D....                            _MO_VERSION_629
                    .....          107          128  _MO_VERSION_629/History
                    .....          126          128  [5]DocumentSummaryInformation
------------------- ----- ------------ ------------  ------------------------
1997-09-04 15:35:39            1055329      1056256  7 files, 3 folders

Almost the same contents, the same version directory. The only difference in content is the file Defaults in the Contents directory. But hard to know if all have the same difference. We will have to look closer at the individual files to hopefully find what sets the different formats apart.

The SolidWorks 2000 format added additional files to the container which can help.

Path = SW2000-s01.SLDPRT
Type = Compound
Physical Size = 20992
Extension = compound
Cluster Size = 512
Sector Size = 64

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2024-01-16 20:00:51 D....                            _DL_VERSION_1500
                    .....         5300         5632  Preview
                    .....          481          512  Header
2024-01-16 20:00:51 D....                            Contents
2024-01-16 20:00:51 D....                            ThirdPty
                    .....            4           64  Contents/OleItems
                    .....           69          128  Contents/CMgrHdr
                    .....          343          384  Contents/CMgr
                    .....         5456         5632  Contents/Config-0
                    .....          592          640  Contents/DisplayLists__Zip
                    .....          957          960  Contents/Definition
                    .....          252          256  [5]SummaryInformation
2024-01-16 20:00:51 D....                            _MO_VERSION_1500
                    .....          840          896  _MO_VERSION_1500/Biography
                    .....           98          128  _MO_VERSION_1500/History
                    .....          148          192  [5]DocumentSummaryInformation
                    .....          120          128  ISolidWorksInformation
                    .....            6           64  _DL_VERSION_1500/DLUpdateStamp
------------------- ----- ------------ ------------  ------------------------
2024-01-16 20:00:51              14666        15616  14 files, 4 folders

The introduction of the “ISolidWorksInformation” file helps give positive identification of the SolidWorks format.

hexdump -C SW2000-s01.SLDPRT/ISolidWorksInformation      
00000000  fe ff 00 00 04 0a 02 00  02 d5 cd d5 9c 2e 1b 10  |................|
00000010  93 97 08 00 2b 2c f9 ae  01 00 00 00 05 d5 cd d5  |....+,..........|
00000020  9c 2e 1b 10 93 97 08 00  2b 2c f9 ae 30 00 00 00  |........+,..0...|
00000030  48 00 00 00 02 00 00 00  02 00 00 00 18 00 00 00  |H...............|
00000040  00 00 00 00 24 00 00 00  1e 00 00 00 01 00 00 00  |....$...........|
00000050  00 00 00 00 02 00 00 00  00 00 00 00 01 00 00 00  |................|
00000060  00 02 00 00 00 0d 00 00  00 53 57 2d 46 69 6c 65  |.........SW-File|
00000070  20 4e 61 6d 65 00 00 00                           | Name...|

hexdump -C SW2000-s02.SLDASM/ISolidWorksInformation
00000000  fe ff 00 00 04 0a 02 00  02 d5 cd d5 9c 2e 1b 10  |................|
00000010  93 97 08 00 2b 2c f9 ae  01 00 00 00 05 d5 cd d5  |....+,..........|
00000020  9c 2e 1b 10 93 97 08 00  2b 2c f9 ae 30 00 00 00  |........+,..0...|
00000030  6c 00 00 00 03 00 00 00  02 00 00 00 20 00 00 00  |l........... ...|
00000040  03 00 00 00 2c 00 00 00  00 00 00 00 34 00 00 00  |....,.......4...|
00000050  1e 00 00 00 01 00 00 00  00 00 00 00 0b 00 00 00  |................|
00000060  00 00 00 00 03 00 00 00  00 00 00 00 01 00 00 00  |................|
00000070  00 03 00 00 00 0e 00 00  00 41 73 73 65 6d 62 6c  |.........Assembl|
00000080  79 20 74 79 70 65 00 02  00 00 00 0d 00 00 00 53  |y type.........S|
00000090  57 2d 46 69 6c 65 20 4e  61 6d 65 00              |W-File Name.|

hexdump -C SW2000-s01.SLDDRW/ISolidWorksInformation
00000000  fe ff 00 00 04 0a 02 00  02 d5 cd d5 9c 2e 1b 10  |................|
00000010  93 97 08 00 2b 2c f9 ae  01 00 00 00 05 d5 cd d5  |....+,..........|
00000020  9c 2e 1b 10 93 97 08 00  2b 2c f9 ae 30 00 00 00  |........+,..0...|
00000030  bc 01 00 00 0a 00 00 00  02 00 00 00 58 00 00 00  |............X...|
00000040  03 00 00 00 64 00 00 00  04 00 00 00 70 00 00 00  |....d.......p...|
00000050  05 00 00 00 7c 00 00 00  06 00 00 00 88 00 00 00  |....|...........|
*
000000d0  05 00 00 00 52 27 a0 89  b0 e1 d1 3f 05 00 00 00  |....R'.....?....|
000000e0  51 6b 9a 77 9c a2 cb 3f  03 00 00 00 00 00 00 00  |Qk.w...?........|
000000f0  0a 00 00 00 00 00 00 00  01 00 00 00 00 04 00 00  |................|
00000100  00 15 00 00 00 53 57 2d  53 68 65 65 74 20 46 6f  |.....SW-Sheet Fo|
00000110  72 6d 61 74 20 53 69 7a  65 00 05 00 00 00 11 00  |rmat Size.......|
00000120  00 00 53 57 2d 43 75 72  72 65 6e 74 20 53 68 65  |..SW-Current She|
00000130  65 74 00 08 00 00 00 19  00 00 00 41 63 74 69 76  |et.........Activ|
00000140  65 20 73 68 65 65 74 20  70 61 70 65 72 20 77 69  |e sheet paper wi|
00000150  64 74 68 00 02 00 00 00  0d 00 00 00 53 57 2d 46  |dth.........SW-F|
00000160  69 6c 65 20 4e 61 6d 65  00 09 00 00 00 14 00 00  |ile Name........|
00000170  00 41 63 74 69 76 65 20  73 68 65 65 74 20 48 65  |.Active sheet He|
00000180  69 67 68 74 00 07 00 00  00 0e 00 00 00 53 57 2d  |ight.........SW-|
00000190  53 68 65 65 74 20 4e 61  6d 65 00 0a 00 00 00 18  |Sheet Name......|
000001a0  00 00 00 41 63 74 69 76  65 20 73 68 65 65 74 20  |...Active sheet |
000001b0  70 61 70 65 72 20 73 69  7a 65 00 03 00 00 00 0f  |paper size......|
000001c0  00 00 00 53 57 2d 53 68  65 65 74 20 53 63 61 6c  |...SW-Sheet Scal|
000001d0  65 00 06 00 00 00 10 00  00 00 53 57 2d 54 6f 74  |e.........SW-Tot|
000001e0  61 6c 20 53 68 65 65 74  73 00 00 00              |al Sheets...|

Starting in 2015 the format changed from an OLE container, to a binary file. Here is what the first few bytes look like from a 2015 file and a later 2023 file:

hexdump -C Bracket.SLDPRT | head
00000000  9f e4 18 9f 00 00 00 04  26 00 42 15 14 00 06 00  |........&.B.....|
00000010  08 00 06 00 40 a5 c3 a7  0e 51 5b 03 00 00 91 07  |....@....Q[.....|
00000020  00 00 0d 00 00 00 34 f6  e6 47 56 e6 47 37 f2 34  |......4..GV.G7.4|
00000030  d4 76 27 b5 55 5d 48 14  51 14 3e ab 2e f6 63 65  |.v'.U]H.Q.>...ce|
00000040  8b be 55 2e 42 0f 45 89  05 16 68 3a 93 eb f6 03  |..U.B.E...h:....|
00000050  ab 2e ae 89 d4 c2 3a ee  ce ae 53 bb 3b cb cc 2e  |......:...S.;...|
00000060  18 42 0d f8 16 41 3d 95  42 94 24 41 b0 3d 54 14  |.B...A=.B.$A.=T.|
00000070  fd 68 ad 52 0f 45 54 06  61 84 d1 0f 52 3e 44 20  |.h.R.ET.a...R>D |
00000080  48 af 6e e7 cc cc dd 3f  5d ea a5 3b dc 3d df f9  |H.n....?]..;.=..|
00000090  b9 e7 9c 7b ef b9 67 d3  69 0b 54 41 44 76 c8 d1  |...{..g.i.TADv..|

hexdump -C SW2023-s01.SLDPRT | head
00000000  f4 e9 02 fc 00 00 00 04  51 3f 60 ad 6a 35 f9 b3  |........Q?`.j5..|
00000010  14 00 06 00 08 00 a8 8c  60 c0 d0 05 00 00 74 01  |........`.....t.|
00000020  00 00 e8 02 00 00 07 00  00 00 05 27 56 67 96 56  |...........'Vg.V|
00000030  77 d6 df ea 07 e7 cf ed  c6 8e 6c a1 48 70 d6 76  |w.........l.Hp.v|
00000040  cd 16 7f e9 6b 95 3a 4e  bb 6e 95 cc d2 b3 69 a9  |....k.:N.n....i.|
00000050  72 6b af c7 82 38 95 6f  bc 37 d2 4e a6 28 36 bd  |rk...8.o.7.N.(6.|
00000060  c3 cf 85 46 0a 85 63 97  83 56 88 a1 38 02 64 14  |...F..c..V..8.d.|
00000070  00 06 00 08 00 a8 8c 60  c0 44 07 00 00 d1 01 00  |.......`.D......|
00000080  00 a2 03 00 00 22 00 00  00 34 f6 e6 47 56 e6 47  |....."...4..GV.G|
00000090  37 f2 34 f6 e6 66 96 76  d2 03 d2 25 56 37 f6 c6  |7.4..f.v...%V7..|

The newer version of the format is much different and is in a proprietary binary format with no specifications, which makes it much more difficult to know which parts of the file can be used for identification. All these new formats have the hex values “00 00 00 04” as bytes 4 through 7. Not very unique for identification. There is another set of bytes which does seem to be consistent for all samples so far, but they vary in their location. The values “34 f6 e6 47 56 e6 47 37 f2” seem to be in every sample. The 10th byte often has the value 34, but in many samples either has 34, B4, 44, 64, or 33. The other formats, SLDASM and SLDDRW also have this pattern which might give us enough to make a good signature. At this time we may not be able to distinguish the different formats, but maybe in the future.

More work is needed to really develop signatures that can identify each format from SolidWorks definitely. My initial assumptions we not completely correct and there are a few exceptions to the patterns I felt were good enough. One unknown is the formats from SolidWorks 95 through 99 and properly identifying them. More samples are needed. I have placed my initial signature and some samples on my GitHub. Please get in tough if you have additional samples or ideas on better identification.

Universal Scene Description

A few years ago I became obsessed with creating 3D models from physical objects. There was an app on my iPhone called 123D Catch by AutoDesk and it allowed you to take a series of photos with your iPhone camera, then combine them to create a 3D Model. This lead me down a path to eventually take a course on Photogrammetry and develop a process for capturing objects in our Museum.

Autodesk eventually discontinued the app and built the technology into their paid products. This is when we started seeing lidar introduced with handheld devices. The first one I tried was using my XBOX360 kinect sensor with the skanect software. The quality was horrible, but was fun to learn about depth sensors and structure from motion. When the iPhone finally came out with lidar sensor it was like Apple had read my mind. I love having the ability to capture objects I find into 3D models. The quality is pretty good, not as good as taking the time to capture image sets for photogrammetry and use tools like Aigisoft metashape, but apps like Scaniverse do a fantastic job. You can check out some of the models I have captured on my Sketchfab page.

With any new technology comes new file formats, and 3D formats are definitely no exception. It seems every software developer has to come up with their own proprietary format leaving the digital preservation folks scrambling to keep up. The DPC and Archivematica published a report a couple years ago and state:

“There are many challenges in preserving 3D data. As well as the complexity of the data itself, there is
a lack of interoperability between the different (often proprietary) systems that are used to create
and manipulate 3D models. Relationships to other data, software and hardware also need to be
captured and managed effectively.”

https://www.dpconline.org/docs/technology-watch-reports/2479-preserving-3d/file

With my new iPhone in hand I found myself with new file format I was unfamiliar with. Universal Scene Description is a framework to exchanging 3D data between different software developed by Pixar. The relationship between Apple and Pixar goes way back so it was no surprise the Apple iPhone has support built in for this new format and I found myself capturing and sending 3D models to others with an iPhone. The USDZ format is a ZIP package format for containing a USD 3D model and is perfect for sharing and preserving.

There is no current PRONOM signatures for identifying USD formats, so I wanted to look into creating one. This is where I ran into a problem. The current PRONOM signature syntax has no way of properly identifying the USDZ format. Let me explain.

When DROID or Siegfried is used to identify a container format such as USDZ. It will first identify the format as a ZIP file, which technically it is. This triggers the software to then refer to the container signature to see if any patterns from the files internal to the ZIP match to a known format. This is done by pointing to a specific file and a hex pattern or ascii string within the file. In the case of a USDZ the internal structure may look like this:

Listing archive: scaniverse-20210928-113055.usdz

--
Path = scaniverse-20210928-113055.usdz
Type = zip
Physical Size = 5702256

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2021-09-28 11:47:36 .....       297999       297999  scaniverse-20210928-113055.usdc
2021-09-28 11:47:36 .....      5403849      5403849  0/texgen_0.jpg
------------------- ----- ------------ ------------  ------------------------
2021-09-28 11:47:36            5701848      5701848  2 files

In this sample file the name of the USDZ is the same name as the internal USDC file. So the name of the USDC is variable and DROID needs a static name and path to look for patterns. The USDZ specification is clear that the only required file inside a USDZ is a USD model, anything else is ancillary and is not always going to be included. Currently the only format used is USDC, but in the future may allow a simple USD or USDA format. In addition, some of the other sample files show a very nested USDC file, making identification even more difficult.

Listing archive: Scan.usdz

--
Path = Scan.usdz
Type = zip
Physical Size = 19155195
Characteristics = Minor_Extra_ERROR

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2021-03-09 09:22:36 .....     19154773     19154773  /private/var/mobile/Containers/Data/Application/EFD09E66-32FB-4B08-8BED-B7E3D78FE1A8/tmp/Scan.usdc
------------------- ----- ------------ ------------  ------------------------
2021-03-09 09:22:36           19154773     19154773  1 files

The USDZ format is not the only file format which makes identification difficult through variable names and non-static patterns. An issue on GitHub has been raised to address this problem. One potential fix is to use glob patterns as suggested by the amazing Richard Lehane, creator of Siegfried. This way we could use wildcard to ignore the variable names and find any file with an extension of .USDC for example. The USDC file format has a nice 8 byte header “PXR-USDC” which is perfectly suited for identification so our container signature might look like this:

<ContainerSignature Id="1000" ContainerType="ZIP">
  <Description>USDZ 3D Package</Description>
    <Files>
	<File>
          <Path>*.usdc</Path>
              <BinarySignatures>
                  <InternalSignatureCollection>                    
	             <InternalSignature ID="300">
	                 <ByteSequence Reference="BOFoffset">
	                     <SubSequence Position="1" SubSeqMinOffset="0" SubSeqMaxOffset="0">
	                         <Sequence>50 58 52 2D 55 53 44 43</Sequence>
	                     </SubSequence>
	                 </ByteSequence>
	             </InternalSignature>
	          </InternalSignatureCollection>
             </BinarySignatures>
        </File>           
    </Files>
</ContainerSignature>

Update: I was able to get a beta version of Siegfried working with my test signature.

siegfried   : 1.11.0
scandate    : 2023-06-02T08:54:27-06:00
signature   : default.sig
created     : 2023-06-02T08:52:33-06:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V112.xml; container-signature-20230510.xml; extensions: usdz-signature-file-v1.xml; container extensions: usdz-dev1-signaturefile-20230601.xml'
---
filename : 'scaniverse-20210928-113055.usdz'
filesize : 5702256
modified : 2021-09-28T11:47:37-06:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'BYUdev/1'
    format  : 'USDZ 3D Package'
    version : 
    mime    : 'model/vnd.usdz+zip'
    class   : 
    basis   : 'extension match usdz; container name scaniverse-20210928-113055.usdc with byte match at 0, 8 (signature 1/2)'
    warning : 

I am still in the process of testing some beta versions of Siegfried in hopes of getting the glob matching to work, but still have more to do. Stay tuned!