PNG Plus

Usually in the software world file formats are fairly efficient, the structure is meant to provide a way to store the data of the software being used. There isn’t much need to add additional unnecessary additions. This isn’t always true, but in the early days, disk space was expensive so compression and efficiency ruled. There also wasn’t much need to hide anything or complicate things. That is unless it is intended. This makes me think of two things, Polyglots and Steganography.

Steganography is the art of embedding data within an image. With digital images you can hide another image within the main image by using the most and least significant bits. Fun use of technology, but not something you normally would find in your regular desktop software.

Ange is the master at polyglots. If you haven’t watched his presentation on funky file formats, you are missing out.

Imagine my surprise when I was researching the Picture It! software and the MIX file format only to discover Microsoft decided to make their own polyglot of sorts for their PNG Plus format which replaced the MIX format, then both obsolete when Digital Image was discontinued in 2007. The PNG Plus format was the native format for the Microsoft Picture It! and Digital Image software often found with the Microsoft Works or Digital Imaging suite of software.

Save Menu from Digital Image Pro

According to the help within Digital Image:

The PNG Plus format uses the standard PNG extension but provides saving of layers and pages within the PNG format. Since the PNG format cannot do this natively, how did Microsoft accomplish this? Well, by throwing an OLE container into the middle of the file of course!

PNG Plus files are your regular PNG format and will identify as such. But they are just a low resolution thumbnail of the full image. Let’s take a look:

exiftool PictureIt7-s02.png 
ExifTool Version Number         : 12.70
File Name                       : PictureIt7-s02.png
File Size                       : 26 kB
File Modification Date/Time     : 2023:12:26 22:01:58-07:00
File Access Date/Time           : 2024:01:01 12:31:07-07:00
File Inode Change Date/Time     : 2023:12:26 22:01:58-07:00
File Permissions                : -rwx------
File Type                       : PNG
File Type Extension             : png
MIME Type                       : image/png
Image Width                     : 500
Image Height                    : 333
Bit Depth                       : 8
Color Type                      : RGB with Alpha
Compression                     : Deflate/Inflate
Filter                          : Adaptive
Interlace                       : Noninterlaced
SRGB Rendering                  : Perceptual
Gamma                           : 2.2
White Point X                   : 0.3127
White Point Y                   : 0.329
Red X                           : 0.64
Red Y                           : 0.33
Green X                         : 0.3
Green Y                         : 0.6
Blue X                          : 0.15
Blue Y                          : 0.06
Warning                  : [minor] Text/EXIF chunk(s) found after PNG IDAT (may be ignored by some readers)
Title                           : PictureIt7-s02
Image Size                      : 500x333
Megapixels                      : 0.167

Looks like there is some additional data after the IDAT chunk.

hexdump -C PictureIt7-s02.png | head
00000000  89 50 4e 47 0d 0a 1a 0a  00 00 00 0d 49 48 44 52  |.PNG........IHDR|
00000010  00 00 01 f4 00 00 01 4d  08 06 00 00 00 f6 13 9d  |.......M........|
00000020  37 00 00 00 01 73 52 47  42 00 ae ce 1c e9 00 00  |7....sRGB.......|
00000030  00 04 67 41 4d 41 00 00  b1 8f 0b fc 61 05 00 00  |..gAMA......a...|
00000040  00 20 63 48 52 4d 00 00  7a 26 00 00 80 84 00 00  |. cHRM..z&......|
00000050  fa 00 00 00 80 e8 00 00  75 30 00 00 ea 60 00 00  |........u0...`..|
00000060  3a 98 00 00 17 70 9c ba  51 3c 00 00 24 f4 49 44  |:....p..Q<..$.ID|
00000070  41 54 78 5e ed dd 4d a8  15 57 be 28 f0 1e 08 1e  |ATx^..M..W.(....|
00000080  e3 47 8e 49 ab c7 d8 81  03 09 41 9c 28 38 e8 80  |.G.I......A.(8..|
00000090  d0 9c 0e 08 0e 1a 11 c2  15 07 5e 5a 07 4d c7 2b  |..........^Z.M.+|

The header looks the same as any PNG file, so lets look a little further:

00002560  ff 1f fa 5f 90 66 c9 e6  ad 88 00 00 00 00 63 6d  |..._.f........cm|
00002570  4f 44 4e 88 09 c1 00 00  40 00 63 70 49 70 d0 cf  |ODN.....@.cpIp..|
00002580  11 e0 a1 b1 1a e1 00 00  00 00 00 00 00 00 00 00  |................|
00002590  00 00 00 00 00 00 3e 00  03 00 fe ff 09 00 06 00  |......>.........|
000025a0  00 00 00 00 00 00 00 00  00 00 01 00 00 00 01 00  |................|
000025b0  00 00 00 00 00 00 00 10  00 00 02 00 00 00 01 00  |................|
*
00002970  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff 52 00  |..............R.|
00002980  6f 00 6f 00 74 00 20 00  45 00 6e 00 74 00 72 00  |o.o.t. .E.n.t.r.|
00002990  79 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |y...............|
000029a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000029b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 16 00  |................|
000029c0  05 00 ff ff ff ff ff ff  ff ff 01 00 00 00 7e 7f  |..............~.|
000029d0  3f b5 a5 f6 86 43 a1 a1  a3 02 24 d2 88 ef 00 00  |?....C....$.....|
000029e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000029f0  00 00 03 00 00 00 40 12  00 00 00 00 00 00 44 00  |......@.......D.|
00002a00  61 00 74 00 61 00 53 00  74 00 6f 00 72 00 65 00  |a.t.a.S.t.o.r.e.|
00002a10  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003930  00 00 00 00 00 00 00 00  00 00 00 00 00 00 43 48  |..............CH|
00003940  4e 4b 49 4e 4b 20 04 00  07 00 0c 00 00 03 00 02  |NKINK ..........|
00003950  00 00 00 0a 00 00 f8 01  0c 00 ff ff ff ff 18 00  |................|
00003960  54 45 58 54 00 00 01 00  00 00 54 45 58 54 00 02  |TEXT......TEXT..|
00003970  00 00 22 00 00 00 18 00  46 44 50 50 00 00 43 00  |..".....FDPP..C.|
00003980  4f 00 4e 00 54 00 45 00  4e 00 54 00 53 00 00 00  |O.N.T.E.N.T.S...|
00003990  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000039f0  00 00 1f 00 00 00 00 0a  00 00 00 00 00 00 01 00  |................|
00003a00  43 00 6f 00 6d 00 70 00  4f 00 62 00 6a 00 00 00  |C.o.m.p.O.b.j...|
00003a10  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00004530  00 00 00 00 00 00 00 00  00 00 00 00 00 00 01 00  |................|
00004540  fe ff 03 0a 00 00 ff ff  ff ff 00 00 00 00 00 00  |................|
00004550  00 00 00 00 00 00 00 00  00 00 1a 00 00 00 51 75  |..............Qu|
00004560  69 6c 6c 39 36 20 53 74  6f 72 79 20 47 72 6f 75  |ill96 Story Grou|
00004570  70 20 43 6c 61 73 73 00  ff ff ff ff 01 00 00 00  |p Class.........|
00004580  00 00 00 00 f4 39 b2 71  00 00 00 00 00 00 00 00  |.....9.q........|
00004590  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00006570  00 00 00 00 00 00 00 00  00 00 00 00 00 00 ba 84  |................|
00006580  43 51 00 00 00 18 69 54  58 74 54 69 74 6c 65 00  |CQ....iTXtTitle.|
00006590  00 00 00 00 50 69 63 74  75 72 65 49 74 37 2d 73  |....PictureIt7-s|
000065a0  30 32 3a 70 9c 00 00 00  00 14 74 45 58 74 54 69  |02:p......tEXtTi|
000065b0  74 6c 65 00 50 69 63 74  75 72 65 49 74 37 2d 73  |tle.PictureIt7-s|
000065c0  30 32 f2 8f d5 89 00 00  00 00 49 45 4e 44 ae 42  |02........IEND.B|
000065d0  60 82                                             |`.|

What what do we have here? Near the end of the file before the IEND chunk is an OLE file with the very recognizable hex values of “D0CF11E0“. Let’s strip out the OLE file and take a look.

Path = PictureIt7-s02-ole
Type = Compound
WARNINGS:
There are data after the end of archive
Physical Size = 8704
Tail Size = 7764
Extension = compound
Cluster Size = 512
Sector Size = 64

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2023-12-26 22:01:58 D....                            DataStore
2023-12-26 22:01:58 D....                            Text
                    .....         2560         2560  Text/CONTENTS
                    .....           86          128  Text/[1]CompObj
                    .....           96          128  DataStore/3
                    .....            4           64  DataStore/1
                    .....          121          128  DataStore/0
                    .....           57           64  DataStore/2
                    .....           98          128  DataStore/5
                    .....            4           64  DataStore/4
                    .....         1254         1280  DataStore/7
                    .....            4           64  DataStore/6
                    .....            4           64  DataStore/8
------------------- ----- ------------ ------------  ------------------------
2023-12-26 22:01:58               4288         4672  11 files, 2 folders

Interesting, I don’t think I have come across a standard format with a container embedded within. I have come across many OLE and ZIP containers which contain other common formats within, but this format is definitely unique. Others have added features in the IDAT chunk, such as a web shell. I am sure there are others out there. The CompObj file found within the Text directory is very similar to the Microsoft Works and Publisher format. Although trying to open the file in Publisher doesn’t work!

hexdump -C PictureIt7-s02-ole/Text/\[1\]CompObj | head
00000000  01 00 fe ff 03 0a 00 00  ff ff ff ff 00 00 00 00  |................|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 1a 00 00 00  |................|
00000020  51 75 69 6c 6c 39 36 20  53 74 6f 72 79 20 47 72  |Quill96 Story Gr|
00000030  6f 75 70 20 43 6c 61 73  73 00 ff ff ff ff 01 00  |oup Class.......|
00000040  00 00 00 00 00 00 f4 39  b2 71 00 00 00 00 00 00  |.......9.q......|
00000050  00 00 00 00 00 00                                 |......|

PRONOM uses binary and container signatures to identify file formats. Even though this file format contains a valid OLE container, because it is within a regular binary file format, I don’t believe a container signature would work. The difficulty will be to clearly identify this new format without falsely identifying a regular PNG instead. The OLE file format header is not in a consistent location to use a specific offset. Making the string a variable location can causes some undo processing, so lets look to see if there is anything else we can use to make a positive ID.

The PNG file format is based on chunks, you have to have IHDR, then an IDAT and the IEND chunk. If we take a look at a regular PNG file using a libpng tool pngcheck, we see this:

pngcheck -cvt rgb-8.png 
File: rgb-8.png (759 bytes)
  chunk IHDR at offset 0x0000c, length 13
    256 x 256 image, 24-bit RGB, non-interlaced
  chunk tEXt at offset 0x00025, length 44, keyword: Copyright
    ? 2013,2015 John Cunningham Bowler
  chunk iTXt at offset 0x0005d, length 116, keyword: Licensing
    compressed, language tag = en
    no translated keyword, 101 bytes of UTF-8 text
  chunk IDAT at offset 0x000dd, length 518
    zlib: deflated, 32K window, maximum compression
  chunk IEND at offset 0x002ef, length 0
No errors detected in rgb-8.png (5 chunks, 99.6% compression).

The required chunk are there, but a couple extra, the tEXt and iTXt, which are textual metadata you can add. Now lets look at a PNG Plus file:

pngcheck -cvt PictureIt7-s02.png         
File: PictureIt7-s02.png (26066 bytes)
  chunk IHDR at offset 0x0000c, length 13
    500 x 333 image, 32-bit RGB+alpha, non-interlaced
  chunk sRGB at offset 0x00025, length 1
    rendering intent = perceptual
  chunk gAMA at offset 0x00032, length 4: 0.45455
  chunk cHRM at offset 0x00042, length 32
    White x = 0.3127 y = 0.329,  Red x = 0.64 y = 0.33
    Green x = 0.3 y = 0.6,  Blue x = 0.15 y = 0.06
  chunk IDAT at offset 0x0006e, length 9460
    zlib: deflated, 32K window, fast compression
  chunk cmOD at offset 0x0256e, length 0
    Microsoft Picture It private, ancillary, unsafe-to-copy chunk
  chunk cpIp at offset 0x0257a, length 16384
    Microsoft Picture It private, ancillary, safe-to-copy chunk
  chunk iTXt at offset 0x06586, length 24, keyword: Title
    uncompressed, no language tag
    no translated keyword, 15 bytes of UTF-8 text
  chunk tEXt at offset 0x065aa, length 20, keyword: Title
    PictureIt7-s02
  chunk IEND at offset 0x065ca, length 0
No errors detected in PictureIt7-s02.png (10 chunks, 96.1% compression).

It looks like we have the required chunks and some textual chunks but also a couple chunks which pngcheck describes as private and identify’s them as Microsoft Picture It chunks. The cpIp chunk is the one which contains the OLE container. This is the chunk we need to identify in a signature. The problem is the offset for the cpIp chunk is not the same each time. Here is one from Digital Image 10 Pro.

  chunk cpIp at offset 0x737a7, length 245760
    Microsoft Picture It private, ancillary, safe-to-copy chunk

Significantly further in the file that the other example. These samples currently identify as PNG 1.2 files. PRONOM fmt/13 so we can use the signature and add to it, but it currently doesn’t look for IDAT only the iTXt chunk, which is probably not optimal. For PNG Plus, lets get the header which includes IHDR, IDAT, then the cpIp chunk then an end of file sequence for IEND. Take a look at my signature and samples, I am curious how many PNG Plus files are out there hidden to the world.

Turns out there is another PNG flavor which has been enhanced to allow for layers and pages. Adobe Fireworks uses a PNG format as their native format. They also use private chunks, but not within an OLE container. They use additional chunks, but before the IDAT chunk:

  chunk prVW at offset 0x00092, length 1700
    Macromedia Fireworks preview chunk (private, ancillary, unsafe to copy)
  chunk mkBF at offset 0x00742, length 72
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkTS at offset 0x00796, length 36716
    Macromedia Fireworks(?) private, ancillary, unsafe-to-copy chunk
  chunk mkBS at offset 0x0970e, length 190
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x097d8, length 1251
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x09cc7, length 1358
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x0a221, length 1145
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x0a6a6, length 339
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x0a805, length 695
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x0aac8, length 3799
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x0b9ab, length 7733
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x0d7ec, length 2741
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x0e2ad, length 5153
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x0f6da, length 10775
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk

It’s hard to know which each of the chunks are for and if they are all required for the Fireworks PNG format. From the book on PNG.

In addition to supporting PNG as an output format, Fireworks actually uses PNG as its native file format for day-to-day intermediate saves. This is possible thanks to PNG’s extensible “chunk-based” design, which allows programs to incorporate application-specific data in a well-defined way. Macromedia has embraced this capability, defining at least four custom chunk types that hold various things pertinent to the editor. Unfortunately, one of them (pRVW) violates the PNG naming rules by claiming to be an officially registered, public chunk type, but this was an oversight and should be fixed in version 2.0.

Picture It!

Most everyone has heard of Microsoft Office, the suite of applications used by millions everyday. Less people know about Microsoft Works, which was a lower cost alternative, but was quite popular as a home office suite of applications. One tool which often came with the Works suite was a digital image tool called Picture It!

Picture It! was a photo editing tool first released by Microsoft in 1996 geared to making photo editing easy and affordable.

Picture It! used a wizard type interface which walked you through acquiring an image and adding to it. One of the key features of the software was the ability to “stack” objects like layers. Because of this feature a new file format was used to save this information to disk. Meet the Microsoft Image (Picture) Extension format, commonly known as the MIX file format. It is very similar to the FlashPix image format, which was supposed to be an image file format to solve many delivery issues, but didn’t seem to gain hold despite being created by Kodak, HP, and others. In fact many of the MIX files I found on Microsoft disks are actually FlashPix files.

The MIX extension was also used by another Microsoft program, PhotoDraw, which causes confusion as they were similar, but PhotoDraw has some added features which may not be compatible with Picture It!. Both formats are based on the Microsoft Compound Object (OLE) container, and have a similar structure. Let’s take a look at a MIX file from Picture It! version 1.

7z l PictureIt1-s02.mix                 

--
Path = PictureIt1-s02.mix
Type = Compound
Physical Size = 48128
Extension = compound
Cluster Size = 512
Sector Size = 64

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
                    .....          328          384  [5]Data Object 000001
                    .....          396          448  [5]Transform 000004
                    .....          872          896  [5]Operation 000001
                    .....          320          320  [1]CompObj
                    .....          292          320  [5]Global Info
                    .....          872          896  [5]Operation 000002
                    .....          144          192  [5]Operation 000003
                    .....          684          704  [5]Transform 000008
                    .....         1028         1088  [5]Transform 000009
                    .....          328          384  [5]Data Object 000009
                    .....          324          384  [5]Data Object 000005
2023-12-27 11:04:39 D....                            Data Object Store 000001
                    .....          328          384  [5]Data Object 000010
                    .....        20932        20992  [5]SummaryInformation
                    .....          200          256  [5]Microsoft Embedding Info
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0001
                    .....         1400         1408  Data Object Store 000001/[5]Image Contents
                    .....          230          256  Data Object Store 000001/[1]CompObj
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0000
                    .....           28           64  Data Object Store 000001/Resolution 0000/Subimage 0000 Data
                    .....           80          128  Data Object Store 000001/Resolution 0000/Subimage 0000 Header
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0003
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0002
                    .....           28           64  Data Object Store 000001/Resolution 0002/Subimage 0000 Data
                    .....          208          256  Data Object Store 000001/Resolution 0002/Subimage 0000 Header
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0005
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0004
                    .....           28           64  Data Object Store 000001/Resolution 0004/Subimage 0000 Data
                    .....         1792         1792  Data Object Store 000001/Resolution 0004/Subimage 0000 Header
                    .....          124          128  Data Object Store 000001/[5]SummaryInformation
                    .....           28           64  Data Object Store 000001/Resolution 0005/Subimage 0000 Data
                    .....         6976         7168  Data Object Store 000001/Resolution 0005/Subimage 0000 Header
                    .....           28           64  Data Object Store 000001/Resolution 0003/Subimage 0000 Data
                    .....          544          576  Data Object Store 000001/Resolution 0003/Subimage 0000 Header
                    .....           28           64  Data Object Store 000001/Resolution 0001/Subimage 0000 Data
                    .....          128          128  Data Object Store 000001/Resolution 0001/Subimage 0000 Header
------------------- ----- ------------ ------------  ------------------------
2023-12-27 11:04:39              38698        39872  29 files, 7 folders

This is a simple MIX file with one line of text, but contains a lot of content inside the OLE container. If I try and use the PRONOM registry to identify the file, I get:

sf PictureIt1-s02.mix 
---
siegfried   : 1.11.0
scandate    : 2023-12-27T11:06:32-07:00
signature   : default.sig
created     : 2023-12-17T15:54:41+01:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V116.xml; container-signature-20231127.xml'
---
filename : 'PictureIt1-s02.mix'
filesize : 48128
modified : 2023-12-27T11:04:40-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/111'
    format  : 'OLE2 Compound Document Format'
    version : 
    mime    : 
    class   : 'Text (Structured)'
    basis   : 'byte match at 0, 30'
    warning : 

Hmm, we know it is an OLE compound document, but it should identify as a Picture It! file as PRONOM has defined a PUID for the format. fmt/936 has been defined as “Microsoft Picture It! Image File 1”. So I am not sure why this file from version 1 is not identifying correctly. Let’s take a look. The PRONOM container signature for fmt/936 is looking for this:

    <ContainerSignature Id="17015" ContainerType="OLE2">
      <Description>Microsoft Picture It! Image File</Description>
      <Files>
        <File>
          <Path>CompObj</Path>
          <BinarySignatures>
            <InternalSignatureCollection>
              <InternalSignature ID="17015">
                <ByteSequence Reference="BOFoffset">
                  <SubSequence Position="1" SubSeqMinOffset="32"
                               SubSeqMaxOffset="32">
                    <Sequence>'Microsoft Picture It! version 1 Picture'</Sequence>
                  </SubSequence>
                </ByteSequence>
              </InternalSignature>
            </InternalSignatureCollection>
          </BinarySignatures>
        </File>
      </Files>
    </ContainerSignature>

The container signature is looking into the OLE container for the “CompObj” file (which seems to be required), then looks for the string “Microsoft Picture It! version 1 Picture” starting at the 32nd byte. That is pretty specific. The sample file I am using as an example has the following string of bytes.

hexdump -C PictureIt1-s02/\[1\]CompObj 
00000000  01 00 fe ff 03 0a 00 00  ff ff ff ff 00 68 61 56  |.............haV|
00000010  54 c1 ce 11 85 53 00 aa  00 a1 f9 5b 1e 00 00 00  |T....S.....[....|
00000020  4d 69 63 72 6f 73 6f 66  74 20 50 69 63 74 75 72  |Microsoft Pictur|
00000030  65 20 49 74 21 20 50 69  63 74 75 72 65 00 27 00  |e It! Picture.'.|
00000040  00 00 7b 35 36 36 31 36  38 30 30 2d 43 31 35 34  |..{56616800-C154|
00000050  2d 31 31 43 45 2d 38 35  35 33 2d 30 30 41 41 30  |-11CE-8553-00AA0|
00000060  30 41 31 46 39 35 42 7d  00 13 00 00 00 50 69 63  |0A1F95B}.....Pic|
00000070  74 75 72 65 49 74 21 2e  50 69 63 74 75 72 65 00  |tureIt!.Picture.|

Ok, so this sample has a similar string but is missing the “version 1” text. It seems the samples used to created the PRONOM signature was working off samples which included the version 1 in the header of CompObj. Maybe when Microsoft learned they would be making a version 2, they decided a version number should be included going forward. Let’s take a look a file from version 2 to compare:

hexdump -C PictureIt2-s01/\[1\]CompObj 
00000000  01 00 fe ff 03 0a 00 00  ff ff ff ff 50 28 72 2d  |............P(r-|
00000010  4b 8c d0 11 a9 6f 00 a0  c9 05 41 0d 28 00 00 00  |K....o....A.(...|
00000020  4d 69 63 72 6f 73 6f 66  74 20 50 69 63 74 75 72  |Microsoft Pictur|
00000030  65 20 49 74 21 20 76 65  72 73 69 6f 6e 20 32 20  |e It! version 2 |
00000040  50 69 63 74 75 72 65 00  27 00 00 00 7b 32 44 37  |Picture.'...{2D7|
00000050  32 32 38 35 30 2d 38 43  34 42 2d 31 31 44 30 2d  |22850-8C4B-11D0-|
00000060  41 39 36 46 2d 30 30 41  30 43 39 30 35 34 31 30  |A96F-00A0C905410|
00000070  44 7d 00 f4 39 b2 71 50  00 00 00 4d 00 69 00 63  |D}..9.qP...M.i.c|

Ok, so it looks like they did update the version string for version 2. This file also does not identify correctly. A quick look at the wikipedia page for Microsoft Picture It! tells us they continued to release the software until version 10. Is there a different string for each version?

Diving into this and gathering many samples has brought a lot of variants to surface. Let’s see if we can list all the CompObj header variants.

Version 1 samples:
Picture It! Picture'{56616800-C154-11CE-8553-00AA00A1F95B}
Microsoft Picture It! Picture'{56616800-C154-11CE-8553-00AA00A1F95B}
Microsoft Picture It! version 1 Picture'{56616800-C154-11CE-8553-00AA00A1F95B}
Picture It! Collage'{56616800-C154-11CE-8553-00AA00A1F95B}

Version 2 samples:
Microsoft Picture It! version 2 Picture'{2D722850-8C4B-11D0-A96F-00A0C905410D}

Version 3 samples:
Microsoft Picture It! version 3 Picture'{18B8D020-B4FD-11D0-A97E-00A0C905410D}

Version 4 samples:
Microsoft Picture It! version 4 Picture'{18B8D020-B4FD-11D0-A97E-00A0C905410D}

PhotoDraw version 1 samples:
Microsoft PhotoDraw version 1 Picture'{18B8D020-B4FD-11D0-A97E-00A0C905410D}

PhotoDraw version 2 samples:
Microsoft PhotoDraw version 2 Picture'{18B8D021-B4FD-11D0-A97E-00A0C905410D}

FlashPix samples:
FlashPix Object({56616000-C154-11CE-8553-00AA00A1F95B}
FlashPix Object({56616800-C154-11CE-8553-00AA00A1F95B}
Picture It! FlashPix'{56616700-C154-11CE-8553-00AA00A1F95B}
LPI FlashPix'{56616700-c154-11ce-8553-00aa00a1f95b}
FlashPix_Object'{56616700-C154-11CE-8553-00AA00A1F95B}
'{56616700-C154-11CE-8553-00AA00A1F95B}
Picture It!'{56616700-c154-11ce-8553-00aa00a1f95b}
Flashpix Toolkit Application'{56616700-c154-11ce-0000-000000000000}

Ok, there is a lot to discuss here. First of all, it seems MIX was only used in Picture It! until version 5 (2001), then the Picture It! software used a new format, PNG Plus to store the layered stacks. More on that in a future post! Although some later versions seems to be able to open the older MIX format. Version 4 of the MIX format seems to be the last as the 2001 software had only version 4 files on it. Probably safe to say only the 4 versions are needed for identification.

You may notice the additional unique identifier I included in each format. This is called a Class ID for the OLE format, which A LOT of formats use. Each “format” has a unique ID associated with it to help distinguish it from other formats. This Unique ID could possibly be a better solution for identification. It does cross over with the PhotoDraw format, but the FlashPix format seems to have a unique ID. With all the variations in the version 1 strings, the ID remains the same. For version 3 and 4 the ID is the same, which could mean they are interchangeable. It is also the same as PhotoDraw version 1. Not to complicate things.

So it seems in order to get proper identification of these similar formats we need to:

  • Clean up version 1 identification for fmt/936
  • Add a signature for 2, 3, and 4
  • Add a version 2 signature for the PhotoDraw format
  • Add some additional signature variations for the FlashPix format.

The Class ID’s could be used to distinguish different versions and formats, but many of the ID’s are identical, this could mean they are the same format. But for now we can just add the additional variation strings and it should identify everything for now. The FlashPix format needs more research as there is so many different variations and it’s so close to the MIX format. Take a look at my GitHub submission, maybe you have some additional variations to add?

Adobe Acrobat Capture

During the recent PRONOM Research Week, I noticed a file format with no description and no signature.

x-fmt/217Adobe ACD

All I had to go on was it was an Adobe format and the acronym “ACD”. One of the first results that came up in a google search was a post in the Adobe forums with someone asking what to do with some old ACD and ACI files they found on a disc, circa 2000, labeled “Adobe Capture”. The only thing I remember about Adobe Capture was some scanning tools related to Adobe Acrobat, but I didn’t remember coming across any ACD files related to Acrobat.

Initially it wasn’t easy to find more information on this format. Eventually I was able to narrow it down to stand-alone software adobe released called “Adobe Acrobat Capture”. Originally released in 1995 it was eventually discontinued in 2010. The software was marketed under the ePaper name and connected to Acrobat through the creation of a PDF from scanned images. The software was compatible with many scanner models and would process the scanned images, run Optical Character recognition, and export to a searchable PDF. These tools are built into Adobe Acrobat today.

One of the reasons the software was being so elusive is the fact it was sold with a high price tag and required the use of a hardware key, or dongle, in order to process scans. The hardware key also managed the type of license you purchased which may limit the number of pages you are allowed to scan within a certain period of time. So the software is very difficult to run today, if you do happen to find a copy out there in Internet land.

In order to document these file formats for preservation purposes I needed to find some samples. I was excited to find a demonstration CD on the Internet Archive, but unfortunately it contained no examples of the ACD file format.

A little sleuthing on the Wayback Machine helped me find a few user guides and brochures. I was also able to find there was three versions of Adobe Acrobat Capture. In a Product Brochure, you can see a screenshot of the software with a document open with the ACD extension.

If you are OCD like me you might have noticed the window in this screenshot is typical of the older Windows 3.1 or Windows NT system. So this was indeed an older product released by Adobe.

The Adobe Acrobat Capture 3.0 Demonstration CD-ROM from the Internet Archive luckily has a UserGuide PDF on the disc and was able to help me understand the ACD format a little more.

Looks like the ACD format is an intermediate format used by the software to manage the process between scanning and export to PDF. ACD was also defined as an “Acrobat Capture Document” which makes sense. They were also mentioned as being “multipage files in Acrobat Capture Document (ACD)”. The UserGuide also mentioned an ACP format which it referenced as “one-page files are in Acrobat Capture Page (ACP) format.” So more research is needed.

Lets start with Adobe Acrobat Capture 2.0 as I managed to get a few samples from an installer I found. Here is a hexdump of an ACD file and its corresponding ACI file.

hexdump -C CONTRACT.ACD | head
00000000  02 04 47 47 c9 00 86 b5  01 00 b6 27 02 00 01 00  |..GG.......'....|
00000010  f5 00 5e 00 3b 96 02 00  01 6e 63 6a 00 00 88 68  |..^.;....ncj...h|
00000020  00 00 26 00 44 3a 5c 43  4f 44 45 5c 47 47 5c 50  |..&.D:\CODE\GG\P|
00000030  52 4f 44 55 43 54 2e 33  32 53 5c 49 4e 5c 63 6f  |RODUCT.32S\IN\co|
00000040  6e 74 72 61 63 74 2e 61  63 69 00 00 00 00 00 00  |ntract.aci......|
00000050  7c 33 c0 27 00 40 ff ff  ff 00 03 00 03 00 00 00  ||3.'.@..........|
00000060  00 00 00 00 00 00 40 00  00 00 00 00 00 03 00 00  |......@.........|
00000070  00 00 00 00 00 00 00 40  00 00 00 00 09 00 0a ab  |.......@........|
00000080  04 0b 14 b5 04 39 19 00  40 00 00 00 00 0c 14 b0  |.....9..@.......|
00000090  04 38 19 b0 04 08 00 0a  7f 06 d3 11 89 06 39 17  |.8............9.|

hexdump -C CONTRACT.ACI | head
00000000  49 49 2a 00 b3 0c 02 00  35 80 78 a0 80 35 c0 78  |II*.....5.x..5.x|
00000010  a4 80 35 40 3c 54 40 01  e2 b2 01 e2 b2 01 e2 b2  |..5@<T@.........|
00000020  01 e2 b2 01 e2 b2 01 e2  b2 01 e2 b2 01 e2 b2 01  |................|
00000030  e2 b2 01 e2 b2 01 e2 b2  01 e2 b2 01 e2 b2 01 e2  |................|
00000040  b2 01 e2 b2 01 e2 b2 01  e2 b2 01 e2 b2 01 e2 b2  |................|
00000050  01 e2 b2 01 e2 b2 01 e2  b2 01 e2 b2 01 e2 b2 01  |................|
00000060  e2 b2 01 e2 b2 01 e2 b2  01 e2 b2 01 e2 b2 01 e2  |................|
00000070  b2 01 e2 b2 01 e2 b2 01  e2 b2 01 e2 b2 01 e2 b2  |................|
00000080  01 e2 b2 01 e0 b0 01 e0  b0 01 e0 b0 01 e0 b0 01  |................|
00000090  e0 b0 01 e0 b0 01 e0 b0  01 e0 b0 01 e0 b0 01 e0  |................|

The ACD file is unique, PRONOM and even TrID was unaware of the format. But to the keen observer, the ACI format is very recognizable. You may have seen this header before:

Lets take a closer look at an ACI file to see if they are a true TIFF image or if there is any customization to the format.

tiffinfo CONTRACT.ACI 
=== TIFF directory 0 ===
TIFF Directory at offset 0x20cb3 (134323)
  Subfile Type: (0 = 0x0)
  Image Width: 2544 Image Length: 3295
  Resolution: 300, 300
  Bits/Sample: 1
  Compression Scheme: CCITT RLE
  Photometric Interpretation: min-is-white
  Samples/Pixel: 1
  Rows/Strip: 32
  Planar Configuration: single image plane
  Software: HALO Desktop Imager

exiftool -D CONTRACT.ACI 
    - ExifTool Version Number         : 12.60
    - File Name                       : CONTRACT.ACI
    - Directory                       : TUTORIAL/SAMPOUT
    - File Size                       : 134 kB
    - File Modification Date/Time     : 1995:07:10 16:02:08-06:00
    - File Access Date/Time           : 2023:11:14 15:41:02-07:00
    - File Inode Change Date/Time     : 2023:11:08 08:34:18-07:00
    - File Permissions                : -rwxrwxrwx
    - File Type                       : TIFF
    - File Type Extension             : tif
    - MIME Type                       : image/tiff
    - Exif Byte Order                 : Little-endian (Intel, II)
  254 Subfile Type                    : Full-resolution image
  256 Image Width                     : 2544
  257 Image Height                    : 3295
  258 Bits Per Sample                 : 1
  259 Compression                     : CCITT 1D
  262 Photometric Interpretation      : WhiteIsZero
  273 Strip Offsets                   : (Binary data 625 bytes, use -b option to extract)
  277 Samples Per Pixel               : 1
  278 Rows Per Strip                  : 32
  279 Strip Byte Counts               : (Binary data 448 bytes, use -b option to extract)
  282 X Resolution                    : 300
  283 Y Resolution                    : 300
  305 Software                        : HALO Desktop Imager
    - Image Size                      : 2544x3295
    - Megapixels                      : 8.4

Looks like a true TIFF image with no special tags or unique properties. They are 1-bit TIFF’s compressed with CCITT RLE. Not sure there would be any need to create a special signature for these ACI files.

Looking closer at the ACD file format, we can see they reference ACI files, so probably safe to assume the ACD file doesn’t contain the full raster data for each image:

hexdump -C Report.acd
00000000  02 04 47 47 c9 00 9a 8b  00 00 d4 ce 00 00 03 00  |..GG............|
00000010  f5 02 5f 00 00 61 01 00  01 6e 63 6a 01 00 30 5f  |.._..a...ncj..0_|
00000020  00 00 27 00 63 3a 5c 63  61 70 74 75 72 65 32 5c  |..'.c:\capture2\|
00000030  73 61 6d 70 6c 65 73 5c  6f 75 74 5c 52 65 70 6f  |samples\out\Repo|
00000040  72 74 5f 30 30 30 31 2e  61 63 69 00 00 01 00 00  |rt_0001.aci.....|
00000050  00 00 00 00 00 00 00 00  00 00 e8 03 00 00 01 00  |................|
00000060  01 00 00 00 00 00 00 00  00 00 08 00 52 65 70 6f  |............Repo|
00000070  72 74 30 31 00 00 00 00  70 33 d8 27 00 40 ff ff  |rt01....p3.'.@..|
*
00005f40  07 00 40 6f 00 09 00 40  01 6e 63 6a 02 00 52 2c  |..@o...@.ncj..R,|
00005f50  00 00 27 00 63 3a 5c 63  61 70 74 75 72 65 32 5c  |..'.c:\capture2\|
00005f60  73 61 6d 70 6c 65 73 5c  6f 75 74 5c 52 65 70 6f  |samples\out\Repo|
00005f70  72 74 5f 30 30 30 32 2e  61 63 69 00 00 00 00 00  |rt_0002.aci.....|
00005f80  00 00 00 00 4e 0c fe ff  ff ff e8 03 00 00 01 00  |....N...........|
00005f90  01 00 00 00 00 00 00 00  00 00 08 00 52 65 70 6f  |............Repo|
00005fa0  72 74 30 32 00 00 00 00  4c 31 f0 27 00 40 ff ff  |rt02....L1.'.@..|

From the limited sample set I have access, all the ACD files begin with the same Hex values, “02044747C900”. Along with the common header we can assume there should be at least one ACI file referenced in the first part of the file. Because it is referenced as a filepath, the ACI string would be variable in its offset.

Adobe Acrobat Capture 3.0 turns out to be a different format. But looks familiar………

hexdump -C Contract.acd | head
00000000  50 4b 03 04 14 00 00 00  08 00 3b ba 6e 57 23 9d  |PK........;.nW#.|
00000010  8e b8 3d 00 00 00 3e 00  00 00 09 00 40 00 46 49  |..=...>.....@.FI|
00000020  4c 45 53 2e 4c 53 54 0a  00 20 00 00 00 00 00 00  |LES.LST.. ......|
00000030  00 00 00 80 e6 e9 ca 50  17 da 01 80 e6 e9 ca 50  |.......P.......P|
00000040  17 da 01 80 e6 e9 ca 50  17 da 01 4e 55 18 00 4e  |.......P...NU..N|
00000050  55 43 58 09 00 46 00 49  00 4c 00 45 00 53 00 2e  |UCX..F.I.L.E.S..|
00000060  00 4c 00 53 00 54 00 8b  76 74 76 31 8c e5 e5 f2  |.L.S.T..vtv1....|
00000070  0c 76 f6 f7 0d f0 0f f6  0c 71 b5 0d 09 0a 75 e5  |.v.......q....u.|
00000080  e5 f2 0b f5 75 f3 f4 71  0d b6 35 e4 e5 02 31 fc  |....u..q..5...1.|
00000090  1c 7d 5d 0d 6d 9d f3 f3  4a 8a 12 93 4b f4 12 93  |.}].m...J...K...|

sf Contract.acd 
---
siegfried   : 1.10.1
scandate    : 2023-11-15T09:10:01-07:00
signature   : default.sig
created     : 2023-10-11T15:10:17-06:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V114.xml; container-signature-20230822.xml'
---
filename : 'Contract.acd'
filesize : 79002
modified : 2023-11-14T23:17:53-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/263'
    format  : 'ZIP Format'
    version : 
    mime    : 'application/zip'
    basis   : 'byte match at [[0 4] [78886 3] [78980 4]]'
    warning : 'extension mismatch'

Yep, its a zip container file. lets take a peek inside to see what it is composed of.

7z l Contract.acd 
--
Path = Contract.acd
Type = zip
Physical Size = 79002

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2023-11-14 23:17:54 ....A           62           61  FILES.LST
2023-11-14 23:17:54 ....A          410          226  Contract.acd
2023-11-14 23:17:52 ....A       150213        78093  Contract.acp
------------------- ----- ------------ ------------  ------------------------
2023-11-14 23:17:54             150685        78380  3 files

The the Contract ACD file is like a nesting doll, an ACD within an ACD. Lets see what the ACD and ACP is made of.

hexdump -C Contract.acd | head
00000000  00 01 00 00 00 02 04 47  47 2d 01 9a 01 00 00 02  |.......GG-......|
00000010  00 00 00 02 00 01 01 00  00 00 01 00 00 00 04 04  |................|
00000020  00 00 00 09 00 57 69 6e  67 64 69 6e 67 73 05 00  |.....Wingdings..|
00000030  41 72 69 61 6c 0b 00 43  6f 75 72 69 65 72 20 4e  |Arial..Courier N|
00000040  65 77 0f 00 54 69 6d 65  73 20 4e 65 77 20 52 6f  |ew..Times New Ro|
00000050  6d 61 6e 05 01 00 00 00  02 00 00 00 78 01 00 00  |man.........x...|
00000060  0f 00 54 69 6d 65 73 20  4e 65 77 20 52 6f 6d 61  |..Times New Roma|
00000070  6e 00 00 00 20 0b 00 00  c0 0a 00 00 00 00 00 00  |n... ...........|
00000080  00 06 00 00 00 0f 00 54  69 6d 65 73 20 4e 65 77  |.......Times New|
00000090  20 52 6f 6d 61 6e 00 00  00 20 0c 00 00 00 0c 00  | Roman... ......|

hexdump -C Contract.acp | head
00000000  25 50 44 46 2d 31 2e 33  0d 25 e2 e3 cf d3 0d 0a  |%PDF-1.3.%......|
00000010  31 20 30 20 6f 62 6a 0d  3c 3c 20 0d 2f 54 79 70  |1 0 obj.<< ./Typ|
00000020  65 20 2f 43 61 74 61 6c  6f 67 20 0d 2f 50 61 67  |e /Catalog ./Pag|
00000030  65 73 20 32 20 30 20 52  20 0d 2f 53 74 72 75 63  |es 2 0 R ./Struc|
00000040  74 54 72 65 65 52 6f 6f  74 20 34 20 30 20 52 20  |tTreeRoot 4 0 R |
00000050  0d 2f 43 41 50 54 5f 49  6e 66 6f 20 3c 3c 20 2f  |./CAPT_Info << /|
00000060  56 20 33 30 31 20 2f 46  53 20 5b 20 28 57 69 6e  |V 301 /FS [ (Win|
00000070  67 64 69 6e 67 73 29 28  41 72 69 61 6c 29 28 43  |gdings)(Arial)(C|
00000080  6f 75 72 69 65 72 20 4e  65 77 29 28 54 69 6d 65  |ourier New)(Time|
00000090  73 20 4e 65 77 20 52 6f  6d 61 6e 29 5d 20 2f 4c  |s New Roman)] /L|

The ACD has some of the same hex values as the previous version, but with some extra bytes at the beginning and it looks like the ACP is a straight up PDF. But may have some interesting tags, like “CAPT_info”.

The problem we will face when trying to write a signature for this version of ACD is the container signature needs a static file name to reference, and it appears the name of the container is also the name of the ACD file within the container. So every file will be different. I wish there was a way in the PRONOM signature syntax to reference an extension and ignore the filename, but currently there no method to do this. The only thing inside the container which seems to be consistent is the file “FILES.LST”. So lets take a peek inside if it.

hexdump -C FILES.LST | head
00000000  5b 41 43 44 31 5d 0d 0a  49 53 43 4f 4d 50 4f 53  |[ACD1]..ISCOMPOS|
00000010  49 54 45 3d 54 52 55 45  0d 0a 4e 55 4d 46 49 4c  |ITE=TRUE..NUMFIL|
00000020  45 53 3d 31 0d 0a 46 49  4c 45 4e 41 4d 45 31 3d  |ES=1..FILENAME1=|
00000030  43 6f 6e 74 72 61 63 74  2e 61 63 70 0d 0a        |Contract.acp..|

Ok, there seems to be some static information that is unique to the ACD format. I bet the string “[ACD1]” would be sufficient enough to make a solid signature.

This is a good format example of a limited amount of information on the file format used by a well known company which has become obsolete and disappeared. Take a look at my signatures, maybe you have some old ACD files you were unaware of!

Composite File Management System

In honor of World Digital Preservation Day, I wanted to write a little about format headers, the magic that makes some files more easily identifiable than others.

When it comes to binary file formats, some developers decide to make the format clearly identifiable in a header and others choose to make it ambiguous. Others have a little fun with leaving little clues and references to popular culture.

A couple of my favorites based on their header.

A couple of my current least favorites:

Like I said some developers make it very obvious what software created the file format and others seem to make things difficult. I understand there is a need to optimize files to keep them from getting bloated and taking up too much space, but many of the size limits from the early days of computing are not an issue anymore. Can’t we be more clear when designing a file format?

Today I want to document one format which was very easy to identify as it spelled out its format very verbosely, but because of the lack of additional documentation makes it very hard to preserve.

Meet the Composite File Management System file format:

hexdump -C sample.br4
00000000  43 43 6d 46 20 2d 20 55  6e 69 76 65 72 73 61 6c  |CCmF - Universal|
00000010  20 2d 20 41 78 69 6f 6d  20 2d 20 41 47 50 20 2d  | - Axiom - AGP -|
00000020  20 43 6f 6d 70 6f 73 69  74 65 20 46 69 6c 65 20  | Composite File |
00000030  4d 61 6e 61 67 65 6d 65  6e 74 20 53 79 73 74 65  |Management Syste|
00000040  6d 20 28 55 6e 69 76 65  72 73 61 6c 29 20 2d 20  |m (Universal) - |
00000050  43 72 65 61 74 65 64 20  62 79 20 41 6e 64 72 65  |Created by Andre|
00000060  61 20 50 65 73 73 69 6e  6f 2c 20 44 65 63 65 6d  |a Pessino, Decem|
00000070  62 65 72 20 31 39 39 35  20 28 76 65 72 73 2e 20  |ber 1995 (vers. |
00000080  35 29 20 2d 20 43 6f 70  79 72 69 67 68 74 28 63  |5) - Copyright(c|
00000090  29 20 31 39 39 35 2d 39  36 20 62 79 20 4d 65 74  |) 1995-96 by Met|
000000a0  61 54 6f 6f 6c 73 2c 20  49 6e 63 2e 20 2d 20 50  |aTools, Inc. - P|
000000b0  72 6f 75 64 6c 79 20 6d  61 64 65 20 69 6e 20 74  |roudly made in t|
000000c0  68 65 20 55 53 41 2c 20  6c 61 6e 64 20 6f 66 20  |he USA, land of |
000000d0  74 68 65 20 66 72 65 65  2c 20 68 6f 6d 65 20 6f  |the free, home o|
000000e0  66 20 74 68 65 20 62 72  61 76 65 2e 00 00 00 00  |f the brave.....|
000000f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

Where to start? First off, this is the Bryce 4 file format. Bryce was a 3D modeling, animation software developed by MetaTools, later MetaCreations. Metacreations was also the developer of popular software Ray Dream Studio/Infini DFractal Design Painter, and Kai’s Power Tools.

Secondly, this format refers to a Universal File Management System or CCmF, which I have found to be the file format for many other extensions, some of which are .goo, .brc, .br3, .br4, .br5, .sfp, .shp, .obp. It doesn’t always have the verbose header, some of them have the following:

hexdump -C Tutorial.obp | head
00000000  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
*
00000050  20 20 20 20 20 20 20 20  20 20 20 20 20 20 43 43  |              CC|
00000060  6d 46 69 6c 65 3a 3a 6b  49 64 65 6e 74 69 66 79  |mFile::kIdentify|
00000070  34 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |4               |
00000080  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |

Different, but still contains the CCmF identification string. Others have the verbose header, but further down inside the file.

With this format being used with so many well known software titles, I assumed information on the format would we readily available. Alas, not so much. The format even had the name of the creator! “Created by Andrea Pessino, December 1995”. So I reached out. He was on Twitter and I asked about the file format and if there was any documentation available. Twitter (X) has since deleted his responses after he deleted his account, but he told me he wasn’t sure where the documentation might be. One other developer also commented and confirmed they didn’t know where any of the documentation went after they left.

MetaCreations sold Bryce to Corel in 2000, then in 2004 sold it to Daz3D, the current owners. It’s not actively developed anymore being that it was never made into a 64bit application. A blog post explains the format a little more, but concludes it is a secret known only to Daz.

It seems there is a community who would like to see Bryce more open, maybe even open-sourced. This thread discusses the format and the underlying Axiom format used.

The creator Andrea Pessino was able to track down some documentation on the CCmF file structure for me. He explained Axiom was an entire codebase for all MetaTools/Creations applications and plugins. So the CCmF system was more than a file format. The documentation included some information on versioning of a CCmF.

There seems to be a few versions of the CCmF file structure.

  • CCmFile::kIdentify which corresponds with December 1995 (vers. 5)
  • CCmFile::kIdentify2 which corresponds with March 1997 (vers. 7)
  • CCmFile::kIdentify3 which corresponds with October 1998 (vers. 9)
  • CCmFile::kDfFormat which is a Generic Composite File

The documentation given to me was up to date for 1998, but after Corel purchased Bryce there was some updates made as many material files have the identifier “CCmFile::kIdentify4“.

Bryce 6 & 7 were released by Daz3D and have a different file header. They have the extension .BR6 & .BR7 with the header:

hexdump -C Bryce7-s01.br7 | head  
00000000  42 72 79 63 65 5f 36 2e  30 5f 46 69 6c 65 00 00  |Bryce_6.0_File..|
00000010  11 00 00 00 d4 07 00 00  00 20 00 00 e5 07 00 00  |......... ......|
00000020  00 0a 00 00 00 10 00 00  00 08 78 9c 63 64 60 60  |..........x.cd``|
00000030  60 04 e2 8c cc f4 0c 85  e4 9c fc d2 14 85 92 d4  |`...............|
00000040  8a 92 d2 a2 54 86 11 05  18 a1 18 04 82 76 c8 b5  |....T........v..|
00000050  be 0e 7c 60 8f 4e 93 67  f2 07 32 f5 d1 0e 30 31  |..|`.N.g..2...01|
00000060  40 fc ca 0c c5 60 bf 33  a2 ab da e2 8c c0 70 e0  |@....`.3......p.|
00000070  00 22 58 a0 9c ff 2a 40  fc bf 16 88 ff c3 c3 2e  |."X...*@........|
00000080  13 64 20 83 82 13 50 29  50 ad 17 50 ef 3c 20 ce  |.d ...P)P..P.< .|
00000090  72 66 64 86 19 31 cd 09  42 57 b9 80 71 43 9d 0b  |rfd..1..BW..qC..|

I still need to gather more samples from the various extensions related to this format and the software related to them. More work to do understanding the different uses of the short CCmFile string and the more detailed header and the differences between objects, materials, and models. When I asked Andrea why he used such a verbose file header, his answer was basically, why not!

Apple Mail

There really is no “Macintosh Format”, but there sure are a lot of formats you only find on the MacOS. From Resource Forks and iWork formats to unique sound formats, MacOS has them all! Majority of cross-platform software vendors have done a much better job in recent years in making their file formats the same across platforms, but for Apple, they love to make things unique, just for their platform.

Take EMLX for example. Seems to be a trend to add “X” to the end of an older format to breath new life into it. The EML format, or Electronic Mail, has existed for a few decades now, but in 2005 Apple updated their Apple Mail application to use a new format, EMLX.

As far as I know, Apple hasn’t released any documentation on the EMLX format, but many folks out there have asked the question and have been able to “reverse engineer” the format. Lets take a look.

An EMLX file consists of three parts:

  • bytecount on first line;
  • email content in MIME format (headers, body, attachments);
  • Apple property list (plist) with metadata.

The bytecount is a variable number which consists of the total bytes starting from the start of the MIME format, including HTML, to the start of the XML property list. Lets look at a simple EMLX.

The byte count is on line 1 with the MIME email (EML) taking up the 556 bytes, then the XML plist at the end. You may ask, what is a plist? Well, it is another Apple (originally NextStep) invention which is embedded throughout the MacOS operating system. A Plist is usually an XML with keys but can also be in a binary format. The Plist can contain properties of the email within Apple Mail like special color flags, tagged as junk, date received and last reviewed.

If you do happen across an EMLX file or group of them, there are a few tools you can use to convert them to a plain old EML. There are python libraries or many other tools to do the job.

But first we need to be sure of identification beyond the extension. Adding this file format to PRONOM would help in identification for preservation purposes. If ran through PRONOM today we get:

filename : '9.emlx'
filesize : 18582
modified : 2023-10-26T22:16:25-06:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/950'
    format  : 'MIME Email'
    version : '1.0'
    mime    : 'message/rfc822'
    class   : 'Text (Structured)'
    basis   : 'byte match at [[31 17] [599 4] [339 6] [426 6] [90 14]]'
    warning : 'extension mismatch'

Because the format has a EML plain text format within its structure, it is assumed to be an EML file. While technically accurate, Identifying as a unique EMLX format would be beneficial in a preservation system so you can properly assign risk and choose the right tool to parse or migrate.

In looking at the three parts of an EMLX format, we know the EML file is not a good way to show the difference as they are the same structure. The byte count on the first line is variable, so there is no static byte sequence to use for identification. That leaves the Plist section at the end to distinguish the difference.

The PRONOM entry for a Plist looks for the typical XML strings present in most XML files, but then uses the root element “<plist version=”1.0″>” for identification. We could combine the existing EML signature and the Plist signature to identify an EMLX, or just take the existing EML signature and put in a small byte sequence for the closing of the </plist> tag near the EOF? There would be a need for a priority over EML, both would essentially accomplish the same thing.

Take a look at latter idea on my GitHub page and tell me which makes the most sense.

No bad deed….

I had access to my first Macintosh computer around 1987. My father brought it home and I spent hours on it playing games and occasionally writing reports for school. The Macintosh Plus computer had one floppy drive and no hard drive. I remember playing the game Orbiter which had two floppy disks and right in the middle of game play it would pause and ask me to insert disk 2, then quickly ask for disk 1 again. The struggle was real. I spent years using many different Macintosh computers and now own more than I wish to admit. I’m preserving them!

The wild world of digital preservation has been a little lacking on the Macintosh side of things as I have come to realize. There still not a great way to manage Resource Forks in many preservation systems and the identification tools are mainly focused on the data bytetreams and not any system specific attributes Macintosh used often.

The PRONOM registry has either referenced early Macintosh specific formats or missed them entirely so I have been slowly working on a few to close that gap.

Interestingly enough, many Microsoft programs initially made their GUI debuts on the early Macintosh before making their way to Windows. Excel is one I am working on, as Version 1 is not identifiable in PRONOM, it was Macintosh only at the time.

Another is PowerPoint, I recently submitted two new signatures to PRONOM.

fmt/1747: Microsoft PowerPoint Presentation v2.x. Full entry added.
fmt/1748: Microsoft PowerPoint Presentation v3.x. Full entry added.
fmt/1866: Microsoft Powerpoint for Macintosh v.2. Full entry added.
fmt/1867: Microsoft Powerpoint for Macintosh v.3. Full entry added.

PowerPoint was initially released in 1987 on the Macintosh platform. It was developed by a company called ForeThought. Version 1.0 on the Macintosh was under this name, until it was bought by Microsoft only three months after being released. The history of PowerPoint can be discovered at Robert Gaskins, one of the original developers, website and book he wrote. The available information provided by Microsoft is only for the OLE format, covering versions 4.0 until 2003.

So, lets take a look at the Powerpoint original file format, before OLE.

   Type/Creator      RF      DF  Date         Filename
f  SLDS/PPNT         0       932 Oct 10 19:10 PowerPoint-v1

Luckily the early PowerPoint files did not have a Resource Fork. The Data Fork, if you haven’t noticed, has an interesting set of hex values at the beginning of the file. 0BADDEED is the first 4 bytes. If we look at a PowerPoint version 2 file from Windows.

The file format is the same, but because of the weird world of endianness, the first few bytes are in reverse order, EDDEAD0B.

Obviously we need to discuss this magic number and the meaning behind “Bad Deed”. This question was asked previously by the digital preservation community. I have a previous blog post about the use of words for the magic number CAFEBEEF as it was used with with JAVA class files and Express Publisher in the 1990’s. BADDEED looks like another clever use of the hex values that formed words. But was there a story behind the words? Joe Carrano asked if this string might be hexspeak. I wanted to know more so I asked some one who might know.

Robert Gaskins was kind enough to chat with me for a bit about the early days of PowerPoint.

I had a theory on the possible meaning behind BADDEED, so I asked him what the feeling was like between Apple and Microsoft at the time. I had heard for years that PowerPoint was originally created for the Macintosh, but Robert informed me:

  In fact, PowerPoint was designed first for Microsoft Windows, 

and its first spec shows that: “All the screen shots, menus, and 

dialogs were set up to look like Microsoft Windows, not like 

Macintosh.”  (Gaskins, Sweating Bullets, p. 92)  You can see that 

spec here.

A year later, we concluded that we would be forced to ship 

on Mac first, although we still thought that Windows was the 

big opportunity and thought that Mac was risky.  “We just didn’t think 

we could successfully ship a product for Windows, yet, though we planned 

to later. (Gaskins, Sweating Bullets, p. 105)  The considerations are 

summarized in my June 1986 product marketing document.

Of course, we turned out to have been right all along.  PowerPoint on 

Mac was much loved, but sales remained poor because Mac sales were 

so poor.  It was only after we shipped on Windows that PowerPoint gained 

the dominant market share which has characterized it ever since, and 

Windows PPT outsold Mac PPT very quickly. (Gaskins, Sweating Bullets, p. 403)

So my original thought was that there was some bad feelings around this Apple, Microsoft battle which has been the sentiment for quite some time. So when I asked if any of that influenced the use of BADDEED, I was told:

So, far from being disgruntled by expanding PowerPoint to Windows, 

that had been our goal all along, and its achievement was the most 

important success we had.

I judge that you are fully aware of all that, and that 

your question is more, “was there any bad deed signified 

by the Mac hex value chosen?”  No, it was just the poverty 

of choice when you only have six letters.

So there you have it. The use of the hex values 0x0BADDEED, was simply chosen from a limited set of values when looking at words hexadecimal could spell. I guess I should never let the truth get in the way of a good story.

I continued to have a wonderful conversation with Robert and also asked him for some details on the rest of the PowerPoint file format. I was hoping there might be some documentation out there explaining the early format before Microsoft took over. Robert said:

 I don’t know of any such documentation apart from the official 

Microsoft support files available online.  I don’t have any such 

information.  I know that Dennis Austin deposited some of our 

working files at the Computer History Museum (not online):

https://archive.computerhistory.org/resources/access/text/finding-aids/102733943-Austin/102733943-Austin.pdf

and it’s likely that some information is there–if nothing 

else, it claims to contain a source code listing for PPT 1.0 

which would contain the code to read the file format.

So there might be some information in at the Computer History Museum worth looking into.

As far as I could tell from the available online information, there is a few differences between Version 1.0 and Version 2.0, the biggest being the fact that 1.0 did not have an option to print in color, amount a few other minor things. Here is a screenshot of a page from the Microsoft PowerPoint 2.0 documentation on archive.org.

I suppose with the signature additions of the Macintosh and Windows versions 2.0 and 3.0 of the PowerPoint file format in PRONOM, that should cover most needs. Currently my PowerPoint 1.0 files identify at 2.0 files, so I may need to have them adjust the PUID to include both versions 1.0 and 2.0 as they are so similar. If I am able to find a difference or get my hands on the original source code I may find a better solution.

BINHEX

Working with files in todays world is much different than before. Today getting files back and forth from the cloud or through email is relatively easy, unlike the early days when we used FTP sites and needed to encode our data to properly transfer. I remember using an FTP program on my old Mac called Fetch. We had to determine if the content was to be transferred as text or binary.

Picking the right encoding was critical to getting the content transferred correctly, this was even more critical when working with Macintosh files which needed a resource fork and/or finder attributes to work properly. In those cases a MacBinary or BinHex file was required! Fetch would automatically identify those formats and decode them for you.

If you need a refresher on MacBinary and AppleSingle, you can view my iPres 2022 presentation.

One format I didn’t spend much if any time on is the BinHex format. BinHex was a format born out of necessity to move files back and forth across the World Web Web, bulletin boards, AOL, Compuserve, and the like. The FTP program Fetch glossary describes BinHex as:

BinHex (sometimes called BinHex4) is a format for representing a Macintosh file in text form.

The Macintosh file is converted to a series of lines, each made up of letters, numbers, and

punctuation. Because BinHex files are simply text, they can be sent through most electronic mail

systems and stored on most computers. However the conversion to text makes the file larger, so it

takes longer to transmit a file in BinHex format than if the file was represented some other way.

The suffix “.hqx” usually indicates a BinHex format file.

You can still find many of these HQX files floating around the interwebs and on older CDs from the 1990’s. One such CD recently came into my possession. I managed to get a copy of the book “Internet File Formats“, by Tim Kientzie. It came with a CD-ROM with lots of goodies included. Some sample files, specifications, and software. The disc itself is an ISO 9660 partitioned disc, but includes a few Macintosh formats, so the author put many of the software files in the HQX format to maintain the much needed resource fork Macintosh applications need in order to run.

I initially ran the whole disc through DROID to get an idea what was on the disc and if any sample formats were unidentified (something I do regularly), and found majority of the HQX files didn’t identify as they should have to PRONOM PUID x-fmt/416. The signature is an older one, from 2010, but since the format isn’t updated anymore it should be solid. Or so I thought.

Since BINHEX files are encoded as text, lets take a look at a couple of these from the disc which didn’t identify.

The PRONOM signature currently is:

File extension: hqx	
Name	BinHex Binary Text
Description	Header: (This file must be converted with BinHex
Byte sequences	
Position type	Absolute from BOF
Offset	0	 
Value	28546869732066696C65206D75737420626520636F6E76657274656420776974682042696E486578

That “Value” listed in hexadecimal decodes to: “(This file must be converted with BinHex” as listed in the description. We can see this line in the file above, but the signature assumes the value begins at offset 0 from the beginning of the file. So its looking for that value at the start of the file, but this file seems to have some additional text before the value. What does the specs say?

The BinHex 4.0 format was created in 1985 and defined in RFC 1741.

   The whole file is considered as a stream of bits.  This stream will
   be divided in blocks of 6 bits and then converted to one of 64
   characters contained in a table.  The characters in this table have
   been chosen for maximum noise protection.  The format will start
   with a ":" (first character on a line) and end with a ":".
   There will be a maximum of 64 characters on a line.  It must be
   preceded, by this comment, starting in column 1 (it does not start
   in column 1 in this document):

    (This file must be converted with BinHex 4.0)

   Any text before this comment is to be ignored.

   The characters used is:

    !"#$%&'()*+,- 012345689@ABCDEFGHIJKLMNPQRSTUVXYZ[`abcdefhijklmpqr

Ok, so in the specs we can see the “Value” string must be there, but according to the specification, any text before this comment is to be ignored. So adding some instructions and even an email header at the beginning is ok, as long as the value string is there right before the encoded data.

We also learn a couple interesting things. The first character of the first line after the string should be a “:” and the last line should end with a “:” as well. That could help make the signature more solid. We also learn there are a maximum of 64 characters per line. The last line will probably not have full maximum, but the previous lines should…. I wonder if we could use this fixed position from the initial “:” to add even more strength to the signature.

So an updated PRONOM signature might look like:

BOF: {0-4084}28546869732066696C65206D75737420626520636F6E76657274656420776974682042696E486578{6-9}3A

EOF: 3A (Max Offset 64)

Adding the 4,084 bytes at the beginning allow for additional text. This value worked for my samples but there could be others out there with more. The {6-9} bytes in between the string and the colon account for the various way newlines are encoded. Sometimes is one “0A” byte, other times it is “OD”, and others its both. After testing, adding values in the signature to account for the 64 byte line can fail if the file has only one line, so I left it out.

The EOF should just be the colon (3A), but I found many of my samples had various line endings and other random characters. Hence the 64 bytes for max offset.

Also, the current PRONOM entry doesn’t include the Mime-Type, which is: “application/mac-binhex40”

Hopefully this update will add some strength to the signature and follow the specification closer. The new signature even works on files with extra content at the beginning!

This image has an empty alt attribute; its file name is long-binhex-header.png

There are a number of software titles you can use to encode and decode a BinHex file. On a modern Mac, try using The Unarchiver, or Stuffit Expander. From the commandline, you can use the macutil library or the CLI version of Unarchiver. Although the MacOS has a built in utility to decode BinHex files. If you are using a classic version of Macintosh OS, you can find a number of utilities on Macintosh Garden.

Oh, and also, the CD-ROM I mentioned earlier has a few “fun” features. Not sure if they are on purpose or if errors were made during mastering, but a few filenames have some hidden extra characters and one folder puts any tool traversing the directory into a loop, even droid. Have fun!

Apple Package Format

Let’s talk about Apple’s iWork software. Apple’s Office Suite of applications was first released in 2005 and provided a WordProcessor (Pages), Presentations (Keynote), and a little later, Spreadsheet (Numbers). They are exclusive to the Macintosh and iOS devices.

iWork was released in a few different versions. They get a little confusing as each application has its own version which all seemed to unify and stabilize in 2020. Here is a matrix of major versions.

VersionPackage or ZIP
iWork ’05Package
iWork ’06Package
iWork ’08Package
iWork ’09ZIP
iWork 2013Package
iWork 2014ZIP
iWork 2019ZIP
iWork 2020ZIP

You may already be aware but MacOS can sometimes be weird. I use the term weird in a loving, sometimes proud way, but I admit, there was some “odd” choices made in regards to how applications and documents are used and stored files on a Mac.

On early Macintosh computers Apple used an interesting method of storing resources for applications and some file formats. The Resource Fork for an application contained all the “resources” needed to run in the operating system. It would contain all the icons, warning screens, graphics, sounds, etc. This help true until Mac OS X came along and then Apple started using a bundle or package format. Still in use today, what appears to be a single file or application is actually a folder of all the resources needed to run the application.

Show Package Contents

By right clicking or control clicking on the icon you can open the folder and see all the contents which make up the Application.

Directory listing of Pages.app on MacOS

Nifty right? The MacOS which knows which extensions to treat as a package. If you were to move the application over to another system it would be a folder with the extension “.app”.

For an application I can see how this makes sense as it will only execute in the MacOS environment. The problem comes in when you use the same package method for the documents the application creates.

Contents of Pages version 1 sample file.

So instead of a single “file” with a bytestream, you get a folder of files which make up the file format. Here is Apple’s description:

Document Packages

If your document file formats are getting too complex to manage because of several disparate types of data, you might consider adopting a package format for your documents. Document packages give the illusion of a single document to users but provide you with flexibility in how you store the document data internally. Especially if you use several different types of standard data formats, such as JPEG, GIF, or XML, document packages make accessing and managing that data much easier.

Apple actually defines two similar methods:

Although bundles and packages are sometimes referred to interchangeably, they actually represent very distinct concepts:

  • package is any directory that the Finder presents to the user as if it were a single file.
  • bundle is a directory with a standardized hierarchical structure that holds executable code and the resources used by that code.

A couple years ago a processed digital collection made its way down to me. It had been processed by a new digital archivist and when I went to prepare the collection for preservation, I found a folder with the extension .pages and inside the folder a whole directory of files. Many of which they had renamed and arranged. Needless to say, I had to track down the original disk so I could properly preserve the file.

So looking back at the earlier table, iWork switched back and forth between the package format and a ZIP container. For preservation purposes, the ZIP container is easier to maintain outside the MacOS. Lets look a little closer at each. If you would like to follow along I have copied a few samples onto a hybrid ISO.

iWork ’05 through iWork ’08 used the same package format and structure. Because they are a package format, they are difficult to preserve as original files. I suppose you could zip them up, but probably the best option is to open with a current version of Pages and save to the newer ZIP container format.

tree iWork08/Keynote-06.key 
├── Contents
│   └── PkgInfo
├── QuickLook
│   └── Thumbnail.jpg
├── index.apxl.gz
└── theme-files
    ├── Blue 2.jpg
    ├── Blue 2.tif
    ├── Cool Gray-2.jpg
    ├── Cool Gray.tif
    ├── Green-8.jpg
    ├── Green.tif
    ├── Headlines_bullet.pdf
    ├── Headlines_star.pdf
    ├── Orange 2.tif
    ├── Orange_2.jpg
    ├── Purple-6.jpg
    ├── Purple.tif
    ├── Red.jpg
    ├── Red.tif
    ├── endpoints.pdf
    └── headlines_hi-res.jpg

iWork ’09 changed this practice. The documents saved from Pages, Keynote, and Numbers were contained in a ZIP file and can be identified using the PRONOM registry container signatures.

filename : 'iWork 2013/Pages2013-Sample09.pages'
filesize : 105900
modified : 2019-11-21T20:36:00-07:00
matches  :
  - ns      : 'pronom'
    id      : 'fmt/1439'
    format  : 'Apple iWork Pages'
    version : '09'
    class   : 'Word Processor'
    basis   : 'extension match pages; container name index.xml with byte match at 195, 76' 
Sample09.pages
Type = zip
WARNINGS:
Headers Error
Physical Size = 105900

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:36:00 .....       364773        22413  index.xml
2019-11-21 20:36:00 .....         7007         7007  Hardcover_bullet_black.png
2019-11-21 20:36:00 .....        69176        69176  Simple_Noise_2x.jpg
2019-11-21 20:36:00 .....          232          232  buildVersionHistory.plist
2019-11-21 20:36:00 .....         6406         6406  QuickLook/Thumbnail.png
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:36:00             447594       105234  5 files

Then Apple went back to a Package format with iWork 2013. For reasons unknown. But the content and structure changed. Its a package format with a Index.zip instead of index.xml

Pages2013-Sample.pages
├── Data
│   └── Hardcover_bullet_black-13.png
├── Index.zip
├── Metadata
│   ├── BuildVersionHistory.plist
│   ├── DocumentIdentifier
│   └── Properties.plist
├── preview-micro.jpg
├── preview-web.jpg
└── preview.jpg

3 directories, 8 files

The ZIP within the package contains a new Apple format. IWA

Pages2013-Sample.pages/Index.zip
Type = zip
Physical Size = 39361

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:47:14 .....         3860         3860  Index/Document.iwa
2019-11-21 20:47:14 .....           26           26  Index/Tables/DataList.iwa
2019-11-21 20:47:14 .....          336          336  Index/ViewState.iwa
2019-11-21 20:47:14 .....          160          160  Index/CalculationEngine.iwa
2019-11-21 20:47:14 .....          121          121  Index/DocumentStylesheet.iwa
2019-11-21 20:47:14 .....        31931        31931  Index/ThemeStylesheet.iwa
2019-11-21 20:47:14 .....           22           22  Index/AnnotationAuthorStorage.iwa
2019-11-21 20:47:14 .....         1889         1889  Index/Metadata.iwa
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:47:14              38345        38345  8 files

Luckily Apple came to their senses and went back to the ZIP container format for iWork 2014 and later. The container signature looks for the IWA file Apple started using with iWork 2013.

filename : 'iWork 2014/Pages2014-Sample.pages'
filesize : 66256
modified : 2019-11-22T00:03:56-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/1441'
    format  : 'Apple iWork Document'
    version : '14'
    class   : 'Presentation, Spreadsheet, Word Processor'
    basis   : 'extension match pages; container name Index/Document.iwa with byte match at 16, 6; name Metadata/Properties.plist with name only'
Path = iWork 2014/Pages2014-Sample.pages
Type = zip
Physical Size = 66256

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-22 00:03:54 .....         3930         3930  Index/Document.iwa
2019-11-22 00:03:54 .....          364          364  Index/ViewState.iwa
2019-11-22 00:03:54 .....          206          206  Index/CalculationEngine.iwa
2019-11-22 00:03:54 .....        33573        33573  Index/DocumentStylesheet.iwa
2019-11-22 00:03:54 .....           22           22  Index/AnnotationAuthorStorage.iwa
2019-11-22 00:03:54 .....           23           23  Index/DocumentMetadata.iwa
2019-11-22 00:03:54 .....         8761         8761  Index/Metadata.iwa
2019-11-22 00:03:54 .....          322          322  Metadata/Properties.plist
2019-11-22 00:03:54 .....           36           36  Metadata/DocumentIdentifier
2019-11-22 00:03:54 .....          273          273  Metadata/BuildVersionHistory.plist
2019-11-22 00:03:54 .....        14611        14611  preview.jpg
2019-11-22 00:03:54 .....          838          838  preview-micro.jpg
2019-11-22 00:03:54 .....         1571         1571  preview-web.jpg
------------------- ----- ------------ ------------  ------------------------
2019-11-22 00:03:54              64530        64530  13 files

Now iWork was not the only Apple software to use the Package/Bundle format for their documents. Be advised the following software may save to the package format.

I remember a few years ago, Trent Reznor (NIN) decided to release a few of his tracks in the Garageband format. A little harder to find these days, but the good old wayback machine kept a copy for us! Grab them here. Be warned, they may be in the package format. Thanks Apple!

GEDCOM

One of the first PRONOM signatures I submitted was for a format I felt responsible for, considering where I worked. This is the GEDCOM format, which is an acronym for GEnealogical Data COMmunication. At the time I submitted the signature the format hadn’t been updated in years.

Very recently it has seen a renewed interest from those in the Genealogical community. In 2021 the format was renewed with a Version 7 specification with the purpose of simplifying and clarifying the format. In addition a new format was released to handle storing multimedia files in a container called GED-ZIP.

My first attempt at a signature was based on the specification generally, but with the new version released, I thought it might be good to revisit this format and see if we need to make any adjustments. There needs to be a new signature for the GED-ZIP format as well.

The original signature, fmt/851, created for PRONOM is:

302048454144{0-1024}47454443(0D0A|0D|0A)322056455253

It has an offset of 0-3 to account for any Unicode BOM, but starts with “0 HEAD”; this is the required start to a GEDCOM file. The next bits can be a source of the software which created the GEDCOM, using the tag “SOUR” which can also include a version of the software and name and address of the developer. This can take a bit of space so we include 0-1024 bytes for this information. The next tag is the subrecord of HEAD, “GEDC”, then the next subrecord, “VERS”. Most GEDCOM validations will look for HEAD.GEDC.VERS for the version of GEDCOM the file claims to conform with. The hex values, (0D0A|0D|0A), is the hard return accounting for the different systems that could write the GEDCOM.

A minimal GEDCOM version 5.5 would contain the following.

0 HEAD
1 GEDC
2 VERS 5.5
0 TRLR

The end of the file is marked by the tag “TRLR” in reference to a Trailer. I didn’t include this in my initial signature, but probably should have.

GEDCOM files have been around a long time, the first draft was released in 1984, but the GEDCOM structure we see now really didn’t come along until version 3 in 1987, when the format was standardized and made public. The HEAD.GEDC.VERS wasn’t standardized until version 4. You can see the history here.

So moving forward we should probably have a new PUID for Version 3, Version 4, Version 5 and the new Version 7 and leave the existing signature as is.

Version 3 only requires the tags HEAD, SOUR, DEST and the ending TRLR.

BOF 302048454144(0D0A|0D|0A)3120534F5552{0-128}312044455354
EOF 302054524C52

Version 4 requires the HEAD.GEDC.VERS sequence.

BOF 302048454144{0-1024}47454443(0D0A|0D|0A)3220564552532034
EOF 302054524C52

Version 5 is similar.

BOF 302048454144{0-1024}47454443(0D0A|0D|0A)3220564552532035
EOF 302054524C52

Version 7 is also similar.

BOF 302048454144{0-1024}47454443(0D0A|0D|0A)3220564552532037
EOF 302054524C52

For the new GED-ZIP format we need to create a container signature as the format is a ZIP file but with a GEDCOM inside. The GED-ZIP specifications states:

A GEDCOM ZIP file should:
• include exactly one GEDCOM file with the name “gedcom.ged”
• include all the multimedia objects references by that GEDCOM file
• not include unreferenced multimedia objects

Our Container signature would look like this:

<ContainerSignature Id="1000" ContainerType="ZIP">
 <Description>GEDZIP</Description>
  <Files>
   <File>
     <Path>gedcom.ged</Path>
      <BinarySignatures>
       <InternalSignatureCollection>                    
	 <InternalSignature ID="300">
	  <ByteSequence Reference="BOFoffset">
	    <SubSequence Position="1" SubSeqMinOffset="0" SubSeqMaxOffset="3">
	      <Sequence>30 20 48 45 41 44</Sequence>
	    </SubSequence>
	  </ByteSequence>
	</InternalSignature>
      </InternalSignatureCollection>
     </BinarySignatures>
    </File>               
   </Files>
</ContainerSignature>

I recently learned of a variation on the GEDCOM format which can cause a lot of confusion. The software Family Tree Maker could export to the GEDCOM format, but had a checkbox which, unchecked, allowed you to not abbreviate the tags. The tags in the GEDCOM format are expected just the way they are, which makes me wonder why they would do something so confusing. You can read more about this format here.

I was recently made aware a few of these rouge “GEDCOM” files were out there, in the wild, causing confusion during identification. My first thought was to adjust the signature to make it a little more loose to fit these variations, but then discovered they are not GEDCOM files. In fact later versions of FTM forgot they did this and would error when you tried to import them back into the software. I think it would be wise to identify these FTM GEDCOM variants, just so one is aware of the difference and can then decide how to handle them properly.

The format was named “FTW TEXT”, so we can use that to call the new signature. Instead of “0 HEAD”, “0 HEADER” is used, instead of “0 SOUR”, “0 SOURCE” is used, and instead of “0 TRLR” at the end, “0 TRAILER” is used.

BOF 3020484541444552(0D0A|0D|0A)3120534F55524345
EOF 3020545241494C4552

It was fun to look back at this format and try and improve on it a bit. I learned more than I did when I initially wrote the signature and hopefully documented it well enough. The FTM variant was an interesting twist I was not expecting, which I am sure will show up again in the future. Take a look at the signatures and samples I updated and let me know what you think.

MP4 & 360

Recently I have been exploring the MP4 format, more specifically the ISO Base Media File Format. It appears to be quite the versatile format. Based on the general Box/Atom format. Don’t mean to go much into the format here as there are so many formats which use this structure, like Quicktime MOV, Jpeg2000, to the more recent Canon RAW CR3. I have also been digging into the DASH MP4 format, but we’ll save that for a later time.

One of the more interesting uses of MP4 lately is 360 or spherical video. They are becoming more and more popular with content creators and also used for mapping like Google street view.

A while back I picked up a Insta360 Nano S camera. It attached directly to my iPhone. With a camera on each side it could capture images and video which could later be processed to produce some interesting results.

Of course it needs to be processed first so it doesn’t look like you are peering out of your peephole. Insta360 provides software for you to process the video into a regular video or some fun creative spherical video that makes you look like you are walking on a small globe.

The formats produced by the Insta360 Nano S are plain old JPG and MP4, but uses the extensions .INSP and .INSV respectively. Neither of which are documented in PRONOM yet. But because of the nature of 360 camera’s there is a little more under the hood. If you would like to look at some samples you can find some here.

The INSP file begins like any other EXIF JPEG file, but ends with a little additional info.

The 360 cameras have some additional information from the different gyros and accelerometers, as well as GPS information. The INSP file stores much of this information after the end of the JPG format. You can also see a string of alphanumeric numbers at the end, which is consistent with most of the files I have seen. One python parser of the additional data calls it the magic number. “8db42d694ccc418790edff439fe026bf” would make a good pattern for a signature.

The INSV files are similar, except they use the MP4 base media format.

Mediainfo indeed sees the file as an MPEG-4 with a AVC codec, but with a invalid extension.

Complete name                            : VID_20210222_170428_005.insv
Format                                   : MPEG-4
Format profile                           : JVT
Codec ID                                 : avc1 (avc1/isom)
File size                                : 41.1 MiB
Duration                                 : 7 s 608 ms
Overall bit rate mode                    : Variable
Overall bit rate                         : 45.4 Mb/s
Encoded date                             : UTC 2021-02-22 17:04:18
Tagged date                              : UTC 2021-02-22 17:04:18
IsTruncated                              : Yes
FileExtension_Invalid                    : braw mov mp4 m4v m4a m4b m4p m4r 3ga 3gpa 3gpp 3gp 3gpp2 3g2 k3g jpm jpx mqv ismv isma ismt f4a f4b f4v

In addition to a video and audio track, there is a text track.

Text
ID                                       : 3
Format                                   : Timed Text
Codec ID                                 : text
Duration                                 : 7 s 600 ms
Bit rate mode                            : Constant
Bit rate                                 : 240 b/s
Frame rate                               : 10.000 FPS
Stream size                              : 228 Bytes (0%)
Title                                    : Ambarella EXT
Language                                 : English
Forced                                   : No
Encoded date                             : UTC 2021-02-22 17:04:18
Tagged date                              : UTC 2021-02-22 17:04:18

With a little Exiftool magic, thank you Phil, we can see some of the extra data within the video file.

Serial Number                   : ISS2418ND7XH4H
Model                           : Insta360 Nano S
Firmware                        : v1.17.12.3_build1
Parameters                      : 2 947.866 946.388 964.646 0.000 0.000 90.000 942.993 2891.656 952.520 -0.682 -1.501 89.186 3840 1920 1040
Preview Image                   : (Binary data 578944 bytes, use -b option to extract)
Time Code                       : 62.155
Accelerometer                   : 0.0717358812689781 0.837667405605316 -0.541449248790741
Angular Velocity                : -0.00380666344426572 -0.0143540045246482 0.0170918852090836

Thanks to tools like Exiftool and MediaInfo we can take a peek into some of these formats. New ways of using the existing formats and new formats entirely keep popping up making it hard to know exactly what you have. Initially I just assumed the Insta360 formats didn’t need anything extra as they just used well known format with their own extension, but I needed to look a little closer. Many other cameras are now putting additional data at the end of a standard JPG. It will be interesting to see what new ideas camera developers come up in the coming years.

GoPro has a 360 camera as well and looking at a sample .360 file, I can see it also uses an MP4 base media format, but uses two video tracks to store video from the two cameras. Might need to dig into that format soon as well.