FlashPix

January 19, 2024 by Thor

Is there a perfect raster image format? TIFF has been around for quite some time and is generally accepted as a preferred preservation format. There have been a few attempts to have a single file contain multiple resolutions for different uses: lower resolution for the web and higher resolution for print. Even the semi-popular JPEG2000 added multiple resolutions to improve on the JPEG format. Kodak came up with a few ideas to do this as well. The Kodak PCD format, also known as PhotoCD or Image PAC, was one that was used for a while before it was abandoned. Another was FlashPix.

I briefly mentioned FlashPix in an earlier post about the Microsoft Picture It! format. They are extremely similar. Both have the same basic structure in a Compound Object format. Some of the FlashPix files generated by Picture It! even have the same identifiers in the CompObj header.

FlashPix was supposed to be the answer to all the problems with storing bitmap image data and how we view the web. Kodak partnered with some big names; Microsoft Corporation, Hewlett-Packard Company and Live Picture, Inc. were among them. Kodak marketed the format and even included it as a native file format in some of its new digital cameras. The format was made official in June of 1996, with a white paper explaining all the benefits and architecture. There was a lot of hype, some even calling it “Not your Grandma’s format”. Many graphics applications started to include support for the new format, including Adobe Photoshop. So what happened? Why didn’t the format catch on? Some say it was the size cost of storing multiple resolutions in one file; others believe it was the complicated Compound Object structure that led to its demise. Either way, the format had a lot of hype in the late 1990s, but by the year 2000 it had gone silent and all the websites went away.

FlashPix did have a big impact, and many software and hardware products were made compatible with it. There are a few stories left behind by those who scanned all their photos to the FlashPix format, only to find a few years later that it was unsupported on more modern computers. There were also a few early digital cameras which could capture directly to the format. Take my Kodak DC260 zoom camera, circa 1998. By changing the Capture Preferences, I can switch between JPG and FPX.

Using exiftool we can take a look at one of the images from the camera:

exiftool P0004795.FPX
ExifTool Version Number         : 12.73
File Name                       : P0004795.FPX
Directory                       : GitHub/digicam_corpus/Kodak/DC260/DC260_01
File Size                       : 251 kB
File Modification Date/Time     : 2024:01:06 12:54:20-07:00
File Access Date/Time           : 2024:01:06 13:20:46-07:00
File Inode Change Date/Time     : 2024:01:06 13:04:34-07:00
File Permissions                : -rwxrwxrwx
File Type                       : FPX
File Type Extension             : fpx
MIME Type                       : image/vnd.fpx
Code Page                       : Unicode UTF-16, little endian
Data Object ID                  : 13BC5A58-6B90-1B6B-12C9-0800201177F8
Data Object Status              : Exists, Not Purgeable
Creating Transform              : Source Image
Using Transforms                : 
Cached Image Height             : 1024
Cached Image Width              : 1536
Comp Obj User Type Len          : 16
Comp Obj User Type              : FlashPix_Object
Visible Outputs                 : 1
Maximum Image Index             : 1
Maximum Transform Index         : 0
Maximum Operation Index         : 0
Thumbnail Clip                  : (Binary data 18480 bytes, use -b option to extract)
Revision Number                 : 1
Create Date                     : 2024:01:06 12:53:29
Modify Date                     : 2024:01:06 12:53:29
Software                        : KODAK DIGITAL SCIENCE DC260
Image Width                     : 1536
Image Height                    : 1024
Subimage Width                  : 1536
Subimage Height                 : 1024
Subimage Color                  : RGB
Subimage Numerical Format       : 8-bit, Unsigned
Decimation Method               : None (Full-sized Image)
JPEG Tables                     : (Binary data 558 bytes, use -b option to extract)
Number Of Resolutions           : 1
Max JPEG Table Index            : 1
Scene Type                      : Original Scene
Software Release                : KODAK DIGITAL SCIENCE DC260
Make                            : Eastman Kodak Company
Camera Model Name               : KODAK DIGITAL SCIENCE DC260
Serial Number                   : 7577
Exposure Time                   : 1/180
F Number                        : 4.7
Exposure Program                : Program AE
Exposure Compensation           : 0
Subject Distance                : 0.520 m
Metering Mode                   : Center-weighted average
Light Source                    : Unknown
Focal Length                    : 24.0 mm
Max Aperture Value              : 4.6
Flash                           : No Flash
Exposure Index                  : 90
Sharpness Approximation         : 0
File Source                     : Digital Camera
Sensing Method                  : One-chip color area
Extension Create Date           : 2024:01:06 12:53:29
Extension Modify Date           : 2024:01:06 12:53:29
Creating Application            : Picoss
Extension Name                  : ijuhsimasa
Extension Persistence           : Always Valid
Extension Description           : Data Object Store 000001
Storage-Stream Pathname         : /Data Object Store 000001
Extension Class ID              : 56616000-C154-11CE-8553-00AA00A1F95B
Used Extension Numbers          : 1
Screen Nail                     : (Binary data 4304 bytes, use -b option to extract)
Subimage Tile Count             : 384
Subimage Tile Width             : 64
Subimage Tile Height            : 64
Num Channels                    : 3
Audio Stream                    : (Binary data 30780 bytes, use -b option to extract)
Aperture                        : 4.7
Image Size                      : 1536x1024
Megapixels                      : 1.6
Shutter Speed                   : 1/180
Preview Image                   : (Binary data 4164 bytes, use -b option to extract)
Focal Length                    : 24.0 mm

The file also does identify in PRONOM:

sf P0004795.FPX 
---
siegfried   : 1.11.0
scandate    : 2024-01-17T23:13:59-07:00
signature   : default.sig
created     : 2023-12-17T15:54:41+01:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V116.xml; container-signature-20231127.xml'
---
filename : 'P0004795.FPX'
filesize : 250880
modified : 2024-01-06T12:54:20-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/56'
    format  : 'Kodak FlashPix Image'
    version : 
    mime    : 'image/vnd.fpx'
    class   : 'Image (Raster)'
    basis   : 'extension match fpx; container name CompObj with byte match at 53, 36 (signature 2/2)'
    warning :

If you notice, PRONOM has two signatures for the FlashPix format, and this image was identified with signature #2. The first signature looks for the string “FlashPix Object”, while the second looks for the CLSID, which is unique to each compound object format. FlashPix has the CLSID {56616700-c154-11ce-8553-00aa00a1f95b}. Looking at the many other samples I have, there is a lot of variation in the use of the string and CLSID.

FlashPix samples:
FlashPix Object({56616000-C154-11CE-8553-00AA00A1F95B}
FlashPix Object({56616800-C154-11CE-8553-00AA00A1F95B}
Picture It! FlashPix'{56616700-C154-11CE-8553-00AA00A1F95B}
LPI FlashPix'{56616700-c154-11ce-8553-00aa00a1f95b}
FlashPix_Object'{56616700-C154-11CE-8553-00AA00A1F95B}
'{56616700-C154-11CE-8553-00AA00A1F95B}
Picture It!'{56616700-c154-11ce-8553-00aa00a1f95b}
Flashpix Toolkit Application'{56616700-c154-11ce-0000-000000000000}

The images from the Kodak camera use the “FlashPix_Object” string, so with the underscore they don’t match the first signature, while others I made using Picture It! software used a couple of variations. Many don’t use the string at all. Others use a slightly different CLSID, in both uppercase and lowercase. We will have to suggest adjustments to the current signature to identify them all.
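To get a feel for how loose a match would need to be to cover every sample listed above, here is a minimal Python sketch. The regular expression and the names `FPX_CLSID` and `looks_like_flashpix` are assumptions made up for illustration; this is not PRONOM signature syntax and not a proposed signature, just a quick way to test the idea against the strings from the list.

```python
import re

# Hypothetical, deliberately loose pattern: accepts the 60/67/68 variants of
# the first CLSID group, any case, and any value in the last two groups.
FPX_CLSID = re.compile(
    r"\{5661(?:60|67|68)00-c154-11ce-[0-9a-f]{4}-[0-9a-f]{12}\}",
    re.IGNORECASE,
)

# The sample strings exactly as listed above.
SAMPLES = [
    "FlashPix Object({56616000-C154-11CE-8553-00AA00A1F95B}",
    "FlashPix Object({56616800-C154-11CE-8553-00AA00A1F95B}",
    "Picture It! FlashPix'{56616700-C154-11CE-8553-00AA00A1F95B}",
    "LPI FlashPix'{56616700-c154-11ce-8553-00aa00a1f95b}",
    "FlashPix_Object'{56616700-C154-11CE-8553-00AA00A1F95B}",
    "'{56616700-C154-11CE-8553-00AA00A1F95B}",
    "Picture It!'{56616700-c154-11ce-8553-00aa00a1f95b}",
    "Flashpix Toolkit Application'{56616700-c154-11ce-0000-000000000000}",
]


def looks_like_flashpix(text: str) -> bool:
    """Return True if the text contains one of the CLSID variants."""
    return FPX_CLSID.search(text) is not None


for sample in SAMPLES:
    print(looks_like_flashpix(sample), sample)
```

A pattern this permissive may well be too broad for a real signature; the point is only that matching on the CLSID, rather than the user type string, is what ties all of these samples together.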

Looking at the contents of the OLE container we can see some interesting things.

Path = P0004795.FPX
Type = Compound
Physical Size = 250880
Extension = compound
Cluster Size = 512
Sector Size = 64

Size         Compressed     Name
------------ ------------  ------------------------
188          192           [5]Data Object 000001
272          320           [1]CompObj
388          448           [5]Extension List
144          192           [5]Global Info
                           Data Object Store 000001
18704        18944         [5]SummaryInformation
816          832           Data Object Store 000001/[5]Image Contents
272          320           Data Object Store 000001/[1]CompObj
988          1024          Data Object Store 000001/[5]Extension List
1624         1664          Data Object Store 000001/[5]Image Info
4332         4608          Data Object Store 000001/[5]Screen Nail_bd0100609719a180
                           Data Object Store 000001/Resolution 0005
                           Data Object Store 000001/Audio_bd0100609719a180
1112         1152          Data Object Store 000001/[5]KDC_bd0100609719a180
72           128           Data Object Store 000001/[5]SummaryInformation
108          128           Data Object Store 000001/Audio_bd0100609719a180/[5]Audio Info
30808        31232         Data Object Store 000001/Audio_bd0100609719a180/Audio Stream 000000
6208         6656          Data Object Store 000001/Resolution 0005/Subimage 0000 Header
176378       176640        Data Object Store 000001/Resolution 0005/Subimage 0000 Data
------------ ------------  ------------------------
242414       244480        16 files, 3 folders

The main CompObj is where we find the identification information, but the Data Object Store 000001 directory is where all the image data is stored. In a multiple resolution image we might see additional Resolution directories. You may also notice a mention of an Audio directory. Yes, this image was captured and then audio was recorded with it. Not a video, but an audio clip associated with the image. FlashPix can contain audio streams. This isn’t the first time we have seen this: HP cameras also have this function, which as it turns out is stored in a FlashPix Exif extension within a JPEG.

The FlashPix native format may have disappeared, but the format lives on as an extension to Exif data, allowing you to embed audio and other media within a JPEG file. The code for FlashPix was given to ImageMagick and is maintained by them.

Presto!

January 12, 2024 by Thor

Working in preservation and archiving for the last few years has caused me to change a habit most people use every day: the double-click. I am usually opening a file in a hex editor or Control-clicking on a file to open it in a different software application than the default. Maybe it’s just me, but having control over opening a file is essential. The thought of double-clicking on a file and the uncertainty of what is actually happening scares me a little.

Of course, opening an application executable requires a double-click or a right-click/open process, and from there you can open the file of your choosing. Executables are runnable files because they have the required pieces for the operating system and CPU to interpret and, well, run. We need executables in order to make sense of the files we preserve. Without something to interpret the data in our files, they are just a bunch of ones and zeros.

Take a PDF for example. By itself, it is hard to make sense of the file. You need Acrobat Reader, or any number of other executable software programs to open and render the PDF.

But what if you could take a file and wrap it in an executable so it is all self-contained, the file format and an executable in one file? No separate software needed! On the surface this seems like a great idea, which is why a few software companies had this as an option. An early competitor of PDF, Common Ground, had the option to embed the DP file into a self-contained viewer. Many archive software tools have the ability to make “self-extracting” executables as well. One obvious downside is being unable to execute on a different platform or a later operating system. But at the time they were very convenient.

One application in particular added the option to export a few different formats into a special wrapper, making them viewable on any Windows machine.

NewSoft Technology Corporation’s Presto! PageManager is document management software which can view many different file types. The software helps manage document and photo scanning and keeps everything organized. It often came bundled with home consumer scanners, such as the UMAX Astra scanner I bought years ago. With the Windows version of the software you can take one or more photos and “wrap” them into a Presto! Wrapper.

Once exported to a Presto! Wrapper, the files within have a portable viewer wrapped up with them. One double-click and Presto!, you can view, rotate, export, and print your images. The wrapper has your typical .EXE extension and identifies as such.

sf Presto6-s02.EXE
---
siegfried   : 1.11.0
scandate    : 2024-01-09T23:39:36-07:00
signature   : default.sig
created     : 2023-12-17T15:54:41+01:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V116.xml; container-signature-20231127.xml'
---
filename : 'Presto6-s02.EXE'
filesize : 818301
modified : 2024-01-07T23:48:01-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/899'
    format  : 'Windows Portable Executable'
    version : '32 bit'
    mime    : 'application/vnd.microsoft.portable-executable'
    class   : 
    basis   : 'extension match exe; byte match at [[0 2] [232 94]]'

hexdump -C Presto6-s02.EXE | head
00000000  4d 5a 90 00 03 00 00 00  04 00 00 00 ff ff 00 00  |MZ..............|
00000010  b8 00 00 00 00 00 00 00  40 00 00 00 00 00 00 00  |........@.......|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 e8 00 00 00  |................|
00000040  0e 1f ba 0e 00 b4 09 cd  21 b8 01 4c cd 21 54 68  |........!..L.!Th|
00000050  69 73 20 70 72 6f 67 72  61 6d 20 63 61 6e 6e 6f  |is program canno|
00000060  74 20 62 65 20 72 75 6e  20 69 6e 20 44 4f 53 20  |t be run in DOS |
00000070  6d 6f 64 65 2e 0d 0d 0a  24 00 00 00 00 00 00 00  |mode....$.......|
00000080  99 72 8f bf dd 13 e1 ec  dd 13 e1 ec dd 13 e1 ec  |.r..............|
00000090  5e 0f ef ec dc 13 e1 ec  b2 0c eb ec d6 13 e1 ec  |^...............|

The preservation of executables is, in my opinion, complicated. Running a 32-bit executable on a computer today might not even work. Then we have to get into the license for the software and whether it allows us to use it freely in perpetuity. So as much as this is an executable, knowing it is also a wrapper for regular images is important, because it gives us an option for preservation: the files wrapped inside can be exported and preserved. So what makes this executable unique? Let’s look a little closer.

00005000  00 00 00 00 11 2e 40 00  00 10 40 00 80 1f 40 00  |......@...@...@.|
00005010  c0 24 40 00 00 00 00 00  00 00 00 00 00 00 00 00  |.$@.............|
00005020  50 6d 76 69 65 77 20 69  73 20 63 6c 6f 73 65 2e  |Pmview is close.|
00005030  00 00 00 00 5c 00 00 00  74 6d 70 00 5c 54 45 4d  |....\...tmp.\TEM|
00005040  50 00 00 00 20 4e 65 77  53 6f 66 74 20 56 69 65  |P... NewSoft Vie|
00005050  77 65 72 00 34 31 36 44  37 30 36 43 36 31 37 39  |wer.416D706C6179|
00005060  36 35 37 32 00 00 00 00  41 6d 70 6c 61 79 65 72  |6572....Amplayer|
00005070  00 00 00 00 70 6d 76 69  65 77 2e 65 78 65 00 00  |....pmview.exe..|
00005080  41 6d 70 6c 61 79 65 72  2e 65 78 65 20 67 72 65  |Amplayer.exe gre|
00005090  65 74 2e 69 64 20 56 00  41 6d 70 6c 61 79 65 72  |et.id V.Amplayer|
000050a0  2e 65 78 65 00 00 00 00  2e 2e 00 00 2e 00 00 00  |.exe............|
000050b0  5c 2a 2e 2a 00 00 00 00  4c 6f 63 61 6c 20 41 70  |\*.*....Local Ap|
000050c0  70 57 69 7a 61 72 64 2d  47 65 6e 65 72 61 74 65  |pWizard-Generate|
000050d0  64 20 41 70 70 6c 69 63  61 74 69 6f 6e 73 00 00  |d Applications..|
000050e0  57 72 61 70 70 65 72 00  43 45 78 70 76 77 44 6f  |Wrapper.CExpvwDo|
000050f0  63 00 00 00 43 45 78 70  76 77 56 69 65 77 00 00  |c...CExpvwView..|

It is indeed a wrapper. The header looks like any other EXE file, but a little further into the file we can see some strings specific to the viewer. In all my samples I can see the string “NewSoft Viewer”. That might be enough to distinguish it from other executables. See some samples here.
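As a rough sketch of that idea in Python: confirm the file is a PE executable (the `MZ` stub, then the `PE` header at the offset stored at 0x3C, which is 0xE8 = 232 in the dump above and matches the second offset siegfried reports), then look for the viewer string anywhere in the file. The function name and the choice to scan the whole file are assumptions for illustration; a real signature would want a bounded offset range.

```python
import struct

# The string as it appears in the hex dump at 0x5044 (without the leading space).
MARKER = b"NewSoft Viewer"


def is_presto_wrapper_candidate(data: bytes) -> bool:
    """Heuristic check: PE executable that also contains the viewer string."""
    # A DOS/PE stub is at least 0x40 bytes and starts with "MZ".
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # Offset 0x3C holds the little-endian file offset of the PE header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        return False
    # Unbounded search, which is fine for a sketch but slow on large files.
    return MARKER in data
```

It could be tried on a sample by passing in the file's bytes, for example the result of reading `Presto6-s02.EXE` in binary mode.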

I guess part of the question is whether identifying specific software executables is needed in preservation. Aren’t they all executables, and shouldn’t they be treated similarly? This isn’t the first executable of this type I have seen. A while back I came across another piece of home software which allowed you to make a slideshow, complete with audio, and wrap it into an executable to put on a disk, so playback was easy for the user and nothing additional was needed. The software is called Family Album Creator; use at your own risk.

PNG Plus

January 5, 2024 by Thor

Usually in the software world file formats are fairly efficient; the structure is meant to provide a way to store the data of the software being used. There isn’t much need for unnecessary additions. This isn’t always true, but in the early days disk space was expensive, so compression and efficiency ruled. There also wasn’t much need to hide anything or complicate things, unless it was intended. This makes me think of two things: polyglots and steganography.

Steganography is the art of embedding data within an image. With digital images you can hide another image within the main image by using the most and least significant bits. Fun use of technology, but not something you normally would find in your regular desktop software.
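A minimal Python sketch of the bit trick, assuming 8-bit samples and a 4-bit split (both choices are mine, purely for illustration): the most significant bits of each secret byte are tucked into the least significant bits of the matching cover byte. The `hide` and `reveal` names are hypothetical.

```python
def hide(cover: bytes, secret: bytes, bits: int = 4) -> bytes:
    """Store the top `bits` of each secret byte in the low `bits` of the cover."""
    keep_mask = (0xFF << bits) & 0xFF  # keeps the cover's high bits
    return bytes(
        (c & keep_mask) | (s >> (8 - bits))
        for c, s in zip(cover, secret)
    )


def reveal(stego: bytes, bits: int = 4) -> bytes:
    """Recover an approximation of the secret from the low bits."""
    return bytes((b << (8 - bits)) & 0xFF for b in stego)


stego = hide(b"\xab", b"\xcd")
print(stego.hex(), reveal(stego).hex())
```

The recovered secret loses its low bits, which is why the hidden image comes back a little degraded while the cover still looks close to the original.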

Ange is the master at polyglots. If you haven’t watched his presentation on funky file formats, you are missing out.

.@Gynvael’s png/zip polyglot visualized (cf his article in the latest @pagedout_zine) pic.twitter.com/5BR6GLoB98
— Ange (@angealbertini) December 18, 2023

Imagine my surprise when I was researching the Picture It! software and the MIX file format, only to discover that Microsoft decided to make their own polyglot of sorts for their PNG Plus format, which replaced the MIX format; both became obsolete when Digital Image was discontinued in 2007. The PNG Plus format was the native format for the Microsoft Picture It! and Digital Image software, often found with the Microsoft Works or Digital Imaging suites of software.

According to the help within Digital Image:

The PNG Plus format uses the standard PNG extension but provides saving of layers and pages within the PNG format. Since the PNG format cannot do this natively, how did Microsoft accomplish this? Well, by throwing an OLE container into the middle of the file of course!

PNG Plus files are your regular PNG format and will identify as such. But they are just a low resolution thumbnail of the full image. Let’s take a look:

exiftool PictureIt7-s02.png 
ExifTool Version Number         : 12.70
File Name                       : PictureIt7-s02.png
File Size                       : 26 kB
File Modification Date/Time     : 2023:12:26 22:01:58-07:00
File Access Date/Time           : 2024:01:01 12:31:07-07:00
File Inode Change Date/Time     : 2023:12:26 22:01:58-07:00
File Permissions                : -rwx------
File Type                       : PNG
File Type Extension             : png
MIME Type                       : image/png
Image Width                     : 500
Image Height                    : 333
Bit Depth                       : 8
Color Type                      : RGB with Alpha
Compression                     : Deflate/Inflate
Filter                          : Adaptive
Interlace                       : Noninterlaced
SRGB Rendering                  : Perceptual
Gamma                           : 2.2
White Point X                   : 0.3127
White Point Y                   : 0.329
Red X                           : 0.64
Red Y                           : 0.33
Green X                         : 0.3
Green Y                         : 0.6
Blue X                          : 0.15
Blue Y                          : 0.06
Warning                         : [minor] Text/EXIF chunk(s) found after PNG IDAT (may be ignored by some readers)
Title                           : PictureIt7-s02
Image Size                      : 500x333
Megapixels                      : 0.167

Looks like there is some additional data after the IDAT chunk.

hexdump -C PictureIt7-s02.png | head
00000000  89 50 4e 47 0d 0a 1a 0a  00 00 00 0d 49 48 44 52  |.PNG........IHDR|
00000010  00 00 01 f4 00 00 01 4d  08 06 00 00 00 f6 13 9d  |.......M........|
00000020  37 00 00 00 01 73 52 47  42 00 ae ce 1c e9 00 00  |7....sRGB.......|
00000030  00 04 67 41 4d 41 00 00  b1 8f 0b fc 61 05 00 00  |..gAMA......a...|
00000040  00 20 63 48 52 4d 00 00  7a 26 00 00 80 84 00 00  |. cHRM..z&......|
00000050  fa 00 00 00 80 e8 00 00  75 30 00 00 ea 60 00 00  |........u0...`..|
00000060  3a 98 00 00 17 70 9c ba  51 3c 00 00 24 f4 49 44  |:....p..Q<..$.ID|
00000070  41 54 78 5e ed dd 4d a8  15 57 be 28 f0 1e 08 1e  |ATx^..M..W.(....|
00000080  e3 47 8e 49 ab c7 d8 81  03 09 41 9c 28 38 e8 80  |.G.I......A.(8..|
00000090  d0 9c 0e 08 0e 1a 11 c2  15 07 5e 5a 07 4d c7 2b  |..........^Z.M.+|

The header looks the same as any PNG file, so let’s look a little further:

00002560  ff 1f fa 5f 90 66 c9 e6  ad 88 00 00 00 00 63 6d  |..._.f........cm|
00002570  4f 44 4e 88 09 c1 00 00  40 00 63 70 49 70 d0 cf  |ODN.....@.cpIp..|
00002580  11 e0 a1 b1 1a e1 00 00  00 00 00 00 00 00 00 00  |................|
00002590  00 00 00 00 00 00 3e 00  03 00 fe ff 09 00 06 00  |......>.........|
000025a0  00 00 00 00 00 00 00 00  00 00 01 00 00 00 01 00  |................|
000025b0  00 00 00 00 00 00 00 10  00 00 02 00 00 00 01 00  |................|
*
00002970  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff 52 00  |..............R.|
00002980  6f 00 6f 00 74 00 20 00  45 00 6e 00 74 00 72 00  |o.o.t. .E.n.t.r.|
00002990  79 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |y...............|
000029a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000029b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 16 00  |................|
000029c0  05 00 ff ff ff ff ff ff  ff ff 01 00 00 00 7e 7f  |..............~.|
000029d0  3f b5 a5 f6 86 43 a1 a1  a3 02 24 d2 88 ef 00 00  |?....C....$.....|
000029e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000029f0  00 00 03 00 00 00 40 12  00 00 00 00 00 00 44 00  |......@.......D.|
00002a00  61 00 74 00 61 00 53 00  74 00 6f 00 72 00 65 00  |a.t.a.S.t.o.r.e.|
00002a10  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003930  00 00 00 00 00 00 00 00  00 00 00 00 00 00 43 48  |..............CH|
00003940  4e 4b 49 4e 4b 20 04 00  07 00 0c 00 00 03 00 02  |NKINK ..........|
00003950  00 00 00 0a 00 00 f8 01  0c 00 ff ff ff ff 18 00  |................|
00003960  54 45 58 54 00 00 01 00  00 00 54 45 58 54 00 02  |TEXT......TEXT..|
00003970  00 00 22 00 00 00 18 00  46 44 50 50 00 00 43 00  |..".....FDPP..C.|
00003980  4f 00 4e 00 54 00 45 00  4e 00 54 00 53 00 00 00  |O.N.T.E.N.T.S...|
00003990  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000039f0  00 00 1f 00 00 00 00 0a  00 00 00 00 00 00 01 00  |................|
00003a00  43 00 6f 00 6d 00 70 00  4f 00 62 00 6a 00 00 00  |C.o.m.p.O.b.j...|
00003a10  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00004530  00 00 00 00 00 00 00 00  00 00 00 00 00 00 01 00  |................|
00004540  fe ff 03 0a 00 00 ff ff  ff ff 00 00 00 00 00 00  |................|
00004550  00 00 00 00 00 00 00 00  00 00 1a 00 00 00 51 75  |..............Qu|
00004560  69 6c 6c 39 36 20 53 74  6f 72 79 20 47 72 6f 75  |ill96 Story Grou|
00004570  70 20 43 6c 61 73 73 00  ff ff ff ff 01 00 00 00  |p Class.........|
00004580  00 00 00 00 f4 39 b2 71  00 00 00 00 00 00 00 00  |.....9.q........|
00004590  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00006570  00 00 00 00 00 00 00 00  00 00 00 00 00 00 ba 84  |................|
00006580  43 51 00 00 00 18 69 54  58 74 54 69 74 6c 65 00  |CQ....iTXtTitle.|
00006590  00 00 00 00 50 69 63 74  75 72 65 49 74 37 2d 73  |....PictureIt7-s|
000065a0  30 32 3a 70 9c 00 00 00  00 14 74 45 58 74 54 69  |02:p......tEXtTi|
000065b0  74 6c 65 00 50 69 63 74  75 72 65 49 74 37 2d 73  |tle.PictureIt7-s|
000065c0  30 32 f2 8f d5 89 00 00  00 00 49 45 4e 44 ae 42  |02........IEND.B|
000065d0  60 82                                             |`.|

What do we have here? Near the end of the file, before the IEND chunk, is an OLE file with the very recognizable hex values of “D0CF11E0”. Let’s strip out the OLE file and take a look.

Path = PictureIt7-s02-ole
Type = Compound
WARNINGS:
There are data after the end of archive
Physical Size = 8704
Tail Size = 7764
Extension = compound
Cluster Size = 512
Sector Size = 64

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2023-12-26 22:01:58 D....                            DataStore
2023-12-26 22:01:58 D....                            Text
                    .....         2560         2560  Text/CONTENTS
                    .....           86          128  Text/[1]CompObj
                    .....           96          128  DataStore/3
                    .....            4           64  DataStore/1
                    .....          121          128  DataStore/0
                    .....           57           64  DataStore/2
                    .....           98          128  DataStore/5
                    .....            4           64  DataStore/4
                    .....         1254         1280  DataStore/7
                    .....            4           64  DataStore/6
                    .....            4           64  DataStore/8
------------------- ----- ------------ ------------  ------------------------
2023-12-26 22:01:58               4288         4672  11 files, 2 folders

Interesting, I don’t think I have come across a standard format with a container embedded within. I have come across many OLE and ZIP containers which contain other common formats within, but this format is definitely unique. Others have added features in the IDAT chunk, such as a web shell. I am sure there are others out there. The CompObj file found within the Text directory is very similar to the Microsoft Works and Publisher format. Although trying to open the file in Publisher doesn’t work!

hexdump -C PictureIt7-s02-ole/Text/\[1\]CompObj | head
00000000  01 00 fe ff 03 0a 00 00  ff ff ff ff 00 00 00 00  |................|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 1a 00 00 00  |................|
00000020  51 75 69 6c 6c 39 36 20  53 74 6f 72 79 20 47 72  |Quill96 Story Gr|
00000030  6f 75 70 20 43 6c 61 73  73 00 ff ff ff ff 01 00  |oup Class.......|
00000040  00 00 00 00 00 00 f4 39  b2 71 00 00 00 00 00 00  |.......9.q......|
00000050  00 00 00 00 00 00                                 |......|
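The bytes in that dump can be picked apart with a few lines of Python. This is a sketch under an assumed layout: a 28-byte header whose last 16 bytes carry the CLSID, followed by a 4-byte little-endian length and a null-terminated user type string. That layout fits this dump, but treat it as my reading of it rather than a full CompObj parser; `parse_compobj` is a hypothetical helper name.

```python
import struct
import uuid

# The 86 bytes from the hexdump above, rebuilt row by row.
SAMPLE = bytes.fromhex(
    "0100feff030a0000ffffffff" + "00" * 4
    + "00" * 12 + "1a000000"
    + "5175696c6c39362053746f7279204772"
    + "6f757020436c61737300ffffffff0100"
    + "000000000000f439b271" + "00" * 6
    + "00" * 6
)


def parse_compobj(data: bytes):
    """Return (CLSID, user type string) under the assumed layout."""
    # Bytes 12..27: CLSID in the mixed-endian form used for GUIDs on disk.
    clsid = uuid.UUID(bytes_le=data[12:28])
    # Bytes 28..31: length of the user type string, including its null.
    (length,) = struct.unpack_from("<I", data, 28)
    user_type = data[32:32 + length].rstrip(b"\x00").decode("latin-1")
    return clsid, user_type


clsid, user_type = parse_compobj(SAMPLE)
print(clsid, user_type)
```

For this stream the CLSID field is all zeros and the user type is “Quill96 Story Group Class”, so unlike the FlashPix files there is no class ID here to hang a signature on.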

PRONOM uses binary and container signatures to identify file formats. Even though this file format contains a valid OLE container, because it is within a regular binary file format, I don’t believe a container signature would work. The difficulty will be to clearly identify this new format without falsely identifying a regular PNG instead. The OLE file header is not in a consistent location, so we can’t use a specific offset. Making the string a variable location can cause some undue processing, so let’s look to see if there is anything else we can use to make a positive ID.

The PNG file format is based on chunks: you have to have an IHDR, then an IDAT, and the IEND chunk. If we take a look at a regular PNG file using the libpng tool pngcheck, we see this:

pngcheck -cvt rgb-8.png 
File: rgb-8.png (759 bytes)
  chunk IHDR at offset 0x0000c, length 13
    256 x 256 image, 24-bit RGB, non-interlaced
  chunk tEXt at offset 0x00025, length 44, keyword: Copyright
    ? 2013,2015 John Cunningham Bowler
  chunk iTXt at offset 0x0005d, length 116, keyword: Licensing
    compressed, language tag = en
    no translated keyword, 101 bytes of UTF-8 text
  chunk IDAT at offset 0x000dd, length 518
    zlib: deflated, 32K window, maximum compression
  chunk IEND at offset 0x002ef, length 0
No errors detected in rgb-8.png (5 chunks, 99.6% compression).

The required chunks are there, plus a couple of extras, tEXt and iTXt, which are textual metadata you can add. Now let’s look at a PNG Plus file:

pngcheck -cvt PictureIt7-s02.png         
File: PictureIt7-s02.png (26066 bytes)
  chunk IHDR at offset 0x0000c, length 13
    500 x 333 image, 32-bit RGB+alpha, non-interlaced
  chunk sRGB at offset 0x00025, length 1
    rendering intent = perceptual
  chunk gAMA at offset 0x00032, length 4: 0.45455
  chunk cHRM at offset 0x00042, length 32
    White x = 0.3127 y = 0.329,  Red x = 0.64 y = 0.33
    Green x = 0.3 y = 0.6,  Blue x = 0.15 y = 0.06
  chunk IDAT at offset 0x0006e, length 9460
    zlib: deflated, 32K window, fast compression
  chunk cmOD at offset 0x0256e, length 0
    Microsoft Picture It private, ancillary, unsafe-to-copy chunk
  chunk cpIp at offset 0x0257a, length 16384
    Microsoft Picture It private, ancillary, safe-to-copy chunk
  chunk iTXt at offset 0x06586, length 24, keyword: Title
    uncompressed, no language tag
    no translated keyword, 15 bytes of UTF-8 text
  chunk tEXt at offset 0x065aa, length 20, keyword: Title
    PictureIt7-s02
  chunk IEND at offset 0x065ca, length 0
No errors detected in PictureIt7-s02.png (10 chunks, 96.1% compression).

It looks like we have the required chunks and some textual chunks, but also a couple of chunks which pngcheck describes as private and identifies as Microsoft Picture It chunks. The cpIp chunk is the one which contains the OLE container. This is the chunk we need to identify in a signature. The problem is that the offset for the cpIp chunk is not the same each time. Here is one from Digital Image 10 Pro.

  chunk cpIp at offset 0x737a7, length 245760
    Microsoft Picture It private, ancillary, safe-to-copy chunk

That is significantly further into the file than in the other example. These samples currently identify as PNG 1.2 files, PRONOM fmt/13, so we can use that signature and add to it, but it currently doesn’t look for IDAT, only the iTXt chunk, which is probably not optimal. For PNG Plus, let’s match the header which includes IHDR, then IDAT, then the cpIp chunk, then an end-of-file sequence for IEND. Take a look at my signature and samples. I am curious how many PNG Plus files are out there, hidden to the world.
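Outside of a PRONOM signature, the same check is easy to sketch in Python by walking the chunk list instead of relying on a fixed offset. The function names here are hypothetical, CRCs are not verified, and the test file is a synthetic stand-in with the same chunk order as my sample, not a real PNG Plus file. In the hex dump earlier, the cpIp payload begins directly with the OLE magic, which is what the check relies on.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def iter_chunks(data: bytes):
    """Yield (type, payload) for each PNG chunk; CRCs are not verified."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos + 12 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, payload, 4-byte CRC.
        length, ctype = struct.unpack_from(">I4s", data, pos)
        yield ctype, data[pos + 8:pos + 8 + length]
        if ctype == b"IEND":
            break
        pos += 12 + length


def find_embedded_ole(data: bytes):
    """Return the cpIp payload if it starts with the OLE magic, else None."""
    for ctype, payload in iter_chunks(data):
        if ctype == b"cpIp" and payload.startswith(OLE_MAGIC):
            return payload
    return None


def make_chunk(ctype: bytes, payload: bytes = b"") -> bytes:
    """Build one chunk; used only to fabricate the synthetic test file."""
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)


# Synthetic file: IHDR, IDAT, cmOD, cpIp (with OLE magic), IEND.
fake = PNG_SIGNATURE + b"".join([
    make_chunk(b"IHDR", bytes(13)),
    make_chunk(b"IDAT"),
    make_chunk(b"cmOD"),
    make_chunk(b"cpIp", OLE_MAGIC + bytes(24)),
    make_chunk(b"IEND"),
])
print([ctype.decode() for ctype, _ in iter_chunks(fake)])
print(find_embedded_ole(fake) is not None)
```

Writing the returned payload to disk would be one way to strip out the OLE container for inspection with other tools.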

It turns out there is another PNG flavor which has been enhanced to allow for layers and pages. Adobe Fireworks uses PNG as its native format. It also uses private chunks, but without an OLE container, and its additional chunks come before the IDAT chunk:

  chunk prVW at offset 0x00092, length 1700
    Macromedia Fireworks preview chunk (private, ancillary, unsafe to copy)
  chunk mkBF at offset 0x00742, length 72
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkTS at offset 0x00796, length 36716
    Macromedia Fireworks(?) private, ancillary, unsafe-to-copy chunk
  chunk mkBS at offset 0x0970e, length 190
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x097d8, length 1251
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x09cc7, length 1358
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x0a221, length 1145
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x0a6a6, length 339
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x0a805, length 695
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x0aac8, length 3799
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x0b9ab, length 7733
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x0d7ec, length 2741
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x0e2ad, length 5153
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk
  chunk mkBT at offset 0x0f6da, length 10775
    Macromedia Fireworks private, ancillary, unsafe-to-copy chunk

It’s hard to know what each of the chunks is for and if they are all required for the Fireworks PNG format. From the book on PNG:

In addition to supporting PNG as an output format, Fireworks actually uses PNG as its native file format for day-to-day intermediate saves. This is possible thanks to PNG’s extensible “chunk-based” design, which allows programs to incorporate application-specific data in a well-defined way. Macromedia has embraced this capability, defining at least four custom chunk types that hold various things pertinent to the editor. Unfortunately, one of them (pRVW) violates the PNG naming rules by claiming to be an officially registered, public chunk type, but this was an oversight and should be fixed in version 2.0.
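The naming rules mentioned in that quote come from the PNG specification, where the case of each of the four letters in a chunk type carries meaning: lowercase first letter means ancillary, lowercase second letter means private, the third letter is reserved and should be uppercase, and a lowercase fourth letter means safe to copy. A small Python sketch (the function name and dictionary keys are mine) shows why pngcheck labels the chunks the way it does:

```python
def chunk_properties(name):
    """Decode the four case bits of a PNG chunk type such as 'cpIp'."""
    if len(name) != 4 or not name.isascii() or not name.isalpha():
        raise ValueError("chunk types are four ASCII letters")
    return {
        "ancillary": name[0].islower(),        # uppercase would mean critical
        "private": name[1].islower(),          # uppercase would mean public
        "reserved_bit_ok": name[2].isupper(),  # must be uppercase in current PNG
        "safe_to_copy": name[3].islower(),     # uppercase means unsafe to copy
    }


for chunk_type in ("cpIp", "prVW", "mkBT", "pRVW"):
    print(chunk_type, chunk_properties(chunk_type))
```

The uppercase R in pRVW is what makes it claim to be a public chunk, which is the problem the book points out, while the prVW spelling in the pngcheck output above reads as private.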

Picture It!

December 29, 2023 by Thor

Most everyone has heard of Microsoft Office, the suite of applications used by millions every day. Fewer people know about Microsoft Works, which was a lower cost alternative, but was quite popular as a home office suite of applications. One tool which often came with the Works suite was a digital image tool called Picture It!

Picture It! was a photo editing tool first released by Microsoft in 1996 geared to making photo editing easy and affordable.

Picture It! used a wizard type interface which walked you through acquiring an image and adding to it. One of the key features of the software was the ability to “stack” objects like layers. Because of this feature a new file format was used to save this information to disk. Meet the Microsoft Image (Picture) Extension format, commonly known as the MIX file format. It is very similar to the FlashPix image format, which was supposed to be an image file format to solve many delivery issues, but didn’t seem to gain hold despite being created by Kodak, HP, and others. In fact many of the MIX files I found on Microsoft disks are actually FlashPix files.

The MIX extension was also used by another Microsoft program, PhotoDraw, which causes confusion as they were similar, but PhotoDraw has some added features which may not be compatible with Picture It!. Both formats are based on the Microsoft Compound Object (OLE) container, and have a similar structure. Let’s take a look at a MIX file from Picture It! version 1.

7z l PictureIt1-s02.mix                 

--
Path = PictureIt1-s02.mix
Type = Compound
Physical Size = 48128
Extension = compound
Cluster Size = 512
Sector Size = 64

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
                    .....          328          384  [5]Data Object 000001
                    .....          396          448  [5]Transform 000004
                    .....          872          896  [5]Operation 000001
                    .....          320          320  [1]CompObj
                    .....          292          320  [5]Global Info
                    .....          872          896  [5]Operation 000002
                    .....          144          192  [5]Operation 000003
                    .....          684          704  [5]Transform 000008
                    .....         1028         1088  [5]Transform 000009
                    .....          328          384  [5]Data Object 000009
                    .....          324          384  [5]Data Object 000005
2023-12-27 11:04:39 D....                            Data Object Store 000001
                    .....          328          384  [5]Data Object 000010
                    .....        20932        20992  [5]SummaryInformation
                    .....          200          256  [5]Microsoft Embedding Info
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0001
                    .....         1400         1408  Data Object Store 000001/[5]Image Contents
                    .....          230          256  Data Object Store 000001/[1]CompObj
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0000
                    .....           28           64  Data Object Store 000001/Resolution 0000/Subimage 0000 Data
                    .....           80          128  Data Object Store 000001/Resolution 0000/Subimage 0000 Header
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0003
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0002
                    .....           28           64  Data Object Store 000001/Resolution 0002/Subimage 0000 Data
                    .....          208          256  Data Object Store 000001/Resolution 0002/Subimage 0000 Header
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0005
2023-12-27 11:04:39 D....                            Data Object Store 000001/Resolution 0004
                    .....           28           64  Data Object Store 000001/Resolution 0004/Subimage 0000 Data
                    .....         1792         1792  Data Object Store 000001/Resolution 0004/Subimage 0000 Header
                    .....          124          128  Data Object Store 000001/[5]SummaryInformation
                    .....           28           64  Data Object Store 000001/Resolution 0005/Subimage 0000 Data
                    .....         6976         7168  Data Object Store 000001/Resolution 0005/Subimage 0000 Header
                    .....           28           64  Data Object Store 000001/Resolution 0003/Subimage 0000 Data
                    .....          544          576  Data Object Store 000001/Resolution 0003/Subimage 0000 Header
                    .....           28           64  Data Object Store 000001/Resolution 0001/Subimage 0000 Data
                    .....          128          128  Data Object Store 000001/Resolution 0001/Subimage 0000 Header
------------------- ----- ------------ ------------  ------------------------
2023-12-27 11:04:39              38698        39872  29 files, 7 folders

This is a simple MIX file with one line of text, but contains a lot of content inside the OLE container. If I try and use the PRONOM registry to identify the file, I get:

sf PictureIt1-s02.mix 
---
siegfried   : 1.11.0
scandate    : 2023-12-27T11:06:32-07:00
signature   : default.sig
created     : 2023-12-17T15:54:41+01:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V116.xml; container-signature-20231127.xml'
---
filename : 'PictureIt1-s02.mix'
filesize : 48128
modified : 2023-12-27T11:04:40-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/111'
    format  : 'OLE2 Compound Document Format'
    version : 
    mime    : 
    class   : 'Text (Structured)'
    basis   : 'byte match at 0, 30'
    warning :

Hmm, we know it is an OLE compound document, but it should identify as a Picture It! file as PRONOM has defined a PUID for the format. fmt/936 has been defined as “Microsoft Picture It! Image File 1”. So I am not sure why this file from version 1 is not identifying correctly. Let’s take a look. The PRONOM container signature for fmt/936 is looking for this:

    <ContainerSignature Id="17015" ContainerType="OLE2">
      <Description>Microsoft Picture It! Image File</Description>
      <Files>
        <File>
          <Path>CompObj</Path>
          <BinarySignatures>
            <InternalSignatureCollection>
              <InternalSignature ID="17015">
                <ByteSequence Reference="BOFoffset">
                  <SubSequence Position="1" SubSeqMinOffset="32"
                               SubSeqMaxOffset="32">
                    <Sequence>'Microsoft Picture It! version 1 Picture'</Sequence>
                  </SubSequence>
                </ByteSequence>
              </InternalSignature>
            </InternalSignatureCollection>
          </BinarySignatures>
        </File>
      </Files>
    </ContainerSignature>

The container signature is looking into the OLE container for the “CompObj” file (which seems to be required), then looks for the string “Microsoft Picture It! version 1 Picture” starting at byte offset 32 (0x20). That is pretty specific. The sample file I am using as an example has the following string of bytes.

hexdump -C PictureIt1-s02/\[1\]CompObj 
00000000  01 00 fe ff 03 0a 00 00  ff ff ff ff 00 68 61 56  |.............haV|
00000010  54 c1 ce 11 85 53 00 aa  00 a1 f9 5b 1e 00 00 00  |T....S.....[....|
00000020  4d 69 63 72 6f 73 6f 66  74 20 50 69 63 74 75 72  |Microsoft Pictur|
00000030  65 20 49 74 21 20 50 69  63 74 75 72 65 00 27 00  |e It! Picture.'.|
00000040  00 00 7b 35 36 36 31 36  38 30 30 2d 43 31 35 34  |..{56616800-C154|
00000050  2d 31 31 43 45 2d 38 35  35 33 2d 30 30 41 41 30  |-11CE-8553-00AA0|
00000060  30 41 31 46 39 35 42 7d  00 13 00 00 00 50 69 63  |0A1F95B}.....Pic|
00000070  74 75 72 65 49 74 21 2e  50 69 63 74 75 72 65 00  |tureIt!.Picture.|

Ok, so this sample has a similar string but is missing the “version 1” text. It seems the PRONOM signature was created from samples which included “version 1” in the header of CompObj. Maybe when Microsoft learned they would be making a version 2, they decided a version number should be included going forward. Let’s take a look at a file from version 2 to compare:

hexdump -C PictureIt2-s01/\[1\]CompObj 
00000000  01 00 fe ff 03 0a 00 00  ff ff ff ff 50 28 72 2d  |............P(r-|
00000010  4b 8c d0 11 a9 6f 00 a0  c9 05 41 0d 28 00 00 00  |K....o....A.(...|
00000020  4d 69 63 72 6f 73 6f 66  74 20 50 69 63 74 75 72  |Microsoft Pictur|
00000030  65 20 49 74 21 20 76 65  72 73 69 6f 6e 20 32 20  |e It! version 2 |
00000040  50 69 63 74 75 72 65 00  27 00 00 00 7b 32 44 37  |Picture.'...{2D7|
00000050  32 32 38 35 30 2d 38 43  34 42 2d 31 31 44 30 2d  |22850-8C4B-11D0-|
00000060  41 39 36 46 2d 30 30 41  30 43 39 30 35 34 31 30  |A96F-00A0C905410|
00000070  44 7d 00 f4 39 b2 71 50  00 00 00 4d 00 69 00 63  |D}..9.qP...M.i.c|

Ok, so it looks like they did update the version string for version 2. This file also does not identify correctly. A quick look at the Wikipedia page for Microsoft Picture It! tells us they continued to release the software until version 10. Is there a different string for each version?
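The mismatch is easy to reproduce from the two hexdumps. The Python sketch below pastes in the opening bytes of each CompObj stream and compares what sits at offset 32 against the string the fmt/936 signature expects. The helper name is hypothetical and this only mimics the fixed-offset comparison, not how DROID or siegfried evaluate signatures internally.

```python
# Opening bytes of the two CompObj streams, copied from the hexdumps.
V1_COMPOBJ = bytes.fromhex(
    "01 00 fe ff 03 0a 00 00 ff ff ff ff 00 68 61 56"
    "54 c1 ce 11 85 53 00 aa 00 a1 f9 5b 1e 00 00 00"
    "4d 69 63 72 6f 73 6f 66 74 20 50 69 63 74 75 72"
    "65 20 49 74 21 20 50 69 63 74 75 72 65 00 27 00"
)
V2_COMPOBJ = bytes.fromhex(
    "01 00 fe ff 03 0a 00 00 ff ff ff ff 50 28 72 2d"
    "4b 8c d0 11 a9 6f 00 a0 c9 05 41 0d 28 00 00 00"
    "4d 69 63 72 6f 73 6f 66 74 20 50 69 63 74 75 72"
    "65 20 49 74 21 20 76 65 72 73 69 6f 6e 20 32 20"
    "50 69 63 74 75 72 65 00 27 00 00 00 7b 32 44 37"
)

FMT_936_STRING = b"Microsoft Picture It! version 1 Picture"


def matches_at_offset(stream, needle, offset=32):
    """True if needle sits exactly at the given byte offset of the stream."""
    return stream[offset:offset + len(needle)] == needle


print(matches_at_offset(V1_COMPOBJ, FMT_936_STRING))
print(matches_at_offset(V1_COMPOBJ, b"Microsoft Picture It! Picture"))
print(matches_at_offset(V2_COMPOBJ, b"Microsoft Picture It! version 2 Picture"))
```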

Diving into this and gathering many samples has brought a lot of variants to surface. Let’s see if we can list all the CompObj header variants.

Version 1 samples:
Picture It! Picture'{56616800-C154-11CE-8553-00AA00A1F95B}
Microsoft Picture It! Picture'{56616800-C154-11CE-8553-00AA00A1F95B}
Microsoft Picture It! version 1 Picture'{56616800-C154-11CE-8553-00AA00A1F95B}
Picture It! Collage'{56616800-C154-11CE-8553-00AA00A1F95B}

Version 2 samples:
Microsoft Picture It! version 2 Picture'{2D722850-8C4B-11D0-A96F-00A0C905410D}

Version 3 samples:
Microsoft Picture It! version 3 Picture'{18B8D020-B4FD-11D0-A97E-00A0C905410D}

Version 4 samples:
Microsoft Picture It! version 4 Picture'{18B8D020-B4FD-11D0-A97E-00A0C905410D}

PhotoDraw version 1 samples:
Microsoft PhotoDraw version 1 Picture'{18B8D020-B4FD-11D0-A97E-00A0C905410D}

PhotoDraw version 2 samples:
Microsoft PhotoDraw version 2 Picture'{18B8D021-B4FD-11D0-A97E-00A0C905410D}

FlashPix samples:
FlashPix Object({56616000-C154-11CE-8553-00AA00A1F95B}
FlashPix Object({56616800-C154-11CE-8553-00AA00A1F95B}
Picture It! FlashPix'{56616700-C154-11CE-8553-00AA00A1F95B}
LPI FlashPix'{56616700-c154-11ce-8553-00aa00a1f95b}
FlashPix_Object'{56616700-C154-11CE-8553-00AA00A1F95B}
'{56616700-C154-11CE-8553-00AA00A1F95B}
Picture It!'{56616700-c154-11ce-8553-00aa00a1f95b}
Flashpix Toolkit Application'{56616700-c154-11ce-0000-000000000000}

Ok, there is a lot to discuss here. First of all, it seems MIX was only used in Picture It! until version 5 (2001), then the Picture It! software used a new format, PNG Plus, to store the layered stacks. More on that in a future post! Some later versions do seem to be able to open the older MIX format, though. Version 4 of the MIX format seems to be the last, as the 2001 software had only version 4 files on it. It is probably safe to say only the 4 versions are needed for identification.

You may notice the additional unique identifier I included in each format. This is called a Class ID for the OLE format, which A LOT of formats use. Each “format” has a unique ID associated with it to help distinguish it from other formats. This Unique ID could possibly be a better solution for identification. It does cross over with the PhotoDraw format, but the FlashPix format seems to have a unique ID. With all the variations in the version 1 strings, the ID remains the same. For version 3 and 4 the ID is the same, which could mean they are interchangeable. It is also the same as PhotoDraw version 1. Not to complicate things.
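The Class ID is not only present as the braced text string. Judging by the hexdumps, the 16 bytes starting at offset 12 of the CompObj stream hold the same value in the usual little-endian GUID byte order, followed by a 4-byte length and the NUL-terminated name. That layout is my reading of these samples, so treat the offsets in this Python sketch as assumptions; the function name is also my own.

```python
import struct
import uuid

# Opening bytes of the version 1 CompObj stream, copied from the hexdump.
V1_COMPOBJ = bytes.fromhex(
    "01 00 fe ff 03 0a 00 00 ff ff ff ff 00 68 61 56"
    "54 c1 ce 11 85 53 00 aa 00 a1 f9 5b 1e 00 00 00"
    "4d 69 63 72 6f 73 6f 66 74 20 50 69 63 74 75 72"
    "65 20 49 74 21 20 50 69 63 74 75 72 65 00 27 00"
)


def parse_compobj(stream):
    """Return (class_id, user_type) from the start of a CompObj stream.

    Assumed layout: 12 header bytes, 16-byte Class ID, 4-byte little-endian
    length, then a NUL-terminated name of that length.
    """
    class_id = uuid.UUID(bytes_le=stream[12:28])
    (name_len,) = struct.unpack_from("<I", stream, 28)
    user_type = stream[32:32 + name_len].rstrip(b"\x00").decode("latin-1")
    return str(class_id).upper(), user_type


print(parse_compobj(V1_COMPOBJ))
```

Comparing the decoded ID rather than the name would sidestep the string variations, at the cost of the overlaps between Picture It! and PhotoDraw noted above.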

So it seems in order to get proper identification of these similar formats we need to:

Clean up version 1 identification for fmt/936
Add a signature for 2, 3, and 4
Add a version 2 signature for the PhotoDraw format
Add some additional signature variations for the FlashPix format.

The Class IDs could be used to distinguish different versions and formats, but many of the IDs are identical, which could mean they are the same format. For now we can just add the additional variation strings, which should identify everything. The FlashPix format needs more research as there are so many different variations and it’s so close to the MIX format. Take a look at my GitHub submission, maybe you have some additional variations to add?

Adobe Acrobat Capture

November 17, 2023 by Thor

During the recent PRONOM Research Week, I noticed a file format with no description and no signature.

x-fmt/217

Adobe ACD

All I had to go on was that it was an Adobe format and the acronym “ACD”. One of the first results that came up in a Google search was a post in the Adobe forums with someone asking what to do with some old ACD and ACI files they found on a disc, circa 2000, labeled “Adobe Capture”. The only thing I remember about Adobe Capture was some scanning tools related to Adobe Acrobat, but I didn’t remember coming across any ACD files related to Acrobat.

Initially it wasn’t easy to find more information on this format. Eventually I was able to narrow it down to stand-alone software Adobe released called “Adobe Acrobat Capture”. Originally released in 1995, it was eventually discontinued in 2010. The software was marketed under the ePaper name and connected to Acrobat through the creation of a PDF from scanned images. The software was compatible with many scanner models and would process the scanned images, run optical character recognition, and export to a searchable PDF. These tools are built into Adobe Acrobat today.

One of the reasons the software was being so elusive is the fact it was sold with a high price tag and required the use of a hardware key, or dongle, in order to process scans. The hardware key also managed the type of license you purchased which may limit the number of pages you are allowed to scan within a certain period of time. So the software is very difficult to run today, if you do happen to find a copy out there in Internet land.

In order to document these file formats for preservation purposes I needed to find some samples. I was excited to find a demonstration CD on the Internet Archive, but unfortunately it contained no examples of the ACD file format.

A little sleuthing on the Wayback Machine helped me find a few user guides and brochures. I was also able to find that there were three versions of Adobe Acrobat Capture. In a Product Brochure, you can see a screenshot of the software with a document open with the ACD extension.

If you are OCD like me you might have noticed the window in this screenshot is typical of the older Windows 3.1 or Windows NT system. So this was indeed an older product released by Adobe.

The Adobe Acrobat Capture 3.0 Demonstration CD-ROM from the Internet Archive luckily has a UserGuide PDF on the disc and was able to help me understand the ACD format a little more.

Looks like the ACD format is an intermediate format used by the software to manage the process between scanning and export to PDF. ACD was also defined as an “Acrobat Capture Document” which makes sense. They were also mentioned as being “multipage files in Acrobat Capture Document (ACD)”. The UserGuide also mentioned an ACP format which it referenced as “one-page files are in Acrobat Capture Page (ACP) format.” So more research is needed.

Let’s start with Adobe Acrobat Capture 2.0 as I managed to get a few samples from an installer I found. Here is a hexdump of an ACD file and its corresponding ACI file.

hexdump -C CONTRACT.ACD | head
00000000  02 04 47 47 c9 00 86 b5  01 00 b6 27 02 00 01 00  |..GG.......'....|
00000010  f5 00 5e 00 3b 96 02 00  01 6e 63 6a 00 00 88 68  |..^.;....ncj...h|
00000020  00 00 26 00 44 3a 5c 43  4f 44 45 5c 47 47 5c 50  |..&.D:\CODE\GG\P|
00000030  52 4f 44 55 43 54 2e 33  32 53 5c 49 4e 5c 63 6f  |RODUCT.32S\IN\co|
00000040  6e 74 72 61 63 74 2e 61  63 69 00 00 00 00 00 00  |ntract.aci......|
00000050  7c 33 c0 27 00 40 ff ff  ff 00 03 00 03 00 00 00  ||3.'.@..........|
00000060  00 00 00 00 00 00 40 00  00 00 00 00 00 03 00 00  |......@.........|
00000070  00 00 00 00 00 00 00 40  00 00 00 00 09 00 0a ab  |.......@........|
00000080  04 0b 14 b5 04 39 19 00  40 00 00 00 00 0c 14 b0  |.....9..@.......|
00000090  04 38 19 b0 04 08 00 0a  7f 06 d3 11 89 06 39 17  |.8............9.|

hexdump -C CONTRACT.ACI | head
00000000  49 49 2a 00 b3 0c 02 00  35 80 78 a0 80 35 c0 78  |II*.....5.x..5.x|
00000010  a4 80 35 40 3c 54 40 01  e2 b2 01 e2 b2 01 e2 b2  |..5@<T@.........|
00000020  01 e2 b2 01 e2 b2 01 e2  b2 01 e2 b2 01 e2 b2 01  |................|
00000030  e2 b2 01 e2 b2 01 e2 b2  01 e2 b2 01 e2 b2 01 e2  |................|
00000040  b2 01 e2 b2 01 e2 b2 01  e2 b2 01 e2 b2 01 e2 b2  |................|
00000050  01 e2 b2 01 e2 b2 01 e2  b2 01 e2 b2 01 e2 b2 01  |................|
00000060  e2 b2 01 e2 b2 01 e2 b2  01 e2 b2 01 e2 b2 01 e2  |................|
00000070  b2 01 e2 b2 01 e2 b2 01  e2 b2 01 e2 b2 01 e2 b2  |................|
00000080  01 e2 b2 01 e0 b0 01 e0  b0 01 e0 b0 01 e0 b0 01  |................|
00000090  e0 b0 01 e0 b0 01 e0 b0  01 e0 b0 01 e0 b0 01 e0  |................|

The ACD file is unique; PRONOM and even TrID were unaware of the format. But to the keen observer, the ACI format is very recognizable. You may have seen this header before: the “II*” at the start is the little-endian TIFF signature.

Let’s take a closer look at an ACI file to see if it is a true TIFF image or if there is any customization to the format.

tiffinfo CONTRACT.ACI 
=== TIFF directory 0 ===
TIFF Directory at offset 0x20cb3 (134323)
  Subfile Type: (0 = 0x0)
  Image Width: 2544 Image Length: 3295
  Resolution: 300, 300
  Bits/Sample: 1
  Compression Scheme: CCITT RLE
  Photometric Interpretation: min-is-white
  Samples/Pixel: 1
  Rows/Strip: 32
  Planar Configuration: single image plane
  Software: HALO Desktop Imager

exiftool -D CONTRACT.ACI 
    - ExifTool Version Number         : 12.60
    - File Name                       : CONTRACT.ACI
    - Directory                       : TUTORIAL/SAMPOUT
    - File Size                       : 134 kB
    - File Modification Date/Time     : 1995:07:10 16:02:08-06:00
    - File Access Date/Time           : 2023:11:14 15:41:02-07:00
    - File Inode Change Date/Time     : 2023:11:08 08:34:18-07:00
    - File Permissions                : -rwxrwxrwx
    - File Type                       : TIFF
    - File Type Extension             : tif
    - MIME Type                       : image/tiff
    - Exif Byte Order                 : Little-endian (Intel, II)
  254 Subfile Type                    : Full-resolution image
  256 Image Width                     : 2544
  257 Image Height                    : 3295
  258 Bits Per Sample                 : 1
  259 Compression                     : CCITT 1D
  262 Photometric Interpretation      : WhiteIsZero
  273 Strip Offsets                   : (Binary data 625 bytes, use -b option to extract)
  277 Samples Per Pixel               : 1
  278 Rows Per Strip                  : 32
  279 Strip Byte Counts               : (Binary data 448 bytes, use -b option to extract)
  282 X Resolution                    : 300
  283 Y Resolution                    : 300
  305 Software                        : HALO Desktop Imager
    - Image Size                      : 2544x3295
    - Megapixels                      : 8.4

Looks like a true TIFF image with no special tags or unique properties. They are 1-bit TIFFs compressed with CCITT RLE. Not sure there would be any need to create a special signature for these ACI files.
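The first eight bytes of the ACI hexdump are enough to confirm this by hand. A TIFF file starts with a two-byte byte-order mark (“II” or “MM”), the number 42, and a 4-byte offset to the first image directory. This small Python sketch (function name is mine) decodes those bytes, and the offset it produces can be compared with the directory offset tiffinfo reported above.

```python
import struct


def parse_tiff_header(header):
    """Return (byte_order, magic, first_ifd_offset) from the first 8 bytes."""
    order = header[:2]
    if order == b"II":
        endian = "<"  # little-endian (Intel)
    elif order == b"MM":
        endian = ">"  # big-endian (Motorola)
    else:
        raise ValueError("no TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(endian + "HI", header[2:8])
    if magic != 42:
        raise ValueError("TIFF magic number 42 not found")
    return order.decode("ascii"), magic, ifd_offset


# First 8 bytes of CONTRACT.ACI, copied from the hexdump.
aci_header = bytes.fromhex("49 49 2a 00 b3 0c 02 00")
print(parse_tiff_header(aci_header))
```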

Looking closer at the ACD file format, we can see they reference ACI files, so probably safe to assume the ACD file doesn’t contain the full raster data for each image:

hexdump -C Report.acd
00000000  02 04 47 47 c9 00 9a 8b  00 00 d4 ce 00 00 03 00  |..GG............|
00000010  f5 02 5f 00 00 61 01 00  01 6e 63 6a 01 00 30 5f  |.._..a...ncj..0_|
00000020  00 00 27 00 63 3a 5c 63  61 70 74 75 72 65 32 5c  |..'.c:\capture2\|
00000030  73 61 6d 70 6c 65 73 5c  6f 75 74 5c 52 65 70 6f  |samples\out\Repo|
00000040  72 74 5f 30 30 30 31 2e  61 63 69 00 00 01 00 00  |rt_0001.aci.....|
00000050  00 00 00 00 00 00 00 00  00 00 e8 03 00 00 01 00  |................|
00000060  01 00 00 00 00 00 00 00  00 00 08 00 52 65 70 6f  |............Repo|
00000070  72 74 30 31 00 00 00 00  70 33 d8 27 00 40 ff ff  |rt01....p3.'.@..|
*
00005f40  07 00 40 6f 00 09 00 40  01 6e 63 6a 02 00 52 2c  |..@o...@.ncj..R,|
00005f50  00 00 27 00 63 3a 5c 63  61 70 74 75 72 65 32 5c  |..'.c:\capture2\|
00005f60  73 61 6d 70 6c 65 73 5c  6f 75 74 5c 52 65 70 6f  |samples\out\Repo|
00005f70  72 74 5f 30 30 30 32 2e  61 63 69 00 00 00 00 00  |rt_0002.aci.....|
00005f80  00 00 00 00 4e 0c fe ff  ff ff e8 03 00 00 01 00  |....N...........|
00005f90  01 00 00 00 00 00 00 00  00 00 08 00 52 65 70 6f  |............Repo|
00005fa0  72 74 30 32 00 00 00 00  4c 31 f0 27 00 40 ff ff  |rt02....L1.'.@..|

From the limited sample set I have access to, all the ACD files begin with the same hex values, “02044747C900”. Along with the common header we can assume there should be at least one ACI file referenced in the first part of the file. Because it is referenced as a filepath, the ACI string would be variable in its offset.
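As a rough illustration of that idea, here is a Python sketch that checks for the six-byte prefix and then searches for an “.aci” reference near the start of the file. The 4096-byte search window and the function name are arbitrary choices of mine, and the whole test is only a heuristic drawn from the few samples shown here.

```python
ACD2_MAGIC = bytes.fromhex("02044747C900")


def looks_like_acd2(data, search_window=4096):
    """Heuristic: fixed 6-byte prefix plus an '.aci' path near the start."""
    if not data.startswith(ACD2_MAGIC):
        return False
    # Lower-case the window so 'contract.aci' and 'CONTRACT.ACI' both match.
    return b".aci" in data[:search_window].lower()


# First 80 bytes of CONTRACT.ACD, copied from the hexdump.
contract_acd = bytes.fromhex(
    "02 04 47 47 c9 00 86 b5 01 00 b6 27 02 00 01 00"
    "f5 00 5e 00 3b 96 02 00 01 6e 63 6a 00 00 88 68"
    "00 00 26 00 44 3a 5c 43 4f 44 45 5c 47 47 5c 50"
    "52 4f 44 55 43 54 2e 33 32 53 5c 49 4e 5c 63 6f"
    "6e 74 72 61 63 74 2e 61 63 69 00 00 00 00 00 00"
)
print(looks_like_acd2(contract_acd))
```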

Adobe Acrobat Capture 3.0 turns out to be a different format. But looks familiar………

hexdump -C Contract.acd | head
00000000  50 4b 03 04 14 00 00 00  08 00 3b ba 6e 57 23 9d  |PK........;.nW#.|
00000010  8e b8 3d 00 00 00 3e 00  00 00 09 00 40 00 46 49  |..=...>.....@.FI|
00000020  4c 45 53 2e 4c 53 54 0a  00 20 00 00 00 00 00 00  |LES.LST.. ......|
00000030  00 00 00 80 e6 e9 ca 50  17 da 01 80 e6 e9 ca 50  |.......P.......P|
00000040  17 da 01 80 e6 e9 ca 50  17 da 01 4e 55 18 00 4e  |.......P...NU..N|
00000050  55 43 58 09 00 46 00 49  00 4c 00 45 00 53 00 2e  |UCX..F.I.L.E.S..|
00000060  00 4c 00 53 00 54 00 8b  76 74 76 31 8c e5 e5 f2  |.L.S.T..vtv1....|
00000070  0c 76 f6 f7 0d f0 0f f6  0c 71 b5 0d 09 0a 75 e5  |.v.......q....u.|
00000080  e5 f2 0b f5 75 f3 f4 71  0d b6 35 e4 e5 02 31 fc  |....u..q..5...1.|
00000090  1c 7d 5d 0d 6d 9d f3 f3  4a 8a 12 93 4b f4 12 93  |.}].m...J...K...|

sf Contract.acd 
---
siegfried   : 1.10.1
scandate    : 2023-11-15T09:10:01-07:00
signature   : default.sig
created     : 2023-10-11T15:10:17-06:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V114.xml; container-signature-20230822.xml'
---
filename : 'Contract.acd'
filesize : 79002
modified : 2023-11-14T23:17:53-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/263'
    format  : 'ZIP Format'
    version : 
    mime    : 'application/zip'
    basis   : 'byte match at [[0 4] [78886 3] [78980 4]]'
    warning : 'extension mismatch'

Yep, it’s a ZIP container file. Let’s take a peek inside to see what it is composed of.

7z l Contract.acd 
--
Path = Contract.acd
Type = zip
Physical Size = 79002

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2023-11-14 23:17:54 ....A           62           61  FILES.LST
2023-11-14 23:17:54 ....A          410          226  Contract.acd
2023-11-14 23:17:52 ....A       150213        78093  Contract.acp
------------------- ----- ------------ ------------  ------------------------
2023-11-14 23:17:54             150685        78380  3 files

The Contract ACD file is like a nesting doll, an ACD within an ACD. Let’s see what the inner ACD and ACP are made of.

hexdump -C Contract.acd | head
00000000  00 01 00 00 00 02 04 47  47 2d 01 9a 01 00 00 02  |.......GG-......|
00000010  00 00 00 02 00 01 01 00  00 00 01 00 00 00 04 04  |................|
00000020  00 00 00 09 00 57 69 6e  67 64 69 6e 67 73 05 00  |.....Wingdings..|
00000030  41 72 69 61 6c 0b 00 43  6f 75 72 69 65 72 20 4e  |Arial..Courier N|
00000040  65 77 0f 00 54 69 6d 65  73 20 4e 65 77 20 52 6f  |ew..Times New Ro|
00000050  6d 61 6e 05 01 00 00 00  02 00 00 00 78 01 00 00  |man.........x...|
00000060  0f 00 54 69 6d 65 73 20  4e 65 77 20 52 6f 6d 61  |..Times New Roma|
00000070  6e 00 00 00 20 0b 00 00  c0 0a 00 00 00 00 00 00  |n... ...........|
00000080  00 06 00 00 00 0f 00 54  69 6d 65 73 20 4e 65 77  |.......Times New|
00000090  20 52 6f 6d 61 6e 00 00  00 20 0c 00 00 00 0c 00  | Roman... ......|

hexdump -C Contract.acp | head
00000000  25 50 44 46 2d 31 2e 33  0d 25 e2 e3 cf d3 0d 0a  |%PDF-1.3.%......|
00000010  31 20 30 20 6f 62 6a 0d  3c 3c 20 0d 2f 54 79 70  |1 0 obj.<< ./Typ|
00000020  65 20 2f 43 61 74 61 6c  6f 67 20 0d 2f 50 61 67  |e /Catalog ./Pag|
00000030  65 73 20 32 20 30 20 52  20 0d 2f 53 74 72 75 63  |es 2 0 R ./Struc|
00000040  74 54 72 65 65 52 6f 6f  74 20 34 20 30 20 52 20  |tTreeRoot 4 0 R |
00000050  0d 2f 43 41 50 54 5f 49  6e 66 6f 20 3c 3c 20 2f  |./CAPT_Info << /|
00000060  56 20 33 30 31 20 2f 46  53 20 5b 20 28 57 69 6e  |V 301 /FS [ (Win|
00000070  67 64 69 6e 67 73 29 28  41 72 69 61 6c 29 28 43  |gdings)(Arial)(C|
00000080  6f 75 72 69 65 72 20 4e  65 77 29 28 54 69 6d 65  |ourier New)(Time|
00000090  73 20 4e 65 77 20 52 6f  6d 61 6e 29 5d 20 2f 4c  |s New Roman)] /L|

The ACD has some of the same hex values as the previous version, but with some extra bytes at the beginning, and it looks like the ACP is a straight up PDF, though it may have some interesting tags, like “CAPT_Info”.

The problem we will face when trying to write a signature for this version of ACD is that the container signature needs a static file name to reference, and it appears the name of the container is also the name of the ACD file within the container. So every file will be different. I wish there was a way in the PRONOM signature syntax to reference an extension and ignore the filename, but currently there is no method to do this. The only thing inside the container which seems to be consistent is the file “FILES.LST”. So let’s take a peek inside of it.

hexdump -C FILES.LST | head
00000000  5b 41 43 44 31 5d 0d 0a  49 53 43 4f 4d 50 4f 53  |[ACD1]..ISCOMPOS|
00000010  49 54 45 3d 54 52 55 45  0d 0a 4e 55 4d 46 49 4c  |ITE=TRUE..NUMFIL|
00000020  45 53 3d 31 0d 0a 46 49  4c 45 4e 41 4d 45 31 3d  |ES=1..FILENAME1=|
00000030  43 6f 6e 74 72 61 63 74  2e 61 63 70 0d 0a        |Contract.acp..|

Ok, there seems to be some static information that is unique to the ACD format. I bet the string “[ACD1]” would be sufficient to make a solid signature.
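Outside of PRONOM, the same test is simple to try with Python’s standard zipfile module. This is only a sketch of the idea (open the ZIP, find FILES.LST, compare its first six bytes); the function name is hypothetical and the demo archive below is fabricated in memory with the contents shown in the hexdump, not taken from a real Capture file.

```python
import io
import zipfile


def looks_like_acd3(fileobj):
    """True if the ZIP holds a FILES.LST whose first bytes are '[ACD1]'."""
    try:
        with zipfile.ZipFile(fileobj) as archive:
            with archive.open("FILES.LST") as listing:
                return listing.read(6) == b"[ACD1]"
    except (zipfile.BadZipFile, KeyError):
        # Not a ZIP at all, or a ZIP without a FILES.LST member.
        return False


# Fabricated stand-in archive that mirrors the listing above.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr(
        "FILES.LST",
        "[ACD1]\r\nISCOMPOSITE=TRUE\r\nNUMFILES=1\r\nFILENAME1=Contract.acp\r\n",
    )
    archive.writestr("Contract.acp", b"%PDF-1.3\r")
demo.seek(0)
print(looks_like_acd3(demo))
```

For a file on disk, `looks_like_acd3("Contract.acd")` works the same way since `zipfile.ZipFile` accepts a path.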

This is a good example of a format from a well known company that has become obsolete and disappeared, leaving only a limited amount of information behind. Take a look at my signatures, maybe you have some old ACD files you were unaware of!

Composite File Management System

November 3, 2023 by Thor

In honor of World Digital Preservation Day, I wanted to write a little about format headers, the magic that makes some files more easily identifiable than others.

When it comes to binary file formats, some developers decide to make the format clearly identifiable in a header and others choose to make it ambiguous. Others have a little fun with leaving little clues and references to popular culture.

A couple of my favorites based on their header.

Early CoolEdit / Audition files began with the string “COOLNESS”.
Medi8or format with string “MatchWare Medi8or Version 3.00”
MacCaption with string “File Format=MacCaption_MCC V2.0”
HyperWriter format with string “HyperWriter!”
ExpressPublisher and AnFX Java Movie with hex values “CAFEBEEF”
TIFF format which has at bytes 2-3 “an arbitrary but carefully chosen number (42)”

A couple of my current least favorites:

MP3 format, which can have no header just frames and which clash with everything.
Canvas format which the early versions (CVS) have no standard header.
Leica Cyclone PTS format with just point cloud data, no headers.
Adobe Flash (FLA) later versions where the ZIP container is non standard and throws a Central Directory error.

Like I said some developers make it very obvious what software created the file format and others seem to make things difficult. I understand there is a need to optimize files to keep them from getting bloated and taking up too much space, but many of the size limits from the early days of computing are not an issue anymore. Can’t we be more clear when designing a file format?

Today I want to document one format which was very easy to identify as it spelled out its format very verbosely, but the lack of additional documentation makes it very hard to preserve.

Meet the Composite File Management System file format:

hexdump -C sample.br4
00000000  43 43 6d 46 20 2d 20 55  6e 69 76 65 72 73 61 6c  |CCmF - Universal|
00000010  20 2d 20 41 78 69 6f 6d  20 2d 20 41 47 50 20 2d  | - Axiom - AGP -|
00000020  20 43 6f 6d 70 6f 73 69  74 65 20 46 69 6c 65 20  | Composite File |
00000030  4d 61 6e 61 67 65 6d 65  6e 74 20 53 79 73 74 65  |Management Syste|
00000040  6d 20 28 55 6e 69 76 65  72 73 61 6c 29 20 2d 20  |m (Universal) - |
00000050  43 72 65 61 74 65 64 20  62 79 20 41 6e 64 72 65  |Created by Andre|
00000060  61 20 50 65 73 73 69 6e  6f 2c 20 44 65 63 65 6d  |a Pessino, Decem|
00000070  62 65 72 20 31 39 39 35  20 28 76 65 72 73 2e 20  |ber 1995 (vers. |
00000080  35 29 20 2d 20 43 6f 70  79 72 69 67 68 74 28 63  |5) - Copyright(c|
00000090  29 20 31 39 39 35 2d 39  36 20 62 79 20 4d 65 74  |) 1995-96 by Met|
000000a0  61 54 6f 6f 6c 73 2c 20  49 6e 63 2e 20 2d 20 50  |aTools, Inc. - P|
000000b0  72 6f 75 64 6c 79 20 6d  61 64 65 20 69 6e 20 74  |roudly made in t|
000000c0  68 65 20 55 53 41 2c 20  6c 61 6e 64 20 6f 66 20  |he USA, land of |
000000d0  74 68 65 20 66 72 65 65  2c 20 68 6f 6d 65 20 6f  |the free, home o|
000000e0  66 20 74 68 65 20 62 72  61 76 65 2e 00 00 00 00  |f the brave.....|
000000f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

Where to start? First off, this is the Bryce 4 file format. Bryce was 3D modeling and animation software developed by MetaTools, later MetaCreations. MetaCreations was also the developer of popular software such as Ray Dream Studio/Infini D, Fractal Design Painter, and Kai’s Power Tools.

Secondly, this format refers to a Composite File Management System (Universal), or CCmF, which I have found to be the file format behind many other extensions, some of which are .goo, .brc, .br3, .br4, .br5, .sfp, .shp, and .obp. It doesn’t always have the verbose header; some files have the following:

hexdump -C Tutorial.obp | head
00000000  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
*
00000050  20 20 20 20 20 20 20 20  20 20 20 20 20 20 43 43  |              CC|
00000060  6d 46 69 6c 65 3a 3a 6b  49 64 65 6e 74 69 66 79  |mFile::kIdentify|
00000070  34 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |4               |
00000080  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |

Different, but still contains the CCmF identification string. Others have the verbose header, but further down inside the file.

With this format being used by so many well-known software titles, I assumed information on the format would be readily available. Alas, not so much. The format even had the name of the creator! “Created by Andrea Pessino, December 1995”. So I reached out. He was on Twitter, and I asked about the file format and whether any documentation was available. His responses have since disappeared from Twitter (X) after he deleted his account, but he told me he wasn’t sure where the documentation might be. One other developer also commented and confirmed they didn’t know where any of the documentation went after they left.

MetaCreations sold Bryce to Corel in 2000, and Corel sold it to Daz3D, the current owner, in 2004. It’s not actively developed anymore, as it was never made into a 64-bit application. A blog post explains the format a little more, but concludes it is a secret known only to Daz.

It seems there is a community who would like to see Bryce more open, maybe even open-sourced. This thread discusses the format and the underlying Axiom format used.

The creator Andrea Pessino was able to track down some documentation on the CCmF file structure for me. He explained Axiom was an entire codebase for all MetaTools/Creations applications and plugins. So the CCmF system was more than a file format. The documentation included some information on versioning of a CCmF.

There seem to be a few versions of the CCmF file structure:

CCmFile::kIdentify which corresponds with December 1995 (vers. 5)
CCmFile::kIdentify2 which corresponds with March 1997 (vers. 7)
CCmFile::kIdentify3 which corresponds with October 1998 (vers. 9)
CCmFile::kDfFormat which is a Generic Composite File

The documentation given to me was up to date for 1998, but after Corel purchased Bryce there were some updates made, as many material files have the identifier “CCmFile::kIdentify4“.
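The two header styles shown in the hexdumps suggest a simple way to flag candidate CCmF files. What follows is only a minimal Python sketch and not an official signature: the function name `find_ccmf_markers` and the 4096-byte scan window are assumptions of mine, and because some files carry the verbose header further down, a larger window may be needed in practice.

```python
import re

# Marker strings copied from the hexdumps and the identifier list above.
VERBOSE_MARKER = b"CCmF - Universal"
SHORT_MARKER = re.compile(rb"CCmFile::k(Identify\d*|DfFormat)")


def find_ccmf_markers(data: bytes, window: int = 4096) -> dict:
    """Report which CCmF marker strings appear in the first `window` bytes.

    Hypothetical helper for illustration; offsets are -1 when a marker
    is not found.
    """
    head = data[:window]
    short = SHORT_MARKER.search(head)
    return {
        "verbose_offset": head.find(VERBOSE_MARKER),
        "short_marker": short.group(0).decode("ascii") if short else None,
        "short_offset": short.start() if short else -1,
    }
```

Calling it on the first few kilobytes of a file, for example `find_ccmf_markers(open("Tutorial.obp", "rb").read(4096))`, returns the identifier string found, which is enough to sort samples by header style before looking any deeper.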

Bryce 6 & 7 were released by Daz3D and have a different file header. They have the extension .BR6 & .BR7 with the header:

hexdump -C Bryce7-s01.br7 | head  
00000000  42 72 79 63 65 5f 36 2e  30 5f 46 69 6c 65 00 00  |Bryce_6.0_File..|
00000010  11 00 00 00 d4 07 00 00  00 20 00 00 e5 07 00 00  |......... ......|
00000020  00 0a 00 00 00 10 00 00  00 08 78 9c 63 64 60 60  |..........x.cd``|
00000030  60 04 e2 8c cc f4 0c 85  e4 9c fc d2 14 85 92 d4  |`...............|
00000040  8a 92 d2 a2 54 86 11 05  18 a1 18 04 82 76 c8 b5  |....T........v..|
00000050  be 0e 7c 60 8f 4e 93 67  f2 07 32 f5 d1 0e 30 31  |..|`.N.g..2...01|
00000060  40 fc ca 0c c5 60 bf 33  a2 ab da e2 8c c0 70 e0  |@....`.3......p.|
00000070  00 22 58 a0 9c ff 2a 40  fc bf 16 88 ff c3 c3 2e  |."X...*@........|
00000080  13 64 20 83 82 13 50 29  50 ad 17 50 ef 3c 20 ce  |.d ...P)P..P.< .|
00000090  72 66 64 86 19 31 cd 09  42 57 b9 80 71 43 9d 0b  |rfd..1..BW..qC..|

I still need to gather more samples from the various extensions related to this format and the software related to them. More work to do understanding the different uses of the short CCmFile string and the more detailed header and the differences between objects, materials, and models. When I asked Andrea why he used such a verbose file header, his answer was basically, why not!

Apple Mail

October 27, 2023 by Thor

There really is no “Macintosh Format”, but there sure are a lot of formats you only find on the MacOS. From Resource Forks and iWork formats to unique sound formats, MacOS has them all! The majority of cross-platform software vendors have done a much better job in recent years of making their file formats the same across platforms, but Apple loves to make things unique, just for its platform.

Take EMLX for example. There seems to be a trend of adding “X” to the end of an older format to breathe new life into it. The EML format, or Electronic Mail, has existed for a few decades now, but in 2005 Apple updated their Apple Mail application to use a new format, EMLX.

As far as I know, Apple hasn’t released any documentation on the EMLX format, but many folks out there have asked the question and have been able to “reverse engineer” the format. Let’s take a look.

An EMLX file consists of three parts:

bytecount on first line;
email content in MIME format (headers, body, attachments);
Apple property list (plist) with metadata.

The byte count is a variable number: the total number of bytes from the start of the MIME content, including any HTML, to the start of the XML property list. Let’s look at a simple EMLX.

The byte count is on line 1, with the MIME email (EML) taking up the 556 bytes, followed by the XML plist at the end. You may ask, what is a plist? Well, it is another Apple (originally NeXTSTEP) invention which is embedded throughout the MacOS operating system. A plist is usually XML with keys, but can also be in a binary format. The plist can contain properties of the email within Apple Mail, like special color flags, tagged as junk, date received and last reviewed.
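The three-part layout maps quite directly onto code. Here is a minimal Python sketch, assuming the byte count is plain ASCII digits on the first line and that the plist starts immediately after the counted bytes; the function name `split_emlx` is my own, and real-world files may have quirks this does not handle.

```python
import plistlib
from email import message_from_bytes


def split_emlx(raw: bytes):
    """Split EMLX bytes into (message, metadata) following the three parts.

    Hypothetical helper: assumes a numeric first line, then that many bytes
    of MIME content, then an Apple property list.
    """
    first_line, _, rest = raw.partition(b"\n")
    byte_count = int(first_line.decode("ascii").strip())  # part 1
    mime_bytes = rest[:byte_count]                         # part 2: plain EML
    plist_bytes = rest[byte_count:].strip()                # part 3: metadata
    message = message_from_bytes(mime_bytes)
    metadata = plistlib.loads(plist_bytes) if plist_bytes else {}
    return message, metadata
```

Writing `mime_bytes` out to a new file is effectively a conversion to a plain EML, while `metadata` keeps the Apple Mail properties that would otherwise be lost.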

If you do happen across an EMLX file or group of them, there are a few tools you can use to convert them to a plain old EML. There are Python libraries and many other tools to do the job.

But first we need to be sure of identification beyond the extension. Adding this file format to PRONOM would help in identification for preservation purposes. If run against PRONOM today we get:

filename : '9.emlx'
filesize : 18582
modified : 2023-10-26T22:16:25-06:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/950'
    format  : 'MIME Email'
    version : '1.0'
    mime    : 'message/rfc822'
    class   : 'Text (Structured)'
    basis   : 'byte match at [[31 17] [599 4] [339 6] [426 6] [90 14]]'
    warning : 'extension mismatch'

Because the format has an EML plain text message within its structure, it is assumed to be an EML file. While that is technically accurate, identifying it as a unique EMLX format would be beneficial in a preservation system, so you can properly assign risk and choose the right tool to parse or migrate.

In looking at the three parts of an EMLX format, we know the EML file is not a good way to show the difference as they are the same structure. The byte count on the first line is variable, so there is no static byte sequence to use for identification. That leaves the Plist section at the end to distinguish the difference.

The PRONOM entry for a Plist looks for the typical XML strings present in most XML files, but then uses the root element <plist version="1.0"> for identification. We could combine the existing EML signature and the Plist signature to identify an EMLX, or just take the existing EML signature and add a small byte sequence for the closing </plist> tag near the EOF. Either way the new signature would need priority over EML, and both approaches would essentially accomplish the same thing.
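To get a feel for the second idea outside of PRONOM, here is a rough Python sketch of the same logic. It is a heuristic of my own and not the PRONOM signature itself: the function name `looks_like_emlx`, the numeric first-line check, and the 64-byte tail window are all assumptions.

```python
import re


def looks_like_emlx(raw: bytes, tail: int = 64) -> bool:
    """Heuristic: numeric first line plus a closing </plist> tag near EOF.

    Hypothetical check for illustration only.
    """
    first_line, sep, rest = raw.partition(b"\n")
    if not sep or not re.fullmatch(rb"\d+\s*", first_line):
        return False
    # The byte count cannot be larger than what follows the first line.
    if int(first_line.decode("ascii").strip()) > len(rest):
        return False
    return b"</plist>" in raw[-tail:]
```

A plain EML fails the first-line test, which is the same kind of priority the signature would need so that an EMLX is not reported as a MIME email.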

Take a look at the latter idea on my GitHub page and tell me which makes the most sense.

No bad deed….

October 13, 2023 by Thor

I had access to my first Macintosh computer around 1987. My father brought it home and I spent hours on it playing games and occasionally writing reports for school. The Macintosh Plus computer had one floppy drive and no hard drive. I remember playing the game Orbiter which had two floppy disks and right in the middle of game play it would pause and ask me to insert disk 2, then quickly ask for disk 1 again. The struggle was real. I spent years using many different Macintosh computers and now own more than I wish to admit. I’m preserving them!

The wild world of digital preservation has been a little lacking on the Macintosh side of things, as I have come to realize. There is still not a great way to manage Resource Forks in many preservation systems, and the identification tools are mainly focused on the data bytestreams and not the system-specific attributes the Macintosh often used.

The PRONOM registry has either referenced early Macintosh specific formats or missed them entirely so I have been slowly working on a few to close that gap.

Interestingly enough, many Microsoft programs initially made their GUI debuts on the early Macintosh before making their way to Windows. Excel is one I am working on, as version 1, which was Macintosh-only at the time, is not identifiable in PRONOM.

Another is PowerPoint; I recently submitted two new signatures to PRONOM.

fmt/1747: Microsoft PowerPoint Presentation v2.x. Full entry added.
fmt/1748: Microsoft PowerPoint Presentation v3.x. Full entry added.
fmt/1866: Microsoft Powerpoint for Macintosh v.2. Full entry added.
fmt/1867: Microsoft Powerpoint for Macintosh v.3. Full entry added.

PowerPoint was initially released in 1987 on the Macintosh platform. It was developed by a company called Forethought. Version 1.0 on the Macintosh was released under this name, and the company was bought by Microsoft only three months later. The history of PowerPoint can be found on the website of Robert Gaskins, one of the original developers, and in the book he wrote. The available information provided by Microsoft is only for the OLE format, covering versions 4.0 until 2003.

So, let’s take a look at the original PowerPoint file format, before OLE.

   Type/Creator      RF      DF  Date         Filename
f  SLDS/PPNT         0       932 Oct 10 19:10 PowerPoint-v1

Luckily the early PowerPoint files did not have a Resource Fork. The Data Fork, if you haven’t noticed, has an interesting set of hex values at the beginning of the file: the first 4 bytes are 0BADDEED. Now compare a PowerPoint version 2 file from Windows.

The file format is the same, but because of the weird world of endianness, the first few bytes are in reverse order, EDDEAD0B.
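The byte-order difference is easy to demonstrate. Below is a small Python sketch that reads the first four bytes both ways and reports which order yields 0x0BADDEED; the function name `early_ppt_byte_order` is hypothetical, and a match says nothing about which PowerPoint version wrote the file.

```python
import struct

PPT_MAGIC = 0x0BADDEED


def early_ppt_byte_order(header: bytes):
    """Return 'big', 'little', or None based on the first four bytes."""
    if len(header) < 4:
        return None
    if struct.unpack(">I", header[:4])[0] == PPT_MAGIC:
        return "big"      # bytes 0B AD DE ED, as in the Macintosh sample
    if struct.unpack("<I", header[:4])[0] == PPT_MAGIC:
        return "little"   # bytes ED DE AD 0B, as in the Windows sample
    return None
```

This is the same 32-bit value stored in two byte orders, which is why a signature needs both byte sequences to cover Macintosh and Windows files.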

Obviously we need to discuss this magic number and the meaning behind “Bad Deed”. This question was asked previously by the digital preservation community. I have a previous blog post about the use of words for the magic number CAFEBEEF as it was used with Java class files and Express Publisher in the 1990s. BADDEED looks like another clever use of hex values that form words. But was there a story behind the words? Joe Carrano asked if this string might be hexspeak. I wanted to know more, so I asked someone who might know.

Robert Gaskins was kind enough to chat with me for a bit about the early days of PowerPoint.

I had a theory on the possible meaning behind BADDEED, so I asked him what the feeling was like between Apple and Microsoft at the time. I had heard for years that PowerPoint was originally created for the Macintosh, but Robert informed me:

In fact, PowerPoint was designed first for Microsoft Windows, and its first spec shows that: “All the screen shots, menus, and dialogs were set up to look like Microsoft Windows, not like Macintosh.” (Gaskins, Sweating Bullets, p. 92) You can see that spec here.

A year later, we concluded that we would be forced to ship on Mac first, although we still thought that Windows was the big opportunity and thought that Mac was risky. “We just didn’t think we could successfully ship a product for Windows, yet, though we planned to later.” (Gaskins, Sweating Bullets, p. 105) The considerations are summarized in my June 1986 product marketing document.

Of course, we turned out to have been right all along. PowerPoint on Mac was much loved, but sales remained poor because Mac sales were so poor. It was only after we shipped on Windows that PowerPoint gained the dominant market share which has characterized it ever since, and Windows PPT outsold Mac PPT very quickly. (Gaskins, Sweating Bullets, p. 403)

So my original thought was that there were some bad feelings around the Apple and Microsoft battle, which has been the sentiment for quite some time. So when I asked if any of that influenced the use of BADDEED, I was told:

So, far from being disgruntled by expanding PowerPoint to Windows, that had been our goal all along, and its achievement was the most important success we had.

I judge that you are fully aware of all that, and that your question is more, “was there any bad deed signified by the Mac hex value chosen?” No, it was just the poverty of choice when you only have six letters.

So there you have it. The hex value 0x0BADDEED was simply chosen from the limited set of words hexadecimal can spell. I guess I should never let the truth get in the way of a good story.

I continued to have a wonderful conversation with Robert and also asked him for some details on the rest of the PowerPoint file format. I was hoping there might be some documentation out there explaining the early format before Microsoft took over. Robert said:

I don’t know of any such documentation apart from the official Microsoft support files available online. I don’t have any such information. I know that Dennis Austin deposited some of our working files at the Computer History Museum (not online):

https://archive.computerhistory.org/resources/access/text/finding-aids/102733943-Austin/102733943-Austin.pdf

and it’s likely that some information is there–if nothing else, it claims to contain a source code listing for PPT 1.0 which would contain the code to read the file format.

So there might be some information at the Computer History Museum worth looking into.

As far as I could tell from the available online information, there are a few differences between Version 1.0 and Version 2.0, the biggest being that 1.0 did not have an option to print in color, among a few other minor things. Here is a screenshot of a page from the Microsoft PowerPoint 2.0 documentation on archive.org.

I suppose with the signature additions of the Macintosh and Windows versions 2.0 and 3.0 of the PowerPoint file format in PRONOM, that should cover most needs. Currently my PowerPoint 1.0 files identify as 2.0 files, so I may need to have them adjust the PUID to include both versions 1.0 and 2.0, as they are so similar. If I am able to find a difference or get my hands on the original source code I may find a better solution.

BINHEX

September 29, 2023 by Thor

Working with files in today’s world is much different than before. Today getting files back and forth from the cloud or through email is relatively easy, unlike the early days when we used FTP sites and needed to encode our data to transfer it properly. I remember using an FTP program on my old Mac called Fetch. We had to determine if the content was to be transferred as text or binary.

Picking the right encoding was critical to getting the content transferred correctly. This was even more critical when working with Macintosh files, which needed a resource fork and/or Finder attributes to work properly. In those cases a MacBinary or BinHex file was required! Fetch would automatically identify those formats and decode them for you.

If you need a refresher on MacBinary and AppleSingle, you can view my iPres 2022 presentation.

One format I didn’t spend much, if any, time on is the BinHex format. BinHex was a format born out of necessity to move files back and forth across the World Wide Web, bulletin boards, AOL, CompuServe, and the like. The glossary for the FTP program Fetch describes BinHex as:

BinHex (sometimes called BinHex4) is a format for representing a Macintosh file in text form. The Macintosh file is converted to a series of lines, each made up of letters, numbers, and punctuation. Because BinHex files are simply text, they can be sent through most electronic mail systems and stored on most computers. However the conversion to text makes the file larger, so it takes longer to transmit a file in BinHex format than if the file was represented some other way. The suffix “.hqx” usually indicates a BinHex format file.

You can still find many of these HQX files floating around the interwebs and on older CDs from the 1990s. One such CD recently came into my possession. I managed to get a copy of the book “Internet File Formats“, by Tim Kientzie. It came with a CD-ROM with lots of goodies included: some sample files, specifications, and software. The disc itself is an ISO 9660 partitioned disc, but includes a few Macintosh formats, so the author put many of the software files in the HQX format to maintain the resource fork Macintosh applications need in order to run.

I initially ran the whole disc through DROID to get an idea of what was on the disc and whether any sample formats were unidentified (something I do regularly), and found the majority of the HQX files didn’t identify as PRONOM PUID x-fmt/416 as they should have. The signature is an older one, from 2010, but since the format isn’t updated anymore it should be solid. Or so I thought.

Since BinHex files are encoded as text, let’s take a look at a couple of these from the disc which didn’t identify.

The PRONOM signature currently is:

File extension: hqx	
Name	BinHex Binary Text
Description	Header: (This file must be converted with BinHex
Byte sequences	
Position type	Absolute from BOF
Offset	0	 
Value	28546869732066696C65206D75737420626520636F6E76657274656420776974682042696E486578

That “Value” listed in hexadecimal decodes to “(This file must be converted with BinHex”, as listed in the description. We can see this line in the file above, but the signature assumes the value begins at offset 0 from the beginning of the file. So it’s looking for that value at the very start of the file, but this file has some additional text before the value. What does the spec say?

The BinHex 4.0 format was created in 1985 and defined in RFC 1741.

   The whole file is considered as a stream of bits.  This stream will
   be divided in blocks of 6 bits and then converted to one of 64
   characters contained in a table.  The characters in this table have
   been chosen for maximum noise protection.  The format will start
   with a ":" (first character on a line) and end with a ":".
   There will be a maximum of 64 characters on a line.  It must be
   preceded, by this comment, starting in column 1 (it does not start
   in column 1 in this document):

    (This file must be converted with BinHex 4.0)

   Any text before this comment is to be ignored.

   The characters used is:

    !"#$%&'()*+,- 012345689@ABCDEFGHIJKLMNPQRSTUVXYZ[`abcdefhijklmpqr

Ok, so in the specs we can see the “Value” string must be there, but according to the specification, any text before this comment is to be ignored. So adding some instructions and even an email header at the beginning is ok, as long as the value string is there right before the encoded data.

We also learn a couple of interesting things. The first character of the first line after the string should be a “:” and the last line should end with a “:” as well. That could help make the signature more solid. We also learn there is a maximum of 64 characters per line. The last line will probably not be the full maximum length, but the previous lines should be. I wonder if we could use this fixed position from the initial “:” to add even more strength to the signature.

So an updated PRONOM signature might look like:

BOF: {0-4084}28546869732066696C65206D75737420626520636F6E76657274656420776974682042696E486578{6-9}3A

EOF: 3A (Max Offset 64)

Allowing up to 4,084 bytes at the beginning leaves room for additional text. This value worked for my samples, but there could be others out there with more. The {6-9} bytes in between the string and the colon account for the various ways newlines are encoded. Sometimes it is one “0A” byte, other times it is “0D”, and other times it is both. After testing, I found that adding values to the signature to account for the 64-byte line can fail if the file has only one line, so I left it out.

The EOF should just be the colon (3A), but I found many of my samples had various line endings and other random characters. Hence the 64 bytes for max offset.
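To make the proposed signature concrete, here is a rough Python approximation of it. This is a sketch of my reading of the BOF and EOF sequences above and not how DROID evaluates signatures internally; the function name `matches_binhex_signature` and its parameter names are assumptions.

```python
import re

# The "Value" from the PRONOM entry, decoded from hex:
# "(This file must be converted with BinHex"
BANNER = bytes.fromhex(
    "28546869732066696C65206D75737420626520636F6E7665727465642077697468"
    "2042696E486578"
)

# Banner, then 6-9 arbitrary bytes (" 4.0)" plus line endings), then ":".
BOF_PATTERN = re.compile(re.escape(BANNER) + rb".{6,9}:", re.DOTALL)


def matches_binhex_signature(data: bytes, max_lead: int = 4084,
                             eof_window: int = 64) -> bool:
    """Approximate the BOF {0-4084} and EOF (max offset 64) checks."""
    bof = BOF_PATTERN.search(data)
    if bof is None or bof.start() > max_lead:
        return False
    # EOF part: a closing ":" after the opening one, within eof_window
    # bytes of the end of the file.
    last_colon = data.rfind(b":")
    return last_colon >= bof.end() and len(data) - 1 - last_colon <= eof_window
```

Running this over a folder of .hqx samples before submitting a signature is a quick way to see whether 4,084 and 64 are generous enough for a given collection.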

Also, the current PRONOM entry doesn’t include the MIME type, which is “application/mac-binhex40”.

Hopefully this update will add some strength to the signature and follow the specification closer. The new signature even works on files with extra content at the beginning!


There are a number of software titles you can use to encode and decode a BinHex file. On a modern Mac, try using The Unarchiver or StuffIt Expander. From the command line, you can use the macutil library or the CLI version of The Unarchiver. MacOS also has a built-in utility to decode BinHex files. If you are using a classic version of Macintosh OS, you can find a number of utilities on Macintosh Garden.

Oh, and also, the CD-ROM I mentioned earlier has a few “fun” features. Not sure if they are on purpose or if errors were made during mastering, but a few filenames have some hidden extra characters and one folder puts any tool traversing the directory into a loop, even DROID. Have fun!

Apple Package Format

September 1, 2023 by Thor

Let’s talk about Apple’s iWork software. Apple’s office suite of applications was first released in 2005 and provided a word processor (Pages), presentations (Keynote), and a little later, a spreadsheet (Numbers). They are exclusive to the Macintosh and iOS devices.

iWork was released in a few different versions. They get a little confusing as each application has its own version which all seemed to unify and stabilize in 2020. Here is a matrix of major versions.

Version	Package or ZIP
iWork ’05	Package
iWork ’06	Package
iWork ’08	Package
iWork ’09	ZIP
iWork 2013	Package
iWork 2014	ZIP
iWork 2019	ZIP
iWork 2020	ZIP

You may already be aware, but MacOS can sometimes be weird. I use the term weird in a loving, sometimes proud way, but I admit there were some “odd” choices made in regards to how applications and documents are used and stored on a Mac.

On early Macintosh computers Apple used an interesting method of storing resources for applications and some file formats. The Resource Fork for an application contained all the “resources” needed to run in the operating system. It would contain all the icons, warning screens, graphics, sounds, etc. This held true until Mac OS X came along and then Apple started using a bundle or package format. Still in use today, what appears to be a single file or application is actually a folder of all the resources needed to run the application.

By right clicking or control clicking on the icon you can open the folder and see all the contents which make up the Application.

Nifty, right? MacOS knows which extensions to treat as a package. If you were to move the application over to another system it would be a folder with the extension “.app”.

For an application I can see how this makes sense as it will only execute in the MacOS environment. The problem comes in when you use the same package method for the documents the application creates.

Contents of Pages version 1 sample file.

So instead of a single “file” with a bytestream, you get a folder of files which make up the file format. Here is Apple’s description:

Document Packages

If your document file formats are getting too complex to manage because of several disparate types of data, you might consider adopting a package format for your documents. Document packages give the illusion of a single document to users but provide you with flexibility in how you store the document data internally. Especially if you use several different types of standard data formats, such as JPEG, GIF, or XML, document packages make accessing and managing that data much easier.

Apple actually defines two similar methods:

Although bundles and packages are sometimes referred to interchangeably, they actually represent very distinct concepts:

A package is any directory that the Finder presents to the user as if it were a single file.

A bundle is a directory with a standardized hierarchical structure that holds executable code and the resources used by that code.

A couple of years ago a processed digital collection made its way down to me. It had been processed by a new digital archivist, and when I went to prepare the collection for preservation, I found a folder with the extension .pages and inside it a whole directory of files, many of which they had renamed and arranged. Needless to say, I had to track down the original disk so I could properly preserve the file.

So looking back at the earlier table, iWork switched back and forth between the package format and a ZIP container. For preservation purposes, the ZIP container is easier to maintain outside the MacOS. Let’s look a little closer at each. If you would like to follow along I have copied a few samples onto a hybrid ISO.

iWork ’05 through iWork ’08 used the same package format and structure. Because they are a package format, they are difficult to preserve as original files. I suppose you could zip them up, but probably the best option is to open with a current version of Pages and save to the newer ZIP container format.

tree iWork08/Keynote-06.key 
├── Contents
│   └── PkgInfo
├── QuickLook
│   └── Thumbnail.jpg
├── index.apxl.gz
└── theme-files
    ├── Blue 2.jpg
    ├── Blue 2.tif
    ├── Cool Gray-2.jpg
    ├── Cool Gray.tif
    ├── Green-8.jpg
    ├── Green.tif
    ├── Headlines_bullet.pdf
    ├── Headlines_star.pdf
    ├── Orange 2.tif
    ├── Orange_2.jpg
    ├── Purple-6.jpg
    ├── Purple.tif
    ├── Red.jpg
    ├── Red.tif
    ├── endpoints.pdf
    └── headlines_hi-res.jpg
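For anyone who does want to try the "zip them up" route for a package like the one listed above, here is a minimal Python sketch. The function name `zip_package` is hypothetical, and note the limits: it stores only the regular files and their relative paths, so resource forks, extended attributes, and other Finder metadata are not captured, and the result is a plain ZIP of the folder, not an iWork '09 style document.

```python
import zipfile
from pathlib import Path


def zip_package(package_dir, archive_path):
    """Store every file of a package directory in a ZIP, keeping paths.

    Hypothetical helper; the package folder name is kept as the top-level
    entry so the package can be recreated by unzipping.
    """
    package = Path(package_dir)
    if not package.is_dir():
        raise ValueError(f"{package} is not a package directory")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for item in sorted(package.rglob("*")):
            if item.is_file():
                arcname = item.relative_to(package.parent).as_posix()
                zf.write(item, arcname)
    return Path(archive_path)
```

Something like `zip_package("iWork08/Keynote-06.key", "Keynote-06.key.zip")` yields one bytestream that can move through a preservation system, with the original package recoverable by extraction.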

iWork ’09 changed this practice. The documents saved from Pages, Keynote, and Numbers were contained in a ZIP file and can be identified using the PRONOM registry container signatures.

filename : 'iWork 2013/Pages2013-Sample09.pages'
filesize : 105900
modified : 2019-11-21T20:36:00-07:00
matches  :
  - ns      : 'pronom'
    id      : 'fmt/1439'
    format  : 'Apple iWork Pages'
    version : '09'
    class   : 'Word Processor'
    basis   : 'extension match pages; container name index.xml with byte match at 195, 76'

Sample09.pages
Type = zip
WARNINGS:
Headers Error
Physical Size = 105900

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:36:00 .....       364773        22413  index.xml
2019-11-21 20:36:00 .....         7007         7007  Hardcover_bullet_black.png
2019-11-21 20:36:00 .....        69176        69176  Simple_Noise_2x.jpg
2019-11-21 20:36:00 .....          232          232  buildVersionHistory.plist
2019-11-21 20:36:00 .....         6406         6406  QuickLook/Thumbnail.png
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:36:00             447594       105234  5 files

Then Apple went back to a package format with iWork 2013, for reasons unknown, but the content and structure changed. It’s a package format with an Index.zip instead of index.xml.

Pages2013-Sample.pages
├── Data
│   └── Hardcover_bullet_black-13.png
├── Index.zip
├── Metadata
│   ├── BuildVersionHistory.plist
│   ├── DocumentIdentifier
│   └── Properties.plist
├── preview-micro.jpg
├── preview-web.jpg
└── preview.jpg

3 directories, 8 files

The ZIP within the package contains a new Apple format, IWA.

Pages2013-Sample.pages/Index.zip
Type = zip
Physical Size = 39361

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:47:14 .....         3860         3860  Index/Document.iwa
2019-11-21 20:47:14 .....           26           26  Index/Tables/DataList.iwa
2019-11-21 20:47:14 .....          336          336  Index/ViewState.iwa
2019-11-21 20:47:14 .....          160          160  Index/CalculationEngine.iwa
2019-11-21 20:47:14 .....          121          121  Index/DocumentStylesheet.iwa
2019-11-21 20:47:14 .....        31931        31931  Index/ThemeStylesheet.iwa
2019-11-21 20:47:14 .....           22           22  Index/AnnotationAuthorStorage.iwa
2019-11-21 20:47:14 .....         1889         1889  Index/Metadata.iwa
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:47:14              38345        38345  8 files

Luckily Apple came to their senses and went back to the ZIP container format for iWork 2014 and later. The container signature looks for the IWA file Apple started using with iWork 2013.

filename : 'iWork 2014/Pages2014-Sample.pages'
filesize : 66256
modified : 2019-11-22T00:03:56-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/1441'
    format  : 'Apple iWork Document'
    version : '14'
    class   : 'Presentation, Spreadsheet, Word Processor'
    basis   : 'extension match pages; container name Index/Document.iwa with byte match at 16, 6; name Metadata/Properties.plist with name only'

Path = iWork 2014/Pages2014-Sample.pages
Type = zip
Physical Size = 66256

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-22 00:03:54 .....         3930         3930  Index/Document.iwa
2019-11-22 00:03:54 .....          364          364  Index/ViewState.iwa
2019-11-22 00:03:54 .....          206          206  Index/CalculationEngine.iwa
2019-11-22 00:03:54 .....        33573        33573  Index/DocumentStylesheet.iwa
2019-11-22 00:03:54 .....           22           22  Index/AnnotationAuthorStorage.iwa
2019-11-22 00:03:54 .....           23           23  Index/DocumentMetadata.iwa
2019-11-22 00:03:54 .....         8761         8761  Index/Metadata.iwa
2019-11-22 00:03:54 .....          322          322  Metadata/Properties.plist
2019-11-22 00:03:54 .....           36           36  Metadata/DocumentIdentifier
2019-11-22 00:03:54 .....          273          273  Metadata/BuildVersionHistory.plist
2019-11-22 00:03:54 .....        14611        14611  preview.jpg
2019-11-22 00:03:54 .....          838          838  preview-micro.jpg
2019-11-22 00:03:54 .....         1571         1571  preview-web.jpg
------------------- ----- ------------ ------------  ------------------------
2019-11-22 00:03:54              64530        64530  13 files
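Pulling the listings above together, the storage layout of an iWork document can be guessed from a handful of markers: whether the path is a directory, and whether it contains Index.zip, index.xml, or Index/Document.iwa. The Python sketch below does just that. It is not a substitute for the PRONOM container signatures, the function name `classify_iwork` and the labels it returns are my own, and a ZIP with damaged headers (like the "Headers Error" sample earlier) may not be readable by Python's zipfile module.

```python
import zipfile
from pathlib import Path


def classify_iwork(path) -> str:
    """Guess the iWork storage layout from the markers in the listings."""
    p = Path(path)
    if p.is_dir():
        if (p / "Index.zip").is_file():
            return "package, Index.zip (iWork 2013 layout)"
        return "package, other layout"
    if p.is_file() and zipfile.is_zipfile(p):
        with zipfile.ZipFile(p) as zf:
            names = set(zf.namelist())
        if "Index/Document.iwa" in names:
            return "zip, IWA (iWork 2014+ layout)"
        if "index.xml" in names:
            return "zip, index.xml (iWork '09 layout)"
        return "zip, other layout"
    return "unknown"
```

Anything reported as a package is a candidate for the renaming and rearranging problem described earlier, so it is worth flagging those before a collection is processed.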

Now iWork was not the only Apple software to use the Package/Bundle format for their documents. Be advised the following software may save to the package format.

I remember a few years ago, Trent Reznor (NIN) decided to release a few of his tracks in the Garageband format. A little harder to find these days, but the good old wayback machine kept a copy for us! Grab them here. Be warned, they may be in the package format. Thanks Apple!