Common Ground

If digital preservation had an extension it most likely would be .DP

Unfortunately, it’s taken. Say hello to Digital Paper.

In the early 1990’s, folks started to share documents with each other through the their phone lines. The early internet, BBS, AOL, CompuServe and the like allowed people to share ideas through applications like Word/WordPerfect Documents. Most people had a copy of the popular software and that software could open documents from their competitors, but fonts were always a problem. Technically a font is software as well and needs a license to be used. Also printers at the time dictated what the document might look like when opened, so your document may look different on someone else’s computer. This lead to a few innovations in the software market Digital Paper.

The idea is simple, create a format which could be opened with a free viewer which includes all the parts to make it look and print just like it was intended to. You may have already guessed who the winner in this space tuned out to be, yes, the PDF format. You can’t tell the history of the PDF Format without mentioning others that tried their luck to be the leader in portable document formats . WordPerfect’s Envoy format was one, Common Ground Digital Paper was another.

No Hands Software which started in 1990, developed the idea of making your documents truly portable. They released the Common Ground Maker and Viewer software in 1993. By 1996 the company was doing so well they were bought for $6 million by Hummingbird Ltd. PDF soon became so ubiquitous, formats like Common Ground and Envoy fizzled out. That doesn’t mean they didn’t have a big impact and still can be found in quite a few places.

Apple was one of the bigger users for awhile, but the format can still be found floating around today.

The Common Ground Digital Paper has some similarities to the PDF format, but the biggest different is the format is proprietary and not open like PDF. Another difference is you could embed the viewer into the file, this would make an executable on both Windows and Macintosh. Very convenient for sending to those who may not have the viewer or can’t install the viewer on their system.

Common Ground had two different viewers, a pro viewer with more features and a Mini Viewer with basic features and which was free to download and distribute from their website. Unfortunately, they linked to an FTP site which no longer is available and so finding the viewers today can be difficult.

I came across a boxed version 1 for Macintosh of the software a few years back, but have yet to find other full versions. The software did change hands a bit, but seems to have topped out at Version 4 in the late 1990’s. Let’s take a look at the file format for the samples we do have.

Version 1 for the Macintosh was the first I believe, coming to Windows shortly afterwards. The format was even assigned a MimeType for use on the web and the application gives us a little insight into the format.

The commonground file format does have versions (two at the moment). They *are* internally documented with a file signature, allowing commonground viewers to automatically handle both old and new format files. Therefore, I don’t believe a ‘version’ parameter is needed.

A Content-Type of “application/commonground” indicates a document in the Common Ground portable file format, also known as Digital Paper.

Encoding considerations: Common Ground files are in a binary format. Some encoding will be necessary for MIME mailers as in application/octet-stream. Common Ground files for the Macintosh are encoded in the data fork of a Macintosh file. The file type is APPL, the creator is CGVM.

If we look at a sample from Version 1 for the Macintosh we find the follow hex values:

hexdump -C CG-s01.dp | head
00000000  00 00 03 56 00 00 04 d9  43 47 44 43 00 00 00 00  |...V....CGDC....|
00000010  96 6c 00 07 04 b4 03 de  00 00 00 00 02 da 02 28  |.l.............(|
00000020  00 11 02 ff 0c 00 ff ff  ff ff 00 00 00 00 00 00  |................|
00000030  00 00 02 28 00 00 02 da  00 00 00 00 00 00 00 01  |...(............|
00000040  00 0a 00 05 00 05 00 15  02 23 00 32 00 05 80 02  |.........#.2....|
00000050  00 15 7f fe 00 2c 00 09  00 03 06 47 65 6e 65 76  |.....,.....Genev|
00000060  61 00 00 03 00 03 00 0d  00 0c 00 2e 00 04 00 00  |a...............|
00000070  00 00 00 2b 06 11 07 54  65 73 74 69 6e 67 00 01  |...+...Testing..|
00000080  00 0a ff e1 ff e2 02 f9  02 46 00 03 00 00 00 0d  |.........F......|
00000090  00 00 00 28 02 d5 01 05  05 2d 20 31 20 2d 00 ff  |...(.....- 1 -..|

In all the samples I have the first 8 bytes are not consistent, but the next four bytes are. CGDC, which happens to be the registered type on the Macintosh. Convenient. But it appears later versions are not the same.

hexdump -C MANUAL.DP | head
00000000  00 00 00 20 00 00 b7 f4  44 50 4c 33 00 00 00 04  |... ....DPL3....|
00000010  00 00 00 00 00 00 00 00  3b 60 53 df 00 00 00 00  |........;`S.....|
00000020  00 00 00 18 00 00 b4 da  00 00 b4 c2 00 00 03 3e  |...............>|
00000030  78 00 79 00 7a 00 7b 00  00 00 00 77 01 01 00 0c  |x.y.z.{....w....|
00000040  00 01 02 01 00 00 00 97  fe ed f0 05 00 b7 86 04  |................|
00000050  5f 05 f7 01 00 03 ed f0  02 00 3d 00 ff 45 75 72  |_.........=..Eur|
00000060  6c 20 00 01 07 ff bf 05  9f 00 01 08 a3 05 fb ba  |l ..............|
00000070  02 fa f1 00 ff ff 00 11  ff 68 74 74 70 3a 2f 2f  |.........http://|
00000080  77 ff 77 77 2e 47 53 50  2e 43 b9 43 1c 0f 03 04  |w.ww.GSP.C.C....|
00000090  95 05 c8 0d 00 cc fb 05  e3 13 06 15 6d 61 69 6c  |............mail|

hexdump -C dpwhite.dp | head
00000000  00 00 00 18 00 01 79 17  44 50 4c 32 00 00 00 00  |......y.DPL2....|
00000010  00 00 00 00 00 00 00 00  00 00 00 18 00 01 76 de  |..............v.|
00000020  00 01 76 c6 00 00 04 b2  00 00 00 00 00 00 00 00  |..v.............|
00000030  00 00 00 1e 01 01 00 0c  00 00 01 01 00 00 00 12  |................|
00000040  00 01 00 01 00 00 00 00  0c 4e 09 60 01 2c 01 2c  |.........N.`.,.,|
00000050  00 64 00 00 00 02 00 00  00 00 00 a2 01 01 00 0c  |.d..............|
00000060  00 01 02 01 00 00 00 e2  fa ed f0 22 ed f1 0c 4e  |..........."...N|
00000070  09 60 00 ff e1 01 26 0a  83 08 3b ff ff 6a ff 6a  |.`....&...;..j.j|
00000080  0c e4 09 f6 01 ff 2c 01  2c 00 08 00 64 00 df 00  |......,.,...d...|
00000090  01 01 00 03 ed f0 0f 00  79 0a 1c 0f 28 07 42 41  |........y...(.BA|

These files are from a later version and have a different string at byte 8. DPL2 & DPL3. In the MiniViewer you can request document information and it provides some basic metadata for each file.

I only have one example of the DPL3, but a couple examples of DPL2, and it seems like DPL2 comes from a Version 3 DP Maker and DPL3 comes from Version 4 Maker. Need to see if I can find a Version 2 file and see if it follows the same pattern.

Two of my favorite CD-ROM’s on Internet Archive are Dr. Dobb’s The Essential Books on File Formats and Internet File Formats, both have copies of the Mini Viewer.

One of features similar to PDF is the ability to password protect certain features. This is what the document information looks like.

The header is the same, but the plain text usually seen in the file is no longer visible, so it appears the rest of the file is encrypted.

hexdump -C password.dp | head 
00000000  00 00 5d 95 00 00 06 94  43 47 44 43 00 00 00 01  |..].....CGDC....|
00000010  8e 3b 18 7e c5 16 f8 e0  0f f5 6f 32 2f 34 36 81  |.;.~......o2/46.|
00000020  4b 8a 03 da 9e 1a 85 6c  36 e4 39 f2 5a 2a a2 5f  |K......l6.9.Z*._|
00000030  81 83 65 ee 9c 16 d0 2d  2d c3 04 df 69 c8 06 0d  |..e....--...i...|
00000040  77 df 27 19 33 59 f6 05  61 4e 2c a6 58 27 47 26  |w.'.3Y..aN,.X'G&|
00000050  fe 6b 3c 06 7e cb 7f fb  33 f8 64 ed 05 54 b4 7d  |.k<.~...3.d..T.}|
00000060  c7 b5 e3 c2 df 40 53 63  ef 8e 10 1c c7 58 bd 28  |.....@Sc.....X.(|
00000070  9b 8a 2c 8f ae 82 33 f7  ff d4 3c 96 5c b4 08 69  |..,...3...<.\..i|
00000080  1f 00 af ce a7 56 93 27  07 cc 39 97 17 22 49 d7  |.....V.'..9.."I.|
00000090  5b 89 9b e6 b7 b1 5c 38  75 ba 08 ee 66 d0 9a d2  |[.....\8u...f...|

This file format is not currently in PRONOM. From what I have gathered I could add three signatures. There could be some other variations out there and the password protection needs to be considered. Maybe I’ll take Nick Gault’s offer and request the format which was available starting in the middle of 1995. Think they’ll deliver?

No bad deed….

I had access to my first Macintosh computer around 1987. My father brought it home and I spent hours on it playing games and occasionally writing reports for school. The Macintosh Plus computer had one floppy drive and no hard drive. I remember playing the game Orbiter which had two floppy disks and right in the middle of game play it would pause and ask me to insert disk 2, then quickly ask for disk 1 again. The struggle was real. I spent years using many different Macintosh computers and now own more than I wish to admit. I’m preserving them!

The wild world of digital preservation has been a little lacking on the Macintosh side of things as I have come to realize. There still not a great way to manage Resource Forks in many preservation systems and the identification tools are mainly focused on the data bytetreams and not any system specific attributes Macintosh used often.

The PRONOM registry has either referenced early Macintosh specific formats or missed them entirely so I have been slowly working on a few to close that gap.

Interestingly enough, many Microsoft programs initially made their GUI debuts on the early Macintosh before making their way to Windows. Excel is one I am working on, as Version 1 is not identifiable in PRONOM, it was Macintosh only at the time.

Another is PowerPoint, I recently submitted two new signatures to PRONOM.

fmt/1747: Microsoft PowerPoint Presentation v2.x. Full entry added.
fmt/1748: Microsoft PowerPoint Presentation v3.x. Full entry added.
fmt/1866: Microsoft Powerpoint for Macintosh v.2. Full entry added.
fmt/1867: Microsoft Powerpoint for Macintosh v.3. Full entry added.

PowerPoint was initially released in 1987 on the Macintosh platform. It was developed by a company called ForeThought. Version 1.0 on the Macintosh was under this name, until it was bought by Microsoft only three months after being released. The history of PowerPoint can be discovered at Robert Gaskins, one of the original developers, website and book he wrote. The available information provided by Microsoft is only for the OLE format, covering versions 4.0 until 2003.

So, lets take a look at the Powerpoint original file format, before OLE.

   Type/Creator      RF      DF  Date         Filename
f  SLDS/PPNT         0       932 Oct 10 19:10 PowerPoint-v1

Luckily the early PowerPoint files did not have a Resource Fork. The Data Fork, if you haven’t noticed, has an interesting set of hex values at the beginning of the file. 0BADDEED is the first 4 bytes. If we look at a PowerPoint version 2 file from Windows.

The file format is the same, but because of the weird world of endianness, the first few bytes are in reverse order, EDDEAD0B.

Obviously we need to discuss this magic number and the meaning behind “Bad Deed”. This question was asked previously by the digital preservation community. I have a previous blog post about the use of words for the magic number CAFEBEEF as it was used with with JAVA class files and Express Publisher in the 1990’s. BADDEED looks like another clever use of the hex values that formed words. But was there a story behind the words? Joe Carrano asked if this string might be hexspeak. I wanted to know more so I asked some one who might know.

Robert Gaskins was kind enough to chat with me for a bit about the early days of PowerPoint.

I had a theory on the possible meaning behind BADDEED, so I asked him what the feeling was like between Apple and Microsoft at the time. I had heard for years that PowerPoint was originally created for the Macintosh, but Robert informed me:

  In fact, PowerPoint was designed first for Microsoft Windows, 

and its first spec shows that: “All the screen shots, menus, and 

dialogs were set up to look like Microsoft Windows, not like 

Macintosh.”  (Gaskins, Sweating Bullets, p. 92)  You can see that 

spec here.

A year later, we concluded that we would be forced to ship 

on Mac first, although we still thought that Windows was the 

big opportunity and thought that Mac was risky.  “We just didn’t think 

we could successfully ship a product for Windows, yet, though we planned 

to later. (Gaskins, Sweating Bullets, p. 105)  The considerations are 

summarized in my June 1986 product marketing document.

Of course, we turned out to have been right all along.  PowerPoint on 

Mac was much loved, but sales remained poor because Mac sales were 

so poor.  It was only after we shipped on Windows that PowerPoint gained 

the dominant market share which has characterized it ever since, and 

Windows PPT outsold Mac PPT very quickly. (Gaskins, Sweating Bullets, p. 403)

So my original thought was that there was some bad feelings around this Apple, Microsoft battle which has been the sentiment for quite some time. So when I asked if any of that influenced the use of BADDEED, I was told:

So, far from being disgruntled by expanding PowerPoint to Windows, 

that had been our goal all along, and its achievement was the most 

important success we had.

I judge that you are fully aware of all that, and that 

your question is more, “was there any bad deed signified 

by the Mac hex value chosen?”  No, it was just the poverty 

of choice when you only have six letters.

So there you have it. The use of the hex values 0x0BADDEED, was simply chosen from a limited set of values when looking at words hexadecimal could spell. I guess I should never let the truth get in the way of a good story.

I continued to have a wonderful conversation with Robert and also asked him for some details on the rest of the PowerPoint file format. I was hoping there might be some documentation out there explaining the early format before Microsoft took over. Robert said:

 I don’t know of any such documentation apart from the official 

Microsoft support files available online.  I don’t have any such 

information.  I know that Dennis Austin deposited some of our 

working files at the Computer History Museum (not online):

https://archive.computerhistory.org/resources/access/text/finding-aids/102733943-Austin/102733943-Austin.pdf

and it’s likely that some information is there–if nothing 

else, it claims to contain a source code listing for PPT 1.0 

which would contain the code to read the file format.

So there might be some information in at the Computer History Museum worth looking into.

As far as I could tell from the available online information, there is a few differences between Version 1.0 and Version 2.0, the biggest being the fact that 1.0 did not have an option to print in color, amount a few other minor things. Here is a screenshot of a page from the Microsoft PowerPoint 2.0 documentation on archive.org.

I suppose with the signature additions of the Macintosh and Windows versions 2.0 and 3.0 of the PowerPoint file format in PRONOM, that should cover most needs. Currently my PowerPoint 1.0 files identify at 2.0 files, so I may need to have them adjust the PUID to include both versions 1.0 and 2.0 as they are so similar. If I am able to find a difference or get my hands on the original source code I may find a better solution.

Gone in a Flash

This week I am at the annual iPres digital preservation conference. It is an amazing week of meeting colleagues and old friends who share the same passion of digital preservation. Outside of this community and my co-workers, talking about file formats and digital preservation usually bores people to death and I can hear some of them mumble under their breath, “nerd”! I term I am happy to accept.

At the conference, which is in lovely Urbana-Champaign Illinois this year, I am trying to soak in all the amazing talks and conversations about the challenges facing our work. There were a couple great workshops on Persistent Identifiers and Digital Object Storage Criteria. The presentations I made were of course on File Formats, documentation, and obsolescence. One talk before my panel conversation was about the ubiquitous Adobe Flash format.

The paper, “Around for Decades, Gone in a Flash: How we dealt with Flash objects and the National Archives of the Netherlands” was presented by Lotte Wijsman and Marin Rappard. They knew they had flash objects in their web archives and wanted to go through the process of how they might be preserved and accessed. They started out looking for any files with “FLA”, “SWF”, and “FLV” as extensions. This proved problematic as there were references to those extensions within other documents and objects. They then used DROID to identify the flash formats. “SWF” has quite a number of format PUID’s.

PUIDFormat NameFormat VersionExtension
fmt/104Macromedia Flash1swf,
fmt/105Macromedia Flash2swf,
fmt/106Macromedia Flash3swf,
fmt/107Macromedia Flash4swf,
fmt/108Macromedia Flash5swf,
fmt/109Macromedia Flash6swf,
fmt/110Macromedia Flash7swf,
fmt/505Adobe Flash8swf,
fmt/506Adobe Flash9swf,
fmt/507Adobe Flash10swf,
fmt/757Adobe Flash11swf,
fmt/758Adobe Flash12swf,
fmt/759Adobe Flash13swf,
fmt/760Adobe Flash14swf,
fmt/761Adobe Flash15swf,
fmt/762Adobe Flash16swf,
fmt/763Adobe Flash17swf,
fmt/764Adobe Flash18swf,
fmt/765Adobe Flash19swf,
fmt/766Adobe Flash20swf,
fmt/767Adobe Flash21swf,
fmt/768Adobe Flash22swf,
fmt/769Adobe Flash23swf,
fmt/770Adobe Flash24swf,
fmt/771Adobe Flash25swf,
fmt/772Adobe Flash26swf,
fmt/773Adobe Flash27swf,
fmt/774Adobe Flash28swf,
fmt/775Adobe Flash29swf,
fmt/776Adobe Flash30swf,

Even the Macromedia/Adobe Flash Video format has a PRONOM PUID, x-fmt/382.

The format missing from PRONOM is the FLA format. FLA is the native format for Macromedia/Adobe Flash for saving the source project of your Flash document. SWF files are compiled from the FLA source. This means the the SWF will be the most common format found on the web and in public places, but the FLA format might be more often found on local drives and working folders.

The Flash format and software was actually created by Future Wave software in 1996 as FutureSplash Animator, but was shortly bought by Macromedia later that year and Flash 1.0 was born. FutureSplash used the extension .SPA, but Macromedia changed it to FLA.

The format was initially based on the Microsoft Compound File Format or the OLE container format.

oledir Flash4-S01.fla 
oledir 0.54 - http://decalage.info/python/oletools
OLE directory entries in file Flash4-S01.fla:
----+------+-------+----------------------+-----+-----+-----+--------+------
id  |Status|Type   |Name                  |Left |Right|Child|1st Sect|Size  
----+------+-------+----------------------+-----+-----+-----+--------+------
0   |<Used>|Root   |Root Entry            |-    |-    |1    |5       |4416  
1   |<Used>|Stream |Contents              |2    |-    |-    |6       |4013  
2   |<Used>|Stream |Page 1                |-    |-    |-    |0       |329   
3   |unused|Empty  |                      |-    |-    |-    |0       |0     
----+----------------------------+------+--------------------------------------
id  |Name                        |Size  |CLSID                                 
----+----------------------------+------+--------------------------------------
0   |Root Entry                  |-     |597CAA70-72AA-11CF-831E-524153480000  
1   |Contents                    |4013  |                                      
2   |Page 1                      |329   |   

The FLA format stayed with OLE until Adobe Flash CS5, which the format changed to use a ZIP container to store all the content.

Flash5.5-S01.fla
Type = zip
Physical Size = 216632

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2022-07-09 11:57:46 .....           25           25  mimetype
2022-07-09 11:57:46 .....            9            9  Flash5.5-S01.xfl
2022-07-09 11:57:46 D....            0            0  LIBRARY
2022-07-09 11:57:46 D....            0            0  META-INF
2022-07-09 11:57:46 .....        49267         3936  DOMDocument.xml
2022-07-09 11:57:48 .....         9735         1103  META-INF/metadata.xml
2022-07-09 11:57:48 .....         8093         2222  PublishSettings.xml
2022-07-09 11:57:48 .....            0            0  MobileSettings.xml
2022-07-09 11:57:48 D....            0            0  LIBRARY/Mouth shape graphic symbols
2022-07-09 11:57:48 D....            0            0  LIBRARY/Voice
2022-07-09 11:57:48 .....       151006       151006  bin/M 1 1252032698.dat
2022-07-09 11:57:48 .....        99707        15311  LIBRARY/mouth.xml
2022-07-09 11:57:48 .....        16510         4534  LIBRARY/Mouth shape graphic symbols/A I.xml
2022-07-09 11:57:48 .....        14334         4086  LIBRARY/Mouth shape graphic symbols/C D G K N R S Th Y Z.xml
2022-07-09 11:57:48 .....        14531         4040  LIBRARY/Mouth shape graphic symbols/E.xml
2022-07-09 11:57:48 .....        15846         4007  LIBRARY/Mouth shape graphic symbols/F V D Th.xml
2022-07-09 11:57:48 .....        13093         3542  LIBRARY/Mouth shape graphic symbols/L D Th.xml
2022-07-09 11:57:48 .....         2106          751  LIBRARY/Mouth shape graphic symbols/M B P Closed.xml
2022-07-09 11:57:48 .....        14130         3949  LIBRARY/Mouth shape graphic symbols/O.xml
2022-07-09 11:57:48 .....        11082         2951  LIBRARY/Mouth shape graphic symbols/Open_Rest.xml
2022-07-09 11:57:48 .....        14847         4066  LIBRARY/Mouth shape graphic symbols/U.xml
2022-07-09 11:57:48 .....         8139         2202  LIBRARY/Mouth shape graphic symbols/W Q.xml
2022-07-09 11:57:48 .....        15768         3914  LIBRARY/panda.xml
2022-07-09 11:57:48 .....        10477         1064  LIBRARY/sample graph.xml
2022-07-09 11:57:48 .....          538          538  bin/SymDepend.cache
------------------- ----- ------------ ------------  ------------------------
2022-07-09 11:57:48             469243       213256  21 files, 4 folders

The move to a ZIP container included a new format, XFL. This XFL file is a simple text file with the text “PROXY-CS5″. In the DOMDocument.xml file we find an XML namespace, xmlns=”http://ns.adobe.com/xfl/2008/” and a version of the XFL structure, xflVersion=”2.1″.

This ZIP compressed FLA file is still being used in the current Adobe Animate software, which no longer uses the flash technology and uses more modern web formats like HTML5 to display the animations.

I took each version and made a PRONOM signature, which you can find here with samples. These container signatures should cover all the major changes for the format, but there is a problem……..

Listing archive: Flash5.5-S01v5.fla

--
Path = Flash5.5-S01v5.fla
Type = zip
ERRORS:
Headers Error
Physical Size = 216581
Embedded Stub Size = 63
Characteristics = Local

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2022-07-09 11:57:46 .....           25           25  mimetype
2022-07-09 11:57:46 D....            0            0  LIBRARY
2022-07-09 11:57:46 D....            0            0  META-INF
2022-07-09 11:57:46 .....        48556         3742  DOMDocument.xml
2022-07-09 11:57:48 .....        10133         1112  META-INF/metadata.xml
2022-07-09 11:57:48 .....         8115         2219  PublishSettings.xml
2022-07-09 11:57:48 .....            0            0  MobileSettings.xml
2022-07-09 11:57:48 D....            0            0  LIBRARY/Mouth shape graphic symbols
2022-07-09 11:57:48 D....            0            0  LIBRARY/Voice
2022-07-09 11:57:48 .....       151006       151006  bin/M 1 1252032698.dat
2022-07-09 11:57:48 .....        99551        15319  LIBRARY/mouth.xml
2022-07-09 11:57:48 .....        16580         4536  LIBRARY/Mouth shape graphic symbols/A I.xml
2022-07-09 11:57:48 .....        14404         4089  LIBRARY/Mouth shape graphic symbols/C D G K N R S Th Y Z.xml
2022-07-09 11:57:48 .....        14531         4044  LIBRARY/Mouth shape graphic symbols/E.xml
2022-07-09 11:57:48 .....        15848         4008  LIBRARY/Mouth shape graphic symbols/F V D Th.xml
2022-07-09 11:57:48 .....        13024         3546  LIBRARY/Mouth shape graphic symbols/L D Th.xml
2022-07-09 11:57:48 .....         2106          752  LIBRARY/Mouth shape graphic symbols/M B P Closed.xml
2022-07-09 11:57:48 .....        14200         3955  LIBRARY/Mouth shape graphic symbols/O.xml
2022-07-09 11:57:48 .....        11152         2963  LIBRARY/Mouth shape graphic symbols/Open_Rest.xml
2022-07-09 11:57:48 .....        14777         4069  LIBRARY/Mouth shape graphic symbols/U.xml
2022-07-09 11:57:48 .....         8287         2228  LIBRARY/Mouth shape graphic symbols/W Q.xml
2022-07-09 11:57:48 .....        15768         3914  LIBRARY/panda.xml
2022-07-09 11:57:48 .....        10477         1064  LIBRARY/sample graph.xml
2022-07-09 11:57:48 .....          538          538  bin/SymDepend.cache
2022-07-09 11:57:46 .....           25           25  mimetype
2022-07-09 11:58:18 .....            9            9  Flash5.5-S01v5.xfl
------------------- ----- ------------ ------------  ------------------------
2022-07-09 11:58:18             469112       213163  22 files, 4 folders

Turns out majority of the samples I have from many versions of Adobe Flash after CS5 have a ZIP Header error. When using the new signatures in DROID, the samples with the header errors will fail in the DROID’s zip library processing. The DROID logs shows this issue:

Could not process the potential container format (ZIP): file:///Flash5.5-S01v5.fla	
Expected 25 more entries in the Central Directory!

The Central Directory header in a ZIP file is quite important to the proper function of the ZIP container. Wikipedia has a great explanation of the header. You may notice in the listing above the file “mimetype” is shown twice which is probably the extra entries the parser wasn’t expecting.

So currently the identification of majority of these FLA formats is on hold until a way is discovered to ignore the error and continue the container identification in DROID.

HighMAT

Before the days of streaming and devices likeSmart TVs, AppleTV and Fire sticks, a few companies tried their best to come up with ways to make viewing your media on your TV mainstream. In a previous blog post I touched on the Kodak PhotoCD method, but there is one you are probably even less familiar with. HighMAT. HighMAT, or High-Performance Media Access Technology was a technology co-developed by Microsoft and Panasonic. You may have at one point owned a DVD player which had the technology built-in, but may have never used it. It came on the scene around 2002, but was abandoned by 2008.

Panasonic DVD/CD Player with HighMAT playback.

There were quite a few devices stamped with the HighMAT logo. The technology allow you to playback any Audio and Images like a DVD, with a menu and everything.

There was three different types of HighMAT compatible devices, Audio, Audio-Image, and Audio-Image-Video.

Writing data to the HighMAT format could be done with a plugin for Windows which added the functionality to Windows Media Player for burning audio playlists to the HighMAT format or through the standard CD Writing Wizard built-in to Windows XP. An extra screen would come up asking if you would like to make the CD HighMAT compatible. Making video compatible HighMAT CDs could be done through Movie Maker.

When a HighMAT CD-R/CD-RW is authored we get an interesting CD. It appears to be a Mode 2 Form 1 format:

/dev/disk10 (internal, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:        CD_partition_scheme                        *846.4 MB   disk10
   1:       CD_ROM_Mode_2_Form_1 Highmat02               2.7 MB     disk10s0

If you would like to check out a sample disc, you can grab the ISO or BIN/CUE here.

tree /Volumes/Highmat02
/Volumes/Highmat02
├── Audio\ Samples
│   ├── 17\ [no\ artist]\ -\ Speaker\ Identification\ Test.wma
│   └── sine.wma
├── HIGHMAT
│   ├── AUTHOR.XML
│   ├── CONTENTS.HMT
│   ├── IMAGES
│   │   ├── T0.HMT
│   │   ├── T1.HMT
│   │   ├── T10.HMT
│   │   ├── T11.HMT
│   │   ├── T12.HMT
│   │   ├── T13.HMT
│   │   ├── T2.HMT
│   │   ├── T3.HMT
│   │   ├── T4.HMT
│   │   ├── T5.HMT
│   │   ├── T6.HMT
│   │   ├── T7.HMT
│   │   ├── T8.HMT
│   │   └── T9.HMT
│   ├── MENU.HMT
│   ├── PLAYLIST
│   │   ├── 00000001.HMT
│   │   ├── 00000002.HMT
│   │   ├── 00000003.HMT
│   │   ├── 00000004.HMT
│   │   ├── 00000005.HMT
│   │   ├── 00000006.HMT
│   │   └── 00000007.HMT
│   └── TEXT.HMT
└── My\ Pics
    ├── Blue\ hills.jpg
    ├── Sunset.jpg
    ├── Water\ lilies.jpg
    └── Winter.jpg

5 directories, 31 files

There is a lot going on here, lets take a look at a few of the formats we find in this disc structure. The files added to the CD are converted to WMA if you checked the “Convert Files” feature and are accessible like a normal data CD. The HighMAT folder is created to make a compatible HighMAT disc. Except for one XML file the rest of the files in the HighMAT folder all have an HMT extension. The author.xml file contains the root element <HMT> with some filenames indicating some of the HMT files may be thumbnails. If we open one of the HMT thumbnail files in a hex editor we can see:

Just a plain old JPG header. Exiftool tells us it is small 160×120 pixel image, must be a thumbnail. But lets take a look at another HMT file.

Even though the Menu.hmt file has the same extension as the thumbnails, this file is definitely not a JPG file with pixel data. Same goes for the Contents and Text files as well, unique formats.

The files in the playlist folder also have a unique format.

So it seems all the HighMAT folder really does is add compatibility for hardware to provide a menu to access the original data, providing playlists and thumbnails to navigate the data on your TV screen.

I came across one of these discs while processing a collection of CD-R discs donated to our library. Normally I would copy the images and other data off the disc to our preservation system, but this disc made me stop to think about the best way to preserve the data. Is a disc image appropriate or is the HighMAT folder even worth preserving if we have the original files from the disc? Finding hardware or a software player to present the disc as intended is getting harder to do. I am curious what others think of the value of this content.

I chose not to submit any signatures to PRONOM for the moment as we assess. It would be difficult to properly identify each format with all of them having the same extension, especially the JPG thumbnails as HMT is not a valid extension for the format. Take a look at my sample files and if you have come across this format before, let me know.

Apple Package Format

Let’s talk about Apple’s iWork software. Apple’s Office Suite of applications was first released in 2005 and provided a WordProcessor (Pages), Presentations (Keynote), and a little later, Spreadsheet (Numbers). They are exclusive to the Macintosh and iOS devices.

iWork was released in a few different versions. They get a little confusing as each application has its own version which all seemed to unify and stabilize in 2020. Here is a matrix of major versions.

VersionPackage or ZIP
iWork ’05Package
iWork ’06Package
iWork ’08Package
iWork ’09ZIP
iWork 2013Package
iWork 2014ZIP
iWork 2019ZIP
iWork 2020ZIP

You may already be aware but MacOS can sometimes be weird. I use the term weird in a loving, sometimes proud way, but I admit, there was some “odd” choices made in regards to how applications and documents are used and stored files on a Mac.

On early Macintosh computers Apple used an interesting method of storing resources for applications and some file formats. The Resource Fork for an application contained all the “resources” needed to run in the operating system. It would contain all the icons, warning screens, graphics, sounds, etc. This help true until Mac OS X came along and then Apple started using a bundle or package format. Still in use today, what appears to be a single file or application is actually a folder of all the resources needed to run the application.

Show Package Contents

By right clicking or control clicking on the icon you can open the folder and see all the contents which make up the Application.

Directory listing of Pages.app on MacOS

Nifty right? The MacOS which knows which extensions to treat as a package. If you were to move the application over to another system it would be a folder with the extension “.app”.

For an application I can see how this makes sense as it will only execute in the MacOS environment. The problem comes in when you use the same package method for the documents the application creates.

Contents of Pages version 1 sample file.

So instead of a single “file” with a bytestream, you get a folder of files which make up the file format. Here is Apple’s description:

Document Packages

If your document file formats are getting too complex to manage because of several disparate types of data, you might consider adopting a package format for your documents. Document packages give the illusion of a single document to users but provide you with flexibility in how you store the document data internally. Especially if you use several different types of standard data formats, such as JPEG, GIF, or XML, document packages make accessing and managing that data much easier.

Apple actually defines two similar methods:

Although bundles and packages are sometimes referred to interchangeably, they actually represent very distinct concepts:

  • package is any directory that the Finder presents to the user as if it were a single file.
  • bundle is a directory with a standardized hierarchical structure that holds executable code and the resources used by that code.

A couple years ago a processed digital collection made its way down to me. It had been processed by a new digital archivist and when I went to prepare the collection for preservation, I found a folder with the extension .pages and inside the folder a whole directory of files. Many of which they had renamed and arranged. Needless to say, I had to track down the original disk so I could properly preserve the file.

So looking back at the earlier table, iWork switched back and forth between the package format and a ZIP container. For preservation purposes, the ZIP container is easier to maintain outside the MacOS. Lets look a little closer at each. If you would like to follow along I have copied a few samples onto a hybrid ISO.

iWork ’05 through iWork ’08 used the same package format and structure. Because they are a package format, they are difficult to preserve as original files. I suppose you could zip them up, but probably the best option is to open with a current version of Pages and save to the newer ZIP container format.

tree iWork08/Keynote-06.key 
├── Contents
│   └── PkgInfo
├── QuickLook
│   └── Thumbnail.jpg
├── index.apxl.gz
└── theme-files
    ├── Blue 2.jpg
    ├── Blue 2.tif
    ├── Cool Gray-2.jpg
    ├── Cool Gray.tif
    ├── Green-8.jpg
    ├── Green.tif
    ├── Headlines_bullet.pdf
    ├── Headlines_star.pdf
    ├── Orange 2.tif
    ├── Orange_2.jpg
    ├── Purple-6.jpg
    ├── Purple.tif
    ├── Red.jpg
    ├── Red.tif
    ├── endpoints.pdf
    └── headlines_hi-res.jpg

iWork ’09 changed this practice. The documents saved from Pages, Keynote, and Numbers were contained in a ZIP file and can be identified using the PRONOM registry container signatures.

filename : 'iWork 2013/Pages2013-Sample09.pages'
filesize : 105900
modified : 2019-11-21T20:36:00-07:00
matches  :
  - ns      : 'pronom'
    id      : 'fmt/1439'
    format  : 'Apple iWork Pages'
    version : '09'
    class   : 'Word Processor'
    basis   : 'extension match pages; container name index.xml with byte match at 195, 76' 
Sample09.pages
Type = zip
WARNINGS:
Headers Error
Physical Size = 105900

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:36:00 .....       364773        22413  index.xml
2019-11-21 20:36:00 .....         7007         7007  Hardcover_bullet_black.png
2019-11-21 20:36:00 .....        69176        69176  Simple_Noise_2x.jpg
2019-11-21 20:36:00 .....          232          232  buildVersionHistory.plist
2019-11-21 20:36:00 .....         6406         6406  QuickLook/Thumbnail.png
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:36:00             447594       105234  5 files

Then Apple went back to a Package format with iWork 2013. For reasons unknown. But the content and structure changed. Its a package format with a Index.zip instead of index.xml

Pages2013-Sample.pages
├── Data
│   └── Hardcover_bullet_black-13.png
├── Index.zip
├── Metadata
│   ├── BuildVersionHistory.plist
│   ├── DocumentIdentifier
│   └── Properties.plist
├── preview-micro.jpg
├── preview-web.jpg
└── preview.jpg

3 directories, 8 files

The ZIP within the package contains a new Apple format. IWA

Pages2013-Sample.pages/Index.zip
Type = zip
Physical Size = 39361

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:47:14 .....         3860         3860  Index/Document.iwa
2019-11-21 20:47:14 .....           26           26  Index/Tables/DataList.iwa
2019-11-21 20:47:14 .....          336          336  Index/ViewState.iwa
2019-11-21 20:47:14 .....          160          160  Index/CalculationEngine.iwa
2019-11-21 20:47:14 .....          121          121  Index/DocumentStylesheet.iwa
2019-11-21 20:47:14 .....        31931        31931  Index/ThemeStylesheet.iwa
2019-11-21 20:47:14 .....           22           22  Index/AnnotationAuthorStorage.iwa
2019-11-21 20:47:14 .....         1889         1889  Index/Metadata.iwa
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:47:14              38345        38345  8 files

Luckily Apple came to their senses and went back to the ZIP container format for iWork 2014 and later. The container signature looks for the IWA file Apple started using with iWork 2013.

filename : 'iWork 2014/Pages2014-Sample.pages'
filesize : 66256
modified : 2019-11-22T00:03:56-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/1441'
    format  : 'Apple iWork Document'
    version : '14'
    class   : 'Presentation, Spreadsheet, Word Processor'
    basis   : 'extension match pages; container name Index/Document.iwa with byte match at 16, 6; name Metadata/Properties.plist with name only'
Path = iWork 2014/Pages2014-Sample.pages
Type = zip
Physical Size = 66256

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-22 00:03:54 .....         3930         3930  Index/Document.iwa
2019-11-22 00:03:54 .....          364          364  Index/ViewState.iwa
2019-11-22 00:03:54 .....          206          206  Index/CalculationEngine.iwa
2019-11-22 00:03:54 .....        33573        33573  Index/DocumentStylesheet.iwa
2019-11-22 00:03:54 .....           22           22  Index/AnnotationAuthorStorage.iwa
2019-11-22 00:03:54 .....           23           23  Index/DocumentMetadata.iwa
2019-11-22 00:03:54 .....         8761         8761  Index/Metadata.iwa
2019-11-22 00:03:54 .....          322          322  Metadata/Properties.plist
2019-11-22 00:03:54 .....           36           36  Metadata/DocumentIdentifier
2019-11-22 00:03:54 .....          273          273  Metadata/BuildVersionHistory.plist
2019-11-22 00:03:54 .....        14611        14611  preview.jpg
2019-11-22 00:03:54 .....          838          838  preview-micro.jpg
2019-11-22 00:03:54 .....         1571         1571  preview-web.jpg
------------------- ----- ------------ ------------  ------------------------
2019-11-22 00:03:54              64530        64530  13 files

Now iWork was not the only Apple software to use the Package/Bundle format for their documents. Be advised the following software may save to the package format.

I remember a few years ago, Trent Reznor (NIN) decided to release a few of his tracks in the Garageband format. A little harder to find these days, but the good old wayback machine kept a copy for us! Grab them here. Be warned, they may be in the package format. Thanks Apple!

Student Writing Center

When it comes to difficult file formats, one of the more difficult groups of formats are word processing text files. Difficult for many reasons, one being the shear number of them, the other is their lack of identifiable headers. Just when you think you have seen them all another pops up to add to the mix.

In a batch of other known word processing formats I came across a few files with no extension and with the following header:

The rest of the file was binary so the only thing I had to go one was the string “TLC” and “FF”. A few searches across the interwebs didn’t reveal much, seems it wasn’t a well documented format. From the names of the files and the fact they were with other word processing formats led me to assume they were also some sort of document format. The date stamps were still intact and I could see they were from the mid 1990’s. It took a few creative searches before I wondered if the “TLC” might have something to do with “The Learning Company“. If it was, I still had quite a bit of work ahead as the software developer had produced quite a few titles over the years. You probably remember the “Reader Rabbit” series of educational games.

After a bit of time I narrowed it down to a few titles and started looking for samples of each. Software was hard to find as well. I tried opening the file in a few different software until I finally came to one called “Student Writing Center”. Which may sound familiar to some of you, but there was some variations on this name out there. Some of which are:

  • Student Writing Center
  • Student Writing & Publishing Center 
  • The Children’s Writing & Publishing Center
  • The Writing Center
  • Ultimate Writing & Creativity Center

There were probably others, considering the budget software company started in 1980 and made titles for a few computer platforms starting with the Apple II. The story behind the company is a fun read.

The Student Writing Center was a simple word processor aimed at students 10 years old and older. It was found in many schools right along side Kid Pix, another very popular graphic program for kids. The software had a few different document types to help students get started writing their book reports or journal entries.

The Student Writing Center ran on both Macintosh and Windows allowing it to be one of the more popular writing tools for the younger crowd.

Each document type had a unique interface and save menu, which on Windows would save with the extensions, .RP, .NL, .JN, .LT, and .SG. They also had a slightly different header.

Reports:        1A544C43 01464600 0000
Newsletters:    1A544C43 00464600 0300
Journals:       1A544C43 00464600 0100
Letters:        1A544C43 00464600 0400
Signs:          1A544C43 00464600 0200

The signatures submitted to PRONOM take into account endianness for Windows and Macintosh with the last two byte locations being swapped. Also every document had the values “46461A” “FF” at the end of the file.

But wait! Just when you think you had it figured out…….

This file may look similar, but they are two different formats and are not compatible with each other. The little brother to the Student Writing Center was called “Ultimate Writing & Creativity Center” and was made for younger kids, ages 6-10. It had more of a cartoon interface and a cute little fountain pen teacher to walk you through the writing process.

When you saved your file in UWCC, you could choose between formats and I guess move your documents up to the more advanced program once you turned 10! If you would like to experience or re-live the opening sequence, enjoy.

I’m not done yet………

To complicate things even more The Learning Company also released another word processor called “The Writing Center“. This gets confused with Student Writing Center frequently.

But unlike the two others, this format is very different.

We’ll have to save this format for another day.

There seems to be a never ending list of word processor formats, with no end in sight. But if you used a school computer back in the early 1990’s and still have your floppy disk from back then, hopefully now you can open that report you wrote on Abraham Lincoln.

DiskDoubler

A few years ago I had someone contact me with a desperate plea. They had a disk which contained years of journal entries and letters to loved ones she could no longer access. She had used a Macintosh in the late 1980’s and early 1990’s to create all these files, but wanted to convert them all to PDF so she could make a book. She said she had tried everything, contacted a lot of people and her son had told her it was a lost cause. In talking with others at my institution, they knew I had a background in older Macintosh formats and so she contacted me. I made no promises, but offered to try.

The files she provided were indeed early Macintosh files. One obvious trait was the lack of an extension. One might think a lack of an extension was poor planning for Apple, but they choose a different method for the operating system to know the relationship between files and applications. They did this through the use of a Type/Creator code. If you were a software developer for the Macintosh you could register a four character “Creator” code, then for all the different files you used with your software you could register a “Type” code. This told the Macintosh operating system exactly which software created the file and the type so it could be opened properly. Unlike today where an extension is defaulted to one application even if it isn’t the software which created the file.

ResEdit view of Hypercard Stack Info

In some ways this was a superior identification method as there was many software titles which could all create the same file format, but this way the correct software would open the file and render it correctly.

Looking at the files provided to me, there was a few which at first seemed like they were damaged somehow, they were extremely small compared to the other files. About half the size. When I opened them in a hex editor this is what I saw.

Usually document formats during this time would keep the text in plain ascii, but these files were different, they had binary data. In the header was the only plain text strings in the file, “WDBNMSWD”. I had seen these codes before, a Microsoft Word Document! But they weren’t….. What are they?

The head of the file has the hex values “ABCD0054”, so I started searching the internet for some help. There were others having the same problem I was having. I finally came across a tool called the “Unarchiver“. Running the command line version of the software “unar”, suddenly I had a file twice the size and could be opened by Microsoft Word!

unar Letter 
Letter: DiskDoubler
"./Letter" already exists.
Successfully extracted to "./Letter-1".

Remember back in the 1990’s when storage was expensive? Instead of dropping another $20 for a 100MB ZIP Disk, you could use Symantec’s DiskDoubler. The software would be installed on your Macintosh and then a window would come up showing you all the files on your drive. With one click you could compress a single file or a directory of files saving you tons of space. When you needed the file, just double click and the software would uncompress on the fly and then open the correct application to edit the file.

With a few clicks I was able to uncompress all the affected files and provide a PDF of all the letters and journals my new friend had tried so desperately for years to open. She was thrilled to say the least.

But why stop there? PRONOM needs to know about this format!

Once I had DiskDoubler installed I could make a few more samples, where is where I found there was a few different compression methods used by the software. They are labeled AD 1 & 2 and DD 1, 2 & 3. Making samples of each of the different types I was able to confirm the first 4 bytes of every file was the hex values “ABCD0054”. I was able to submit the format to PRONOM and it was added and given the PUID fmt/1399.

One of the other features of DiskDoubler was an ability to create a Self Extracting Archive (SEA). An sea file could contain a compressed file but also contained the code to uncompress itself. This was mostly seen with the Stuffit software, but there were many other compression tools which could write to this format. The Stuffit formats have been added to PRONOM which include identification of an SEA created by stuffit, but the SEA created by DiskDoubler is different and needs to be added.

Shockwave Audio

Ok, confession time.

There is only a couple moments in my tech history which had a profound effect on me, enough to sear the memory of the moment into my brain. When I was in college around 1997 I had a decent CD collection and I had learned how to copy those AIFF files off the disc and use them on my trusty PowerCenter Pro. These files were huge, at the time. I knew a regular size song would take up around 50MB on my hard drive. This was a lot of space back in 1997, but I could then mix them with other songs, something I did sometimes for friends I had on the dance team. I didn’t have a CD burner at the time so I would transfer them to cassette tape. I know, but remember this was the 1990’s when everything was changing and expensive.

One night I was exploring the world wide web and I happened across someone sharing a few songs. I assumed they were just clips as they were only 5MB in size, a tenth the size they should be. I downloaded the song, which of course still took a few minutes back in those days. When I played the song, I was dumbfounded, it was the whole song. I was completely confused. How could they take a 4+ minute song and compress it down to under 5MB? This was amazing.

I started grabbing every song I could find. Before long I had quite the collection. And before you judge me for downloading music from the web, this was a couple years before the advertisement we all remember reminding us that we wouldn’t steal a car so why would we steal music.

The files I found on the internet were MP3 files, the same we are familiar with today. Back then creating MP3 files wasn’t easy. MP3 was actually a licensed product so you had to get a little creative in order to make them. On my Macintosh PowerCenter Pro, there were even fewer options. I was already familiar with the sound editing application from Macromedia called SoundEdit 16, it was the tool I used to do all my editing. I found there was a plugin you could add which allowed export to a format called Shockwave Audio. This was meant for use in Macromedia’s Director application to add sound to the growing Flash animation industry. Once I got the plugin and installed I couldn’t stop making files and I made them as fast as I could. For a whole album this could take over an hour on my hardware, but it was worth it. Before long I had a large collection of popular music ready to play at a moments notice. My player of choice was MacAMP, a sibling of the popular WinAMP. I even borrowed some equipment from a friend who DJ’d on the weekends and DJ’d a college dance. I lugged my whole PowerCenter Pro tower and 17in trinitron monitor over to the school. It was so much fun and folks didn’t understand when they asked to see my CD collection.

Enough about transgressions from my youth, lets talk about the Shockwave Audio format.

To create a SWA file you would first need SoundEdit 16 Version 2. Then the plugins to enable export. This would only run on PowerPC computers running Macintosh OS or Classic in Mac OS X. For this post I pulled out my trusty PowerBook G4 Titanium running MacOS 9 and MacOS X 10.2. Installed SoundEdit 16 and the plugins in the Xtras folder and we are good to go.

Before you export you need to set what bitrate you prefer for the final file, giving you the option of 8KBits up to 160KBits per second. The higher the bitrate the longer it took and made larger files.

SoundEdit 16 had a native audio format and also frequently used the SoundDesigner II format to save the uncompressed files. On a Macintosh you had to be careful as these formats did not travel well to other systems on account of the resource forks associated with the data.

Because these SWA files were meant to be used in websites and other non-Mac systems, they did not have a resource fork, but had the Creator/Type codes, SwaT/SHCK. An extension wasn’t necessary for use on your Macintosh, but it was best to use .swa.

Here is what the data looks like for a SWA file.

Even though the SWA format uses MPEG compression, this is not a typical header you might see in a MP3. There was no ID3 tags at the time so not much in terms of metadata.

General
Complete name                            : tone2.swa
Format                                   : MPEG Audio
File size                                : 80.7 KiB
Duration                                 : 5 s 166 ms
Overall bit rate mode                    : Constant
Overall bit rate                         : 128 kb/s
FileExtension_Invalid                    : m1a mpa mpa1 mp1 m2a mpa2 mp2 mp3

Audio
Format                                   : MPEG Audio
Format version                           : Version 1
Format profile                           : Layer 3
Format settings                          : Joint stereo / MS Stereo
Duration                                 : 5 s 172 ms
Bit rate mode                            : Constant
Bit rate                                 : 128 kb/s
Channel(s)                               : 2 channels
Sampling rate                            : 44.1 kHz
Frame rate                               : 38.281 FPS (1152 SPF)
Compression mode                         : Lossy
Stream size                              : 80.7 KiB (100%)
ffprobe -i tone2.swa 
[mp3 @ 0x155704a60] Format mp3 detected only with low score of 25, misdetection possible!
[mp3 @ 0x155704a60] Skipping 324 bytes of junk at 0.
[mp3 @ 0x155704a60] Estimating duration from bitrate, this may be inaccurate
Input #0, mp3, from 'tone2.swa':
Duration: 00:00:05.15, start: 0.000000, bitrate: 128 kb/s
Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 128 kb/s

There are a few consistencies among all my files. They all begin with the hex values “00000140000000030000” for the first 10 bytes and all of them seem to have the string “MACRZ” at offset 36. I haven’t been able to find a open specification for this file format, so we will have to go with what we can find in the samples. According to ffprobe from above, there is 324 bytes of a header before the first MP3 frame starts.

MPEG signatures are difficult, there are no headers, just a sequence of frames. This is why there are often so many identification conflicts with the MP3 format. These SWA files indeed identify as MP3 files, but with a mismatch extension.

filename : 'tone2.swa'
filesize : 82661
modified : 1970-01-01T00:00:00-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/134'
    format  : 'MPEG 1/2 Audio Layer 3'
    version : 
    mime    : 'audio/mpeg'
    class   : 'Audio'
    basis   : 'byte match at 0, 4088 (signature 5/9)'
    warning : 'extension mismatch'

If we wanted to distinguish an SWA from an MP3 we would need to create a new signature and give it priority over the MP3 signature. There is enough of a header this would be possible and easier, but since they are, in reality, just MP3 files does it matter? Trying to play a SWA on a modern computer is only possible if you change the extension to MP3.

If you want to take a look at some samples you can grab a couple I made on my GitHub page or check out some commercially made files for an awesome Star Trek Starship Creator game.

Hemera Photo-Object

Many years ago I dabbled in a little Graphic Design. Working for a commercial printer in the Pre-Press area, I was very familiar with all things graphics, but never had a great talent for design, especially drawing. I often needed the random clip art for a design I was working on, so I purchased the Hemera, The Big Box of Art, probably from my local CompUSA if that dates me.

Hemera Big Box of Art

The cool thing about clip art from Hemera is it was not your usual JPG or TIFF format, it was in a special Photo-Object format. This format included the raster image, but also included a mask or alpha channel for the main object. They marketed this format as an alternative to the sometimes larger formats of the day. GIF files didn’t have the color depth and PNG was new enough, Hemera was probably hoping this format would be the next greatest thing to happen to clip art.

A Hemera Photo-Object has the extension HPI. Lets take a closer look at a file and see what is under the hood. I pulled this file from Disc 1 on Archive.org

The HPI file has a unique header which should make identification really easy. But what do we see starting at offset 32? A JFIF! Just after a 32 byte header the file has a standard JPG file hidden inside. Now a standard JPG file does not have the ability to support an alpha channel so there must be something else they have within to mask this file. Lets look for the EOF file marker for the JPG format.

Well, well, well. It appears the JPG file is then followed by a standard PNG! Sneaky. The entire HPI file is a 32 byte HPI header, a JPG, followed by a PNG. One could easily carve out each of the formats and save as separate files if needed. There is a script you can use to do this for you, written by Ed Halley. The original Hemera software won’t run on modern systems.

Hemera had a good run for about 10 years before selling off their assets in 2004 to another stock image company. At one point Hemera even purchased the rights to all of Corel’s Premium photo library which I covered in my article about the Kodak PhotoCD format.

Image PAC Files

I wouldn’t be surprised if you have never heard of an Image PAC file. You may know it by the more common name Kodak Photo CD Image. Kodak’s PhotoCD format actually refers to the system and Disc format used to store images for compatibility with other hardware. The Kodak PhotoCD format was pretty advanced for its time, it original purpose was to store scanned 35mm film to a disc which was playable on computers and other hardware. In fact, because it was meant to store 35mm rolls as they were scanned it was the first use of the linked Multi-session CD format made standard by the orange book specification. The format was widely adopted at first, but eventually lost favor and was abandoned by 2004.

The Kodak PhotoCD format was also used on many commercial CD-ROM products. One example was the Corel Professional CD series. Below is a photo of a case of 200 CD’s I recently acquired. Each has around a hundred PCD images and viewing software on disc. Most discs can be viewed here. Or you can view their “Sampler” CD-ROM.

The actual PCD image file format was referred to as an Image PAC File. The format was unique in the fact it has multiple resolutions built into a single file. It also stored the raster data in a format called Photo YCC color encoding metric, developed by Kodak. This requires conversion to RGB for many uses. Adobe Photoshop for many years had an import filter for the format built in which included ICC profiles for properly converting the source to a destination colorspace, but support was dropped in CS3 of their products.

Photoshop Kodak PCD import

The Image PAC PCD format was a proprietary format which Kodak protected aggressively, even to the point of threatening legal action to those who attempted to reverse engineer the format. This frustrated developers and was probably part of the reason the format was abandoned. Of course this didn’t deter some curious developers and was partially reversed engineered and is available in the NetPBM library formally knows as PBMPlus. The tool hpcdtoppm was developed to convert PCD to PBM.

The trick in preserving older obsolete formats is to find a way to first identify them, gather significant properties, then migrate to a modern format if appropriate with minimal loss of data. Luckily most PCD files have the ascii string “PCD_IPI” starting around offset 2048. This is basically how the PRONOM registry identifies the format and has assigned it fmt/211. Exiftool also supports the format in identifying some of the significant properties.

ExifTool Version Number         : 12.62
File Name                       : 136009.PCD
Directory                       : /Users/thorsted/Desktop/blog/Kodak/PCD
File Size                       : 3.6 MB
File Modification Date/Time     : 2023:06:23 10:48:55-06:00
File Access Date/Time           : 2023:06:26 23:43:50-06:00
File Inode Change Date/Time     : 2023:06:27 11:18:38-06:00
File Permissions                : -rwx------
File Type                       : PCD
File Type Extension             : pcd
MIME Type                       : image/x-photo-cd
Specification Version           : 0.6
Authoring Software Release      : 3.0
Image Magnification Descriptor  : 1.0
Create Date                     : 1993:09:20 07:35:34-06:00
Image Medium                    : Color reversal
Product Type                    : 116/01 SPD 0064  #00
Scanner Vendor ID               : KODAK
Scanner Product ID              : FilmScanner 2000
Scanner Firmware Version        : 2.21
Scanner Firmware Date           : 
Scanner Serial Number           : 0296
Scanner Pixel Size              : 0b.30 micrometers
Image Workstation Make          : Eastman Kodak
Character Set                   : 95 characters ISO 646
Photo Finisher Name             : HADWEN GRAPHICS
Scene Balance Algorithm Revision: 3.1
Scene Balance Algorithm Command : Neutral SBA On, Color SBA On
Scene Balance Algorithm Film ID : Unknown (131)
Copyright Status                : Restrictions apply
Copyright File Name             : RIGHTS.USE
Orientation                     : Horizontal (normal)
Image Width                     : 3072
Image Height                    : 2048
Compression Class               : Class 1 - 35mm film; Pictoral hard copy
Image Size                      : 3072x2048
Megapixels                      : 6.3

Exiftool is able to gather much of the important properties including an original creation date and the pixel dimensions. It would be nice if was able to mention each of the resolution options as some later Pro versions of PCD had a 64 base for resolutions of 4096 x 6144.

Migration to a more modern open format is a common preservation strategy. The National Archives and Records Administration has the format NF00224 listed as needing to migrate to JPG, while others prefer migration to TIFF. Others have learned valuable lessons attempting to find the right method for migration. There is a right way and a wrong way as the Center for Digital Archaeology learned. The easiest method is to use the popular ImageMagick command-line tool.

thorsted$ identify 136009.PCD 
136009.PCD PCD 768x512 768x512+0+0 8-bit YCC 3.44727MiB 0.020u 0:00.006
thorsted$ convert 136009.PCD[5] -colorspace sRGB +compress 136009.tif
thorsted$ identify 136009.tif
136009.tif TIFF 3072x2048 3072x2048+0+0 8-bit sRGB 18.0004MiB 0.000u 0:00.000

ImageMagick along with most other tools like IrfranView and XnView only see the base resolution of 768 x 512, but with an extra little addition to the command by adding “[5]” after the filename if forces the conversion to use the “Fifth” 16 Base resolution which is the highest resolution on most PCD files, the Pro versions may have higher. The other issue is the colorspace conversion. It is known there could be a loss of highlights. This webpage illustrates different tools and the issues with highlights. You can see the difference if I use -colorspace RGB instead of sRGB.

ImageMagick conversion using RGB vs sRGB colorspace setting.

Other tools such as the open source pcdtojpeg and paid pcdMagic both work well, but the only tool I have tested so far which keeps the original metadata is pcdMagic.

ExifTool Version Number         : 12.62
File Name                       : 136009_1.tif
Directory                       : .
File Size                       : 38 MB
File Modification Date/Time     : 2023:06:27 12:06:26-06:00
File Access Date/Time           : 2023:06:27 12:06:29-06:00
File Inode Change Date/Time     : 2023:06:27 12:06:27-06:00
File Permissions                : -rw-r--r--
File Type                       : TIFF
File Type Extension             : tif
MIME Type                       : image/tiff
Exif Byte Order                 : Little-endian (Intel, II)
Subfile Type                    : Full-resolution image
Image Width                     : 3072
Image Height                    : 2048
Bits Per Sample                 : 16 16 16
Compression                     : Uncompressed
Photometric Interpretation      : RGB
Image Description               : color reversal: Unknown film. SBA settings neutral SBA on, color SBA on
Make                            : KODAK
Camera Model Name               : FilmScanner 2000
Strip Offsets                   : 1622
Samples Per Pixel               : 3
Rows Per Strip                  : 2048
Strip Byte Counts               : 37748736
Planar Configuration            : Chunky
Software                        : pcdMagic V1.4.19
Modify Date                     : 2023:06:27 12:06:26
Copyright                       : Copyright restrictions apply - see copyright file on original CD-ROM for details
Exif Version                    : 0231
Date/Time Original              : 1993:09:20 07:35:34
Create Date                     : 1993:09:20 07:35:34
Offset Time                     : -06:00
User Comment                    : color reversal: Unknown film. SBA settings neutral SBA on, color SBA on
Color Space                     : Uncalibrated
File Source                     : Film Scanner
Profile CMM Type                : Unknown (KCMS)
Profile Version                 : 2.1.0
Profile Class                   : Display Device Profile
Color Space Data                : RGB
Profile Connection Space        : XYZ
Profile Date Time               : 1998:12:01 18:58:21
Profile File Signature          : acsp
Primary Platform                : Microsoft Corporation
CMM Flags                       : Not Embedded, Independent
Device Manufacturer             : Kodak
Device Model                    : ROMM
Device Attributes               : Reflective, Glossy, Positive, Color
Rendering Intent                : Perceptual
Connection Space Illuminant     : 0.9642 1 0.82487
Profile Creator                 : Kodak
Profile ID                      : 0
Profile Copyright               : Copyright (c) Eastman Kodak Company, 1999, all rights reserved.
Profile Description             : ProPhoto RGB
Media White Point               : 0.9642 1 0.82489
Red Tone Reproduction Curve     : (Binary data 14 bytes, use -b option to extract)
Green Tone Reproduction Curve   : (Binary data 14 bytes, use -b option to extract)
Blue Tone Reproduction Curve    : (Binary data 14 bytes, use -b option to extract)
Red Matrix Column               : 0.79767 0.28804 0
Green Matrix Column             : 0.13519 0.71188 0
Blue Matrix Column              : 0.03134 9e-05 0.82491
Device Mfg Desc                 : KODAK
Device Model Desc               : Reference Output Medium Metric(ROMM)
Make And Model                  : (Binary data 40 bytes, use -b option to extract)
Image Size                      : 3072x2048
Megapixels                      : 6.3
Modify Date                     : 2023:06:27 12:06:26-06:00

There is a way to convert the PCD to TIF using ImageMagick, then using Exiftool to map some of the metadata over to the new TIFF file. It would look something like this:

exiftool -addtagsfromfile 136009.PCD '-EXIF:DateTimeOriginal<PhotoCD:CreateDate' '-EXIF:CreateDate<PhotoCD:CreateDate' '-ExifIFD:SerialNumber<PhotoCD:ScannerSerialNumber' '-ExifIFD:ExifImageWidth<PhotoCD:ImageWidth' '-ExifIFD:ExifImageHeight<PhotoCD:ImageHeight' '-IFD0:Make<PhotoCD:ScannerVendorID' '-IFD0:Model<PhotoCD:ScannerProductID' '-IFD0:Orientation<PhotoCD:Orientation' '-IFD0:Copyright<PhotoCD:CopyrightStatus' 136009.tif