Apple Mail

October 27, 2023 by Thor

There really is no “Macintosh Format”, but there sure are a lot of formats you only find on the MacOS. From Resource Forks and iWork formats to unique sound formats, MacOS has them all! The majority of cross-platform software vendors have done a much better job in recent years of making their file formats the same across platforms, but Apple loves to make things unique, just for its platform.

Take EMLX for example. There seems to be a trend of adding “X” to the end of an older format to breathe new life into it. The EML format, or Electronic Mail, has existed for a few decades now, but in 2005 Apple updated their Apple Mail application to use a new format, EMLX.

As far as I know, Apple hasn’t released any documentation on the EMLX format, but many folks out there have asked the question and have been able to “reverse engineer” the format. Let’s take a look.

An EMLX file consists of three parts:

bytecount on first line;
email content in MIME format (headers, body, attachments);
Apple property list (plist) with metadata.

The bytecount is a variable number: the total number of bytes from the start of the MIME content, including any HTML, to the start of the XML property list. Let’s look at a simple EMLX.

The byte count is on line 1, the MIME email (EML) takes up the next 556 bytes, and the XML plist sits at the end. You may ask, what is a plist? Well, it is another Apple (originally NeXTSTEP) invention which is embedded throughout the MacOS operating system. A plist is usually an XML file of keys and values, but it can also be in a binary format. The plist can contain properties of the email within Apple Mail, like special color flags, junk tagging, date received and date last viewed.

If you do happen across an EMLX file or a group of them, there are a few tools you can use to convert them to a plain old EML. There are Python libraries and many other tools to do the job.
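As a rough illustration of the three-part layout described above, here is a minimal Python sketch that splits an EMLX into its EML and plist parts. The function name `split_emlx` is my own, and since the format is undocumented, it assumes the first line holds only the byte count and that everything after the counted bytes is the property list.

```python
import plistlib


def split_emlx(raw: bytes):
    """Split raw EMLX bytes into (message_bytes, plist_dict)."""
    # Part 1: the first line holds the byte count as ASCII digits.
    first_line, _, rest = raw.partition(b"\n")
    byte_count = int(first_line.decode("ascii").strip())

    # Part 2: exactly byte_count bytes of MIME message (the plain EML).
    message = rest[:byte_count]

    # Part 3: whatever remains is assumed to be the property list.
    trailer = rest[byte_count:].lstrip()
    metadata = plistlib.loads(trailer) if trailer else {}
    return message, metadata
```

Writing the returned `message` bytes to a new file should give a plain EML, and they can also be handed to `email.message_from_bytes` from the standard library to read the headers.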

But first we need to be sure of identification beyond the extension. Adding this file format to PRONOM would help in identification for preservation purposes. If a sample is run against the current PRONOM signatures today, we get:

filename : '9.emlx'
filesize : 18582
modified : 2023-10-26T22:16:25-06:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/950'
    format  : 'MIME Email'
    version : '1.0'
    mime    : 'message/rfc822'
    class   : 'Text (Structured)'
    basis   : 'byte match at [[31 17] [599 4] [339 6] [426 6] [90 14]]'
    warning : 'extension mismatch'

Because the format has an EML plain text format within its structure, it is assumed to be an EML file. While technically accurate, identifying it as a unique EMLX format would be beneficial in a preservation system so you can properly assign risk and choose the right tool to parse or migrate.

In looking at the three parts of an EMLX file, we know the EML portion cannot distinguish the two formats, as it has the same structure in both. The byte count on the first line is variable, so there is no static byte sequence to use for identification. That leaves the plist section at the end to tell them apart.

The PRONOM entry for a Plist looks for the typical XML strings present in most XML files, but then uses the root element <plist version="1.0"> for identification. We could combine the existing EML signature and the Plist signature to identify an EMLX, or just take the existing EML signature and add a small byte sequence for the closing </plist> tag near the EOF. Either way the new signature would need priority over EML, and both approaches would essentially accomplish the same thing.

Take a look at the latter idea on my GitHub page and tell me which makes the most sense.
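To make the second idea concrete, here is a small Python sketch of the same logic outside of PRONOM. It is only a heuristic of my own, not a PRONOM signature: the name `looks_like_emlx` and the 64-byte tail window are assumptions, and it simply checks for a numeric first line, a plist after the counted bytes, and a closing </plist> tag near the EOF.

```python
import re


def looks_like_emlx(raw: bytes, tail_window: int = 64) -> bool:
    """Heuristic: numeric first line, plist trailer, </plist> near EOF."""
    first_line, sep, rest = raw.partition(b"\n")
    # The first line must contain only digits (plus optional whitespace).
    if not sep or not re.fullmatch(rb"\d+\s*", first_line):
        return False
    byte_count = int(first_line.strip())
    if byte_count > len(rest):
        return False
    # Everything after the counted message bytes should be the plist.
    trailer = rest[byte_count:]
    return b"<plist" in trailer and b"</plist>" in raw[-tail_window:]
```

A plain EML fails the first test because its first line is a header rather than a number, which is the distinction the extra signature needs to capture.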

Common Ground

October 20, 2023 by Thor

If digital preservation had an extension it most likely would be .DP

Unfortunately, it’s taken. Say hello to Digital Paper.

In the early 1990’s, folks started to share documents with each other through their phone lines. The early internet, BBS, AOL, CompuServe and the like allowed people to share ideas through applications like Word/WordPerfect Documents. Most people had a copy of the popular software and that software could open documents from their competitors, but fonts were always a problem. Technically a font is software as well and needs a license to be used. Also, printers at the time dictated what the document might look like when opened, so your document might look different on someone else’s computer. This led to a few innovations in the software market, among them Digital Paper.

The idea is simple: create a format which could be opened with a free viewer and which includes all the parts needed to make it look and print just like it was intended to. You may have already guessed who the winner in this space turned out to be: yes, the PDF format. You can’t tell the history of the PDF format without mentioning others that tried their luck to be the leader in portable document formats. WordPerfect’s Envoy format was one, Common Ground Digital Paper was another.

No Hands Software, which started in 1990, developed the idea of making your documents truly portable. They released the Common Ground Maker and Viewer software in 1993. By 1996 the company was doing so well they were bought for $6 million by Hummingbird Ltd. PDF soon became so ubiquitous that formats like Common Ground and Envoy fizzled out. That doesn’t mean they didn’t have a big impact, and they can still be found in quite a few places.

Apple was one of the bigger users for awhile, but the format can still be found floating around today.

The Common Ground Digital Paper format has some similarities to the PDF format, but the biggest difference is that the format is proprietary and not open like PDF. Another difference is that you could embed the viewer into the file, which would make an executable on both Windows and Macintosh. Very convenient for sending to those who may not have the viewer or can’t install the viewer on their system.

Common Ground had two different viewers, a pro viewer with more features and a Mini Viewer with basic features, which was free to download and distribute from their website. Unfortunately, they linked to an FTP site which is no longer available, so finding the viewers today can be difficult.

I came across a boxed version 1 for Macintosh of the software a few years back, but have yet to find other full versions. The software did change hands a bit, but seems to have topped out at Version 4 in the late 1990’s. Let’s take a look at the file format for the samples we do have.

Version 1 for the Macintosh was the first I believe, coming to Windows shortly afterwards. The format was even assigned a MimeType for use on the web and the application gives us a little insight into the format.

The commonground file format does have versions (two at the moment). They *are* internally documented with a file signature, allowing commonground viewers to automatically handle both old and new format files. Therefore, I don’t believe a ‘version’ parameter is needed.

A Content-Type of “application/commonground” indicates a document in the Common Ground portable file format, also known as Digital Paper.

Encoding considerations: Common Ground files are in a binary format. Some encoding will be necessary for MIME mailers as in application/octet-stream. Common Ground files for the Macintosh are encoded in the data fork of a Macintosh file. The file type is APPL, the creator is CGVM.

If we look at a sample from Version 1 for the Macintosh we find the following hex values:

hexdump -C CG-s01.dp | head
00000000  00 00 03 56 00 00 04 d9  43 47 44 43 00 00 00 00  |...V....CGDC....|
00000010  96 6c 00 07 04 b4 03 de  00 00 00 00 02 da 02 28  |.l.............(|
00000020  00 11 02 ff 0c 00 ff ff  ff ff 00 00 00 00 00 00  |................|
00000030  00 00 02 28 00 00 02 da  00 00 00 00 00 00 00 01  |...(............|
00000040  00 0a 00 05 00 05 00 15  02 23 00 32 00 05 80 02  |.........#.2....|
00000050  00 15 7f fe 00 2c 00 09  00 03 06 47 65 6e 65 76  |.....,.....Genev|
00000060  61 00 00 03 00 03 00 0d  00 0c 00 2e 00 04 00 00  |a...............|
00000070  00 00 00 2b 06 11 07 54  65 73 74 69 6e 67 00 01  |...+...Testing..|
00000080  00 0a ff e1 ff e2 02 f9  02 46 00 03 00 00 00 0d  |.........F......|
00000090  00 00 00 28 02 d5 01 05  05 2d 20 31 20 2d 00 ff  |...(.....- 1 -..|

In all the samples I have, the first 8 bytes are not consistent, but the next four bytes are: CGDC, which happens to be the registered type on the Macintosh. Convenient. But it appears later versions are not the same.

hexdump -C MANUAL.DP | head
00000000  00 00 00 20 00 00 b7 f4  44 50 4c 33 00 00 00 04  |... ....DPL3....|
00000010  00 00 00 00 00 00 00 00  3b 60 53 df 00 00 00 00  |........;`S.....|
00000020  00 00 00 18 00 00 b4 da  00 00 b4 c2 00 00 03 3e  |...............>|
00000030  78 00 79 00 7a 00 7b 00  00 00 00 77 01 01 00 0c  |x.y.z.{....w....|
00000040  00 01 02 01 00 00 00 97  fe ed f0 05 00 b7 86 04  |................|
00000050  5f 05 f7 01 00 03 ed f0  02 00 3d 00 ff 45 75 72  |_.........=..Eur|
00000060  6c 20 00 01 07 ff bf 05  9f 00 01 08 a3 05 fb ba  |l ..............|
00000070  02 fa f1 00 ff ff 00 11  ff 68 74 74 70 3a 2f 2f  |.........http://|
00000080  77 ff 77 77 2e 47 53 50  2e 43 b9 43 1c 0f 03 04  |w.ww.GSP.C.C....|
00000090  95 05 c8 0d 00 cc fb 05  e3 13 06 15 6d 61 69 6c  |............mail|

hexdump -C dpwhite.dp | head
00000000  00 00 00 18 00 01 79 17  44 50 4c 32 00 00 00 00  |......y.DPL2....|
00000010  00 00 00 00 00 00 00 00  00 00 00 18 00 01 76 de  |..............v.|
00000020  00 01 76 c6 00 00 04 b2  00 00 00 00 00 00 00 00  |..v.............|
00000030  00 00 00 1e 01 01 00 0c  00 00 01 01 00 00 00 12  |................|
00000040  00 01 00 01 00 00 00 00  0c 4e 09 60 01 2c 01 2c  |.........N.`.,.,|
00000050  00 64 00 00 00 02 00 00  00 00 00 a2 01 01 00 0c  |.d..............|
00000060  00 01 02 01 00 00 00 e2  fa ed f0 22 ed f1 0c 4e  |..........."...N|
00000070  09 60 00 ff e1 01 26 0a  83 08 3b ff ff 6a ff 6a  |.`....&...;..j.j|
00000080  0c e4 09 f6 01 ff 2c 01  2c 00 08 00 64 00 df 00  |......,.,...d...|
00000090  01 01 00 03 ed f0 0f 00  79 0a 1c 0f 28 07 42 41  |........y...(.BA|

These files are from a later version and have a different string at byte 8. DPL2 & DPL3. In the MiniViewer you can request document information and it provides some basic metadata for each file.

I only have one example of the DPL3, but a couple examples of DPL2, and it seems like DPL2 comes from a Version 3 DP Maker and DPL3 comes from Version 4 Maker. Need to see if I can find a Version 2 file and see if it follows the same pattern.
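Based only on the three samples above, a quick check for the marker at byte 8 could be sketched in Python like this. The function name and the labels are my own, and the version mapping is the tentative one described above rather than anything documented.

```python
# Four-byte markers seen at offset 8 in the samples above.
# The Maker versions in the labels are guesses from a handful of files.
COMMON_GROUND_MARKERS = {
    b"CGDC": "Common Ground (CGDC, seen in the Version 1 sample)",
    b"DPL2": "Common Ground (DPL2, possibly from a Version 3 Maker)",
    b"DPL3": "Common Ground (DPL3, possibly from a Version 4 Maker)",
}


def identify_common_ground(header: bytes):
    """Return a label if bytes 8-11 hold a known marker, else None."""
    return COMMON_GROUND_MARKERS.get(header[8:12])
```

Passing in the first 16 bytes of a file is enough, since the first 8 bytes vary between samples and are skipped.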

Two of my favorite CD-ROM’s on Internet Archive are Dr. Dobb’s The Essential Books on File Formats and Internet File Formats, both have copies of the Mini Viewer.

One of the features similar to PDF is the ability to password protect certain features. This is what the document information looks like.

The header is the same, but the plain text usually seen in the file is no longer visible, so it appears the rest of the file is encrypted.

hexdump -C password.dp | head 
00000000  00 00 5d 95 00 00 06 94  43 47 44 43 00 00 00 01  |..].....CGDC....|
00000010  8e 3b 18 7e c5 16 f8 e0  0f f5 6f 32 2f 34 36 81  |.;.~......o2/46.|
00000020  4b 8a 03 da 9e 1a 85 6c  36 e4 39 f2 5a 2a a2 5f  |K......l6.9.Z*._|
00000030  81 83 65 ee 9c 16 d0 2d  2d c3 04 df 69 c8 06 0d  |..e....--...i...|
00000040  77 df 27 19 33 59 f6 05  61 4e 2c a6 58 27 47 26  |w.'.3Y..aN,.X'G&|
00000050  fe 6b 3c 06 7e cb 7f fb  33 f8 64 ed 05 54 b4 7d  |.k<.~...3.d..T.}|
00000060  c7 b5 e3 c2 df 40 53 63  ef 8e 10 1c c7 58 bd 28  |.....@Sc.....X.(|
00000070  9b 8a 2c 8f ae 82 33 f7  ff d4 3c 96 5c b4 08 69  |..,...3...<.\..i|
00000080  1f 00 af ce a7 56 93 27  07 cc 39 97 17 22 49 d7  |.....V.'..9.."I.|
00000090  5b 89 9b e6 b7 b1 5c 38  75 ba 08 ee 66 d0 9a d2  |[.....\8u...f...|

This file format is not currently in PRONOM. From what I have gathered I could add three signatures. There could be some other variations out there and the password protection needs to be considered. Maybe I’ll take Nick Gault’s offer and request the format which was available starting in the middle of 1995. Think they’ll deliver?

No bad deed….

October 13, 2023 by Thor

I had access to my first Macintosh computer around 1987. My father brought it home and I spent hours on it playing games and occasionally writing reports for school. The Macintosh Plus computer had one floppy drive and no hard drive. I remember playing the game Orbiter which had two floppy disks and right in the middle of game play it would pause and ask me to insert disk 2, then quickly ask for disk 1 again. The struggle was real. I spent years using many different Macintosh computers and now own more than I wish to admit. I’m preserving them!

The wild world of digital preservation has been a little lacking on the Macintosh side of things, as I have come to realize. There is still not a great way to manage Resource Forks in many preservation systems, and the identification tools are mainly focused on the data bytestreams and not on the system-specific attributes the Macintosh often used.

The PRONOM registry has either referenced early Macintosh specific formats or missed them entirely so I have been slowly working on a few to close that gap.

Interestingly enough, many Microsoft programs initially made their GUI debuts on the early Macintosh before making their way to Windows. Excel is one I am working on, as Version 1 is not identifiable in PRONOM, it was Macintosh only at the time.

Another is PowerPoint, I recently submitted two new signatures to PRONOM.

fmt/1747: Microsoft PowerPoint Presentation v2.x. Full entry added.
fmt/1748: Microsoft PowerPoint Presentation v3.x. Full entry added.
fmt/1866: Microsoft Powerpoint for Macintosh v.2. Full entry added.
fmt/1867: Microsoft Powerpoint for Macintosh v.3. Full entry added.

PowerPoint was initially released in 1987 on the Macintosh platform. It was developed by a company called Forethought. Version 1.0 on the Macintosh was released under this name, until the company was bought by Microsoft only three months after the release. The history of PowerPoint can be found on the website of Robert Gaskins, one of the original developers, and in the book he wrote. The available information provided by Microsoft is only for the OLE format, covering versions 4.0 until 2003.

So, let’s take a look at the original PowerPoint file format, before OLE.

   Type/Creator      RF      DF  Date         Filename
f  SLDS/PPNT         0       932 Oct 10 19:10 PowerPoint-v1

Luckily the early PowerPoint files did not have a Resource Fork. The Data Fork, if you haven’t noticed, has an interesting set of hex values at the beginning of the file: 0BADDEED is the first 4 bytes. Now look at a PowerPoint version 2 file from Windows.

The file format is the same, but because of the weird world of endianness, the first few bytes are in reverse order, EDDEAD0B.
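The two byte orders can be checked with a few lines of Python. This is just a sketch of the comparison described above; the function name is hypothetical, and it only looks at the first four bytes, not the rest of the structure.

```python
import struct

MAGIC = 0x0BADDEED  # the value both platforms store, in their own byte order


def powerpoint_magic_byte_order(header: bytes):
    """Return 'big' or 'little' if the first 4 bytes hold the magic number."""
    if len(header) < 4:
        return None
    # Macintosh files store the value big-endian: 0B AD DE ED.
    if struct.unpack(">I", header[:4])[0] == MAGIC:
        return "big"
    # Windows files store the same value little-endian: ED DE AD 0B.
    if struct.unpack("<I", header[:4])[0] == MAGIC:
        return "little"
    return None
```

A signature for identification needs both byte sequences, since the same number appears reversed on disk depending on the platform that wrote the file.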

Obviously we need to discuss this magic number and the meaning behind “Bad Deed”. This question was asked previously by the digital preservation community. I have a previous blog post about the use of words for the magic number CAFEBEEF as it was used with JAVA class files and Express Publisher in the 1990’s. BADDEED looks like another clever use of hex values that form words. But was there a story behind the words? Joe Carrano asked if this string might be hexspeak. I wanted to know more, so I asked someone who might know.

Robert Gaskins was kind enough to chat with me for a bit about the early days of PowerPoint.

I had a theory on the possible meaning behind BADDEED, so I asked him what the feeling was like between Apple and Microsoft at the time. I had heard for years that PowerPoint was originally created for the Macintosh, but Robert informed me:

In fact, PowerPoint was designed first for Microsoft Windows, and its first spec shows that: “All the screen shots, menus, and dialogs were set up to look like Microsoft Windows, not like Macintosh.” (Gaskins, Sweating Bullets, p. 92) You can see that spec here.

A year later, we concluded that we would be forced to ship on Mac first, although we still thought that Windows was the big opportunity and thought that Mac was risky. “We just didn’t think we could successfully ship a product for Windows, yet, though we planned to later.” (Gaskins, Sweating Bullets, p. 105) The considerations are summarized in my June 1986 product marketing document.

Of course, we turned out to have been right all along. PowerPoint on Mac was much loved, but sales remained poor because Mac sales were so poor. It was only after we shipped on Windows that PowerPoint gained the dominant market share which has characterized it ever since, and Windows PPT outsold Mac PPT very quickly. (Gaskins, Sweating Bullets, p. 403)

So my original thought was that there were some bad feelings around the Apple versus Microsoft battle, which has been the sentiment for quite some time. When I asked if any of that influenced the use of BADDEED, I was told:

So, far from being disgruntled by expanding PowerPoint to Windows, that had been our goal all along, and its achievement was the most important success we had.

I judge that you are fully aware of all that, and that your question is more, “was there any bad deed signified by the Mac hex value chosen?” No, it was just the poverty of choice when you only have six letters.

So there you have it. The hex value 0x0BADDEED was simply chosen from the limited set of words hexadecimal can spell. I guess I should never let the truth get in the way of a good story.

I continued to have a wonderful conversation with Robert and also asked him for some details on the rest of the PowerPoint file format. I was hoping there might be some documentation out there explaining the early format before Microsoft took over. Robert said:

I don’t know of any such documentation apart from the official Microsoft support files available online. I don’t have any such information. I know that Dennis Austin deposited some of our working files at the Computer History Museum (not online):

https://archive.computerhistory.org/resources/access/text/finding-aids/102733943-Austin/102733943-Austin.pdf

and it’s likely that some information is there–if nothing else, it claims to contain a source code listing for PPT 1.0 which would contain the code to read the file format.

So there might be some information at the Computer History Museum worth looking into.

As far as I could tell from the available online information, there are a few differences between Version 1.0 and Version 2.0, the biggest being the fact that 1.0 did not have an option to print in color, among a few other minor things. Here is a screenshot of a page from the Microsoft PowerPoint 2.0 documentation on archive.org.

I suppose with the signature additions of the Macintosh and Windows versions 2.0 and 3.0 of the PowerPoint file format in PRONOM, that should cover most needs. Currently my PowerPoint 1.0 files identify as 2.0 files, so I may need to have them adjust the PUID to include both versions 1.0 and 2.0 as they are so similar. If I am able to find a difference or get my hands on the original source code I may find a better solution.

Quicktime MooV

October 6, 2023 by Thor

During the 1990’s Apple Quicktime became the dominant digital media standard. It is the basis for the MPEG-4 format which is used everywhere now. Technically the Quicktime Movie format is a container or wrapper which can hold a variety of Video and Audio streams.

The basic unit of a Quicktime Movie is an atom. The MooV atom is the most important piece of a Quicktime Movie. Without it and the “mvhd” header atom, all the characteristics of the movie are lost.

Having the MooV atom missing from your movie file seems like it would be a rare thing, but it may happen more often than you think.

What happens when you come across a Quicktime file on an HFS disk, like one of these: https://archive.org/details/quick-clips-cd

If you try and open the movie you might see this.

MediaInfo doesn’t know what to make of the file. You can see the hex values from the beginning of the file, there clearly is no MooV atom.
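For a quick check without MediaInfo, the top-level atoms of a data fork can be listed with a short Python sketch. It assumes the basic QuickTime layout of a 4-byte big-endian size followed by a 4-byte type for each atom; the function names are my own. Keep in mind the data fork of an old two-fork movie may not parse as clean atoms at all.

```python
import struct


def top_level_atoms(data: bytes):
    """Return a list of (type, offset, size) for each top-level atom."""
    atoms = []
    offset = 0
    while offset + 8 <= len(data):
        size, kind = struct.unpack(">I4s", data[offset:offset + 8])
        if size == 0:
            # A size of 0 means the atom runs to the end of the file.
            size = len(data) - offset
        elif size == 1:
            # A size of 1 means a 64-bit size follows the type.
            if offset + 16 > len(data):
                break
            size = struct.unpack(">Q", data[offset + 8:offset + 16])[0]
        if size < 8:
            break  # not a plausible atom, stop walking
        atoms.append((kind.decode("latin-1"), offset, size))
        offset += size
    return atoms


def has_moov(data: bytes) -> bool:
    """True if a 'moov' atom is found at the top level."""
    return any(kind == "moov" for kind, _, _ in top_level_atoms(data))
```

If `has_moov` comes back False for the data fork of a movie that used to play fine on an old Macintosh, the movie header is most likely sitting somewhere else.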

Enter Macintosh Resource Forks.

Original Quicktime files stored the MOOV atom in a resource fork.

Let’s take a closer look at one of these files.

derez Wildebeest 
data 'moov' (128) {
	$"0000 0465 6D6F 6F76 0000 006C 6D76 6864"            /* ...emoov...lmvhd */
	$"0000 0000 E143 7EF5 E143 7EF5 0000 0258"            /* ....?C~??C~?...X */
	$"0000 1068 0001 0000 00FF 0000 0000 0000"            /* ...h.....?...... */
	$"0000 0000 0001 0000 0000 0000 0000 0000"            /* ................ */
	$"0000 0000 0001 0000 0000 0000 0000 0000"            /* ................ */
	$"0000 0000 4000 0000 0000 0000 0000 0000"            /* ....@........... */
	$"0000 0924 0000 0000 0000 0000 0000 0000"            /* ...$............ */
	$"0000 0002 0000 03D9 7472 616B 0000 005C"            /* .......?trak...\ */
	$"746B 6864 0000 000F A5EA 1D89 E143 7EF5"            /* tkhd....??.??C~? */
	$"0000 0001 0000 0000 0000 1068 0000 0000"            /* ...........h.... */
	$"0000 0000 0000 0000 0000 0000 0001 0000"            /* ................ */
	$"0000 0000 0000 0000 0000 0000 0001 0000"            /* ................ */
	$"0000 0000 0000 0000 0000 0000 4000 0000"            /* ............@... */
	$"00A0 0000 0078 0000 0000 0024 6564 7473"            /* .?...x.....$edts */
	$"0000 001C 656C 7374 0000 0000 0000 0001"            /* ....elst........ */
	$"0000 1068 0000 0000 0001 0000 0000 0351"            /* ...h...........Q */
	$"6D64 6961 0000 0020 6D64 6864 0000 0000"            /* mdia... mdhd.... */
	$"E143 7EF5 E143 7EF5 0000 0258 0000 1068"            /* ?C~??C~?...X...h */
	$"0000 003C 0000 003A 6864 6C72 0000 0000"            /* ...<...:hdlr.... */
	$"6D68 6C72 7669 6465 6170 706C 4000 0000"            /* mhlrvideappl@... */
	$"0001 002C 1941 7070 6C65 2056 6964 656F"            /* ...,.Apple Video */
	$"204D 6564 6961 2048 616E 646C 6572 0000"            /*  Media Handler.. */

The MooV atom is in the Resource Fork. Apple explains why they did it this way.

FILE MOVIE HEADER

Note: the header is safer when stored at the beginning of the file or in the HFS resource fork as type ‘moov’; ID any. The advantage of using another file fork is that the header can be lengthened without recalculating the sample offsets or new header must be written at the end of the file.
QTM-Layout

If you are playing back a movie on an older Macintosh using an earlier version of Quicktime, you won’t have any issues, but if you plan on playing the movie on a newer system or try and preserve the file, then we run into problems. Especially if the file is moved off of the HFS disk onto a system which doesn’t maintain the resource fork. Then you are stuck with just the data with no way to interpret the movie file.

Solutions.

One solution you can follow is to use MacBinary or AppleSingle to combine the Resource Fork and Data Fork together into one file. You are left with a different format, but one which can be preserved and reverted back to the original when needed.

Another way is to create a Single-Fork Movie file, aka a normal QuickTime file.

“single-fork movie file – A QuickTime movie file
that stores both the movie data and the movie
resource in the data fork of the movie file. You
can use single-fork movie files to ease the
exchange of QuickTime movie data between
Macintosh computers and other computer
systems.”
Inside Macintosh – QuickTime

Creating a Single-Fork can be accomplished a couple different ways. One is to use an older version of QuickTime to “Save As” to a self contained file with the box checked to allow playback on a “non-Apple” computer.

Another method is to use a simple utility called Single Fork Flattener. I found a copy on an old QuickTime disc and uploaded it to Macintosh Garden if you want to try it out. No QuickTime needed, just open the file and it updates it to include the MooV resource. There is also a tool called FlattenMooV.

Once combined, MediaInfo now sees a complete QuickTime file which VLC can play!

mediainfo Wildebeest2 
General
Complete name                            : Wildebeest
Format                                   : QuickTime
Format/Info                              : Original Apple specifications
File size                                : 565 KiB
Duration                                 : 7 s 0 ms
Overall bit rate                         : 661 kb/s
Frame rate                               : 10.000 FPS
Encoded date                             : 2023-10-02 14:15:15 UTC
Tagged date                              : 2023-10-02 14:15:15 UTC
Writing library                          : Apple QuickTime
FileExtension_Invalid                    : braw mov qt

Video
ID                                       : 0
Format                                   : Road Pizza
Codec ID                                 : rpza
Duration                                 : 7 s 0 ms
Bit rate                                 : 659 kb/s
Width                                    : 160 pixels
Height                                   : 120 pixels
Display aspect ratio                     : 4:3
Frame rate mode                          : Constant
Frame rate                               : 10.000 FPS
Bits/(Pixel*Frame)                       : 3.434
Stream size                              : 563 KiB (100%)
Language                                 : English
Encoded date                             : 1992-03-16 09:40:25 UTC
Tagged date                              : 2023-10-02 14:15:15 UTC
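The 7 second duration reported by MediaInfo can be cross-checked against the first 36 bytes of the 'moov' resource shown in the DeRez output earlier. The Python sketch below assumes the "mvhd" atom is the first thing inside "moov" (as it is in that dump) and uses the version 0 header layout: version and flags, creation time, modification time, time scale, then duration. The function name is my own.

```python
import struct

# First 36 bytes of the 'moov' resource, copied from the DeRez dump.
MOOV_START = bytes.fromhex(
    "000004656D6F6F760000006C6D766864"
    "00000000E1437EF5E1437EF500000258"
    "00001068"
)


def mvhd_duration_seconds(moov: bytes) -> float:
    """Read time scale and duration from a version 0 'mvhd' atom."""
    _size, moov_type, _mvhd_size, mvhd_type = struct.unpack(">I4sI4s", moov[:16])
    if moov_type != b"moov" or mvhd_type != b"mvhd":
        raise ValueError("expected a 'moov' atom starting with 'mvhd'")
    # version/flags, creation time, modification time, time scale, duration
    _vf, _created, _modified, time_scale, duration = struct.unpack(
        ">5I", moov[16:36]
    )
    return duration / time_scale


print(mvhd_duration_seconds(MOOV_START))  # 4200 / 600 = 7.0
```

The time scale is 0x258 (600 units per second) and the duration is 0x1068 (4200 units), which works out to the same 7 seconds.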

I was hoping I could find a method to use a modern tool to combine into a Single-Fork file, ~~but nothing yet~~. I did find a C++ source that when compiled does indeed merge the two forks together, which in this case merges the MooV atom at the end of the file. It’s called qtmerge. QuickTime 7 is your best bet for a GUI tool which works on recent MacOS, but not the last couple of versions. There is a reference out there to a tool called RezWack, but I have been unable to verify it.

BINHEX

September 29, 2023 by Thor

Working with files in today’s world is much different than before. Today getting files back and forth from the cloud or through email is relatively easy, unlike the early days when we used FTP sites and needed to encode our data to properly transfer it. I remember using an FTP program on my old Mac called Fetch. We had to determine if the content was to be transferred as text or binary.

Picking the right encoding was critical to getting the content transferred correctly, this was even more critical when working with Macintosh files which needed a resource fork and/or finder attributes to work properly. In those cases a MacBinary or BinHex file was required! Fetch would automatically identify those formats and decode them for you.

If you need a refresher on MacBinary and AppleSingle, you can view my iPres 2022 presentation.

One format I didn’t spend much if any time on is the BinHex format. BinHex was a format born out of necessity to move files back and forth across the World Wide Web, bulletin boards, AOL, CompuServe, and the like. The glossary of the FTP program Fetch describes BinHex as:

BinHex (sometimes called BinHex4) is a format for representing a Macintosh file in text form. The Macintosh file is converted to a series of lines, each made up of letters, numbers, and punctuation. Because BinHex files are simply text, they can be sent through most electronic mail systems and stored on most computers. However the conversion to text makes the file larger, so it takes longer to transmit a file in BinHex format than if the file was represented some other way. The suffix “.hqx” usually indicates a BinHex format file.

You can still find many of these HQX files floating around the interwebs and on older CDs from the 1990’s. One such CD recently came into my possession. I managed to get a copy of the book “Internet File Formats“, by Tim Kientzie. It came with a CD-ROM with lots of goodies included. Some sample files, specifications, and software. The disc itself is an ISO 9660 partitioned disc, but includes a few Macintosh formats, so the author put many of the software files in the HQX format to maintain the much needed resource fork Macintosh applications need in order to run.

I initially ran the whole disc through DROID to get an idea what was on the disc and if any sample formats were unidentified (something I do regularly), and found the majority of the HQX files didn’t identify as they should have, as PRONOM PUID x-fmt/416. The signature is an older one, from 2010, but since the format isn’t updated anymore it should be solid. Or so I thought.

Since BinHex files are encoded as text, let’s take a look at a couple of these from the disc which didn’t identify.

The PRONOM signature currently is:

File extension: hqx	
Name	BinHex Binary Text
Description	Header: (This file must be converted with BinHex
Byte sequences	
Position type	Absolute from BOF
Offset	0	 
Value	28546869732066696C65206D75737420626520636F6E76657274656420776974682042696E486578

That “Value” listed in hexadecimal decodes to: “(This file must be converted with BinHex” as listed in the description. We can see this line in the file above, but the signature assumes the value begins at offset 0 from the beginning of the file. So it’s looking for that value at the start of the file, but this file seems to have some additional text before the value. What do the specs say?

The BinHex 4.0 format was created in 1985 and defined in RFC 1741.

   The whole file is considered as a stream of bits.  This stream will
   be divided in blocks of 6 bits and then converted to one of 64
   characters contained in a table.  The characters in this table have
   been chosen for maximum noise protection.  The format will start
   with a ":" (first character on a line) and end with a ":".
   There will be a maximum of 64 characters on a line.  It must be
   preceded, by this comment, starting in column 1 (it does not start
   in column 1 in this document):

    (This file must be converted with BinHex 4.0)

   Any text before this comment is to be ignored.

   The characters used is:

    !"#$%&'()*+,- 012345689@ABCDEFGHIJKLMNPQRSTUVXYZ[`abcdefhijklmpqr

Ok, so in the specs we can see the “Value” string must be there, but according to the specification, any text before this comment is to be ignored. So adding some instructions and even an email header at the beginning is ok, as long as the value string is there right before the encoded data.

We also learn a couple of interesting things. The first character of the first line after the string should be a “:” and the last line should end with a “:” as well. That could help make the signature more solid. We also learn there is a maximum of 64 characters per line. The last line will probably not be the full 64 characters, but the previous lines should be. I wonder if we could use this fixed position from the initial “:” to add even more strength to the signature.

So an updated PRONOM signature might look like:

BOF: {0-4084}28546869732066696C65206D75737420626520636F6E76657274656420776974682042696E486578{6-9}3A

EOF: 3A (Max Offset 64)

Allowing up to 4,084 bytes at the beginning makes room for additional text. This value worked for my samples, but there could be others out there with more. The {6-9} bytes between the string and the colon account for the rest of the comment line (“ 4.0)”) and the various ways newlines are encoded. Sometimes it is one “0A” byte, other times it is “0D”, and sometimes it’s both. After testing, I found that adding values to the signature to account for the 64-character lines can fail if the file has only one line, so I left them out.

The EOF should just be the colon (3A), but I found many of my samples had various line endings and other random characters. Hence the 64 bytes for max offset.
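To test the idea outside of DROID, the proposed signature can be approximated in Python. This is only a sketch: the function name is made up, the regular expression mirrors the BOF sequence above, and the EOF check treats “Max Offset 64” as the colon falling somewhere in the last 65 bytes, which is my reading of it.

```python
import re

# The "Value" from the existing PRONOM signature.
MARKER = bytes.fromhex(
    "28546869732066696C65206D75737420626520636F6E76657274656420"
    "776974682042696E486578"
)

# BOF: {0-4084} MARKER {6-9} 3A
BOF_PATTERN = re.compile(
    rb"\A.{0,4084}?" + re.escape(MARKER) + rb".{6,9}?:", re.DOTALL
)


def looks_like_binhex(raw: bytes) -> bool:
    """Approximate the proposed BOF and EOF byte sequences."""
    if BOF_PATTERN.match(raw) is None:
        return False
    # EOF: a colon (3A) within the last 65 bytes of the file.
    return b":" in raw[-65:]
```

Files with an email header or instructions before the comment line still match, as long as the comment starts within the first 4,084 bytes.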

Also, the current PRONOM entry doesn’t include the Mime-Type, which is: “application/mac-binhex40”

Hopefully this update will add some strength to the signature and follow the specification closer. The new signature even works on files with extra content at the beginning!


There are a number of software titles you can use to encode and decode a BinHex file. On a modern Mac, try using The Unarchiver or Stuffit Expander. From the command line, you can use the macutil library or the CLI version of The Unarchiver, although MacOS also has a built-in utility to decode BinHex files. If you are using a classic version of Macintosh OS, you can find a number of utilities on Macintosh Garden.

Oh, and also, the CD-ROM I mentioned earlier has a few “fun” features. Not sure if they are on purpose or if errors were made during mastering, but a few filenames have some hidden extra characters and one folder puts any tool traversing the directory into a loop, even droid. Have fun!

Gone in a Flash

September 22, 2023 by Thor Leave a comment

This week I am at the annual iPres digital preservation conference. It is an amazing week of meeting colleagues and old friends who share the same passion for digital preservation. Outside of this community and my co-workers, talking about file formats and digital preservation usually bores people to death and I can hear some of them mumble under their breath, “nerd”! A term I am happy to accept.

At the conference, which is in lovely Urbana-Champaign Illinois this year, I am trying to soak in all the amazing talks and conversations about the challenges facing our work. There were a couple great workshops on Persistent Identifiers and Digital Object Storage Criteria. The presentations I made were of course on File Formats, documentation, and obsolescence. One talk before my panel conversation was about the ubiquitous Adobe Flash format.

The paper, “Around for Decades, Gone in a Flash: How we dealt with Flash objects and the National Archives of the Netherlands” was presented by Lotte Wijsman and Marin Rappard. They knew they had flash objects in their web archives and wanted to go through the process of how they might be preserved and accessed. They started out looking for any files with “FLA”, “SWF”, and “FLV” as extensions. This proved problematic as there were references to those extensions within other documents and objects. They then used DROID to identify the flash formats. “SWF” has quite a number of format PUID’s.

PUID	Format Name	Format Version	Extension
fmt/104	Macromedia Flash	1	swf
fmt/105	Macromedia Flash	2	swf
fmt/106	Macromedia Flash	3	swf
fmt/107	Macromedia Flash	4	swf
fmt/108	Macromedia Flash	5	swf
fmt/109	Macromedia Flash	6	swf
fmt/110	Macromedia Flash	7	swf
fmt/505	Adobe Flash	8	swf
fmt/506	Adobe Flash	9	swf
fmt/507	Adobe Flash	10	swf
fmt/757	Adobe Flash	11	swf
fmt/758	Adobe Flash	12	swf
fmt/759	Adobe Flash	13	swf
fmt/760	Adobe Flash	14	swf
fmt/761	Adobe Flash	15	swf
fmt/762	Adobe Flash	16	swf
fmt/763	Adobe Flash	17	swf
fmt/764	Adobe Flash	18	swf
fmt/765	Adobe Flash	19	swf
fmt/766	Adobe Flash	20	swf
fmt/767	Adobe Flash	21	swf
fmt/768	Adobe Flash	22	swf
fmt/769	Adobe Flash	23	swf
fmt/770	Adobe Flash	24	swf
fmt/771	Adobe Flash	25	swf
fmt/772	Adobe Flash	26	swf
fmt/773	Adobe Flash	27	swf
fmt/774	Adobe Flash	28	swf
fmt/775	Adobe Flash	29	swf
fmt/776	Adobe Flash	30	swf

Even the Macromedia/Adobe Flash Video format has a PRONOM PUID, x-fmt/382.

The format missing from PRONOM is the FLA format. FLA is the native format for Macromedia/Adobe Flash for saving the source project of your Flash document. SWF files are compiled from the FLA source. This means the SWF will be the most common format found on the web and in public places, but the FLA format might be more often found on local drives and working folders.

The Flash format and software were actually created by FutureWave Software in 1996 as FutureSplash Animator, but it was bought by Macromedia later that year and Flash 1.0 was born. FutureSplash used the extension .SPA, but Macromedia changed it to FLA.

The format was initially based on the Microsoft Compound File Format or the OLE container format.

oledir Flash4-S01.fla 
oledir 0.54 - http://decalage.info/python/oletools
OLE directory entries in file Flash4-S01.fla:
----+------+-------+----------------------+-----+-----+-----+--------+------
id  |Status|Type   |Name                  |Left |Right|Child|1st Sect|Size  
----+------+-------+----------------------+-----+-----+-----+--------+------
0   |<Used>|Root   |Root Entry            |-    |-    |1    |5       |4416  
1   |<Used>|Stream |Contents              |2    |-    |-    |6       |4013  
2   |<Used>|Stream |Page 1                |-    |-    |-    |0       |329   
3   |unused|Empty  |                      |-    |-    |-    |0       |0     
----+----------------------------+------+--------------------------------------
id  |Name                        |Size  |CLSID                                 
----+----------------------------+------+--------------------------------------
0   |Root Entry                  |-     |597CAA70-72AA-11CF-831E-524153480000  
1   |Contents                    |4013  |                                      
2   |Page 1                      |329   |

The FLA format stayed with OLE until Adobe Flash CS5, when the format changed to use a ZIP container to store all the content.

Flash5.5-S01.fla
Type = zip
Physical Size = 216632

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2022-07-09 11:57:46 .....           25           25  mimetype
2022-07-09 11:57:46 .....            9            9  Flash5.5-S01.xfl
2022-07-09 11:57:46 D....            0            0  LIBRARY
2022-07-09 11:57:46 D....            0            0  META-INF
2022-07-09 11:57:46 .....        49267         3936  DOMDocument.xml
2022-07-09 11:57:48 .....         9735         1103  META-INF/metadata.xml
2022-07-09 11:57:48 .....         8093         2222  PublishSettings.xml
2022-07-09 11:57:48 .....            0            0  MobileSettings.xml
2022-07-09 11:57:48 D....            0            0  LIBRARY/Mouth shape graphic symbols
2022-07-09 11:57:48 D....            0            0  LIBRARY/Voice
2022-07-09 11:57:48 .....       151006       151006  bin/M 1 1252032698.dat
2022-07-09 11:57:48 .....        99707        15311  LIBRARY/mouth.xml
2022-07-09 11:57:48 .....        16510         4534  LIBRARY/Mouth shape graphic symbols/A I.xml
2022-07-09 11:57:48 .....        14334         4086  LIBRARY/Mouth shape graphic symbols/C D G K N R S Th Y Z.xml
2022-07-09 11:57:48 .....        14531         4040  LIBRARY/Mouth shape graphic symbols/E.xml
2022-07-09 11:57:48 .....        15846         4007  LIBRARY/Mouth shape graphic symbols/F V D Th.xml
2022-07-09 11:57:48 .....        13093         3542  LIBRARY/Mouth shape graphic symbols/L D Th.xml
2022-07-09 11:57:48 .....         2106          751  LIBRARY/Mouth shape graphic symbols/M B P Closed.xml
2022-07-09 11:57:48 .....        14130         3949  LIBRARY/Mouth shape graphic symbols/O.xml
2022-07-09 11:57:48 .....        11082         2951  LIBRARY/Mouth shape graphic symbols/Open_Rest.xml
2022-07-09 11:57:48 .....        14847         4066  LIBRARY/Mouth shape graphic symbols/U.xml
2022-07-09 11:57:48 .....         8139         2202  LIBRARY/Mouth shape graphic symbols/W Q.xml
2022-07-09 11:57:48 .....        15768         3914  LIBRARY/panda.xml
2022-07-09 11:57:48 .....        10477         1064  LIBRARY/sample graph.xml
2022-07-09 11:57:48 .....          538          538  bin/SymDepend.cache
------------------- ----- ------------ ------------  ------------------------
2022-07-09 11:57:48             469243       213256  21 files, 4 folders

The move to a ZIP container included a new format, XFL. This XFL file is a simple text file with the text “PROXY-CS5”. In the DOMDocument.xml file we find an XML namespace, xmlns="http://ns.adobe.com/xfl/2008/" and a version of the XFL structure, xflVersion="2.1".

This ZIP compressed FLA file is still being used in the current Adobe Animate software, which no longer uses the flash technology and uses more modern web formats like HTML5 to display the animations.
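
Since the extension stayed FLA through both eras, a quick look at the first few bytes tells you which container you are holding. This is only a sketch with a function name of my own choosing; the two magic numbers are the standard OLE compound file header and the ZIP local file header.

```python
# Magic numbers for the two containers an FLA file has used.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE / Compound File header
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header


def fla_container(data: bytes) -> str:
    """Return 'ole', 'zip', or 'unknown' based on the leading bytes only."""
    if data.startswith(OLE_MAGIC):
        return "ole"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

Something like fla_container(open("Flash4-S01.fla", "rb").read(8)) should be enough to sort a folder of samples into the two groups before looking any deeper.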

I took each version and made a PRONOM signature, which you can find here with samples. These container signatures should cover all the major changes for the format, but there is a problem……..

Listing archive: Flash5.5-S01v5.fla

--
Path = Flash5.5-S01v5.fla
Type = zip
ERRORS:
Headers Error
Physical Size = 216581
Embedded Stub Size = 63
Characteristics = Local

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2022-07-09 11:57:46 .....           25           25  mimetype
2022-07-09 11:57:46 D....            0            0  LIBRARY
2022-07-09 11:57:46 D....            0            0  META-INF
2022-07-09 11:57:46 .....        48556         3742  DOMDocument.xml
2022-07-09 11:57:48 .....        10133         1112  META-INF/metadata.xml
2022-07-09 11:57:48 .....         8115         2219  PublishSettings.xml
2022-07-09 11:57:48 .....            0            0  MobileSettings.xml
2022-07-09 11:57:48 D....            0            0  LIBRARY/Mouth shape graphic symbols
2022-07-09 11:57:48 D....            0            0  LIBRARY/Voice
2022-07-09 11:57:48 .....       151006       151006  bin/M 1 1252032698.dat
2022-07-09 11:57:48 .....        99551        15319  LIBRARY/mouth.xml
2022-07-09 11:57:48 .....        16580         4536  LIBRARY/Mouth shape graphic symbols/A I.xml
2022-07-09 11:57:48 .....        14404         4089  LIBRARY/Mouth shape graphic symbols/C D G K N R S Th Y Z.xml
2022-07-09 11:57:48 .....        14531         4044  LIBRARY/Mouth shape graphic symbols/E.xml
2022-07-09 11:57:48 .....        15848         4008  LIBRARY/Mouth shape graphic symbols/F V D Th.xml
2022-07-09 11:57:48 .....        13024         3546  LIBRARY/Mouth shape graphic symbols/L D Th.xml
2022-07-09 11:57:48 .....         2106          752  LIBRARY/Mouth shape graphic symbols/M B P Closed.xml
2022-07-09 11:57:48 .....        14200         3955  LIBRARY/Mouth shape graphic symbols/O.xml
2022-07-09 11:57:48 .....        11152         2963  LIBRARY/Mouth shape graphic symbols/Open_Rest.xml
2022-07-09 11:57:48 .....        14777         4069  LIBRARY/Mouth shape graphic symbols/U.xml
2022-07-09 11:57:48 .....         8287         2228  LIBRARY/Mouth shape graphic symbols/W Q.xml
2022-07-09 11:57:48 .....        15768         3914  LIBRARY/panda.xml
2022-07-09 11:57:48 .....        10477         1064  LIBRARY/sample graph.xml
2022-07-09 11:57:48 .....          538          538  bin/SymDepend.cache
2022-07-09 11:57:46 .....           25           25  mimetype
2022-07-09 11:58:18 .....            9            9  Flash5.5-S01v5.xfl
------------------- ----- ------------ ------------  ------------------------
2022-07-09 11:58:18             469112       213163  22 files, 4 folders

It turns out the majority of the samples I have from many versions of Adobe Flash after CS5 have a ZIP header error. When using the new signatures in DROID, the samples with the header errors will fail in DROID’s ZIP library processing. The DROID log shows this issue:

Could not process the potential container format (ZIP): file:///Flash5.5-S01v5.fla	
Expected 25 more entries in the Central Directory!

The Central Directory header in a ZIP file is quite important to the proper function of the ZIP container. Wikipedia has a great explanation of the header. You may notice in the listing above the file “mimetype” is shown twice which is probably the extra entries the parser wasn’t expecting.
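
To see the mismatch for yourself, you can compare the number of entries the End of Central Directory record claims with the entries that can actually be walked in the Central Directory. The sketch below is my own rough take, not what DROID does internally. It assumes a plain (non-ZIP64) archive with no data prepended, so the stored offsets are taken at face value, and the function name is hypothetical.

```python
import struct

EOCD_SIG = b"PK\x05\x06"  # End of Central Directory record
CDH_SIG = b"PK\x01\x02"   # Central Directory file header


def central_directory_counts(data: bytes):
    """Return (entries declared in the EOCD, names found by walking the directory)."""
    eocd = data.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("no End of Central Directory record found")
    (declared,) = struct.unpack_from("<H", data, eocd + 10)   # total entry count
    (cd_offset,) = struct.unpack_from("<I", data, eocd + 16)  # start of directory
    names = []
    pos = cd_offset
    # Each header is 46 fixed bytes followed by name, extra field, and comment.
    while pos + 46 <= len(data) and data[pos:pos + 4] == CDH_SIG:
        name_len, extra_len, comment_len = struct.unpack_from("<HHH", data, pos + 28)
        names.append(data[pos + 46:pos + 46 + name_len].decode("cp437"))
        pos += 46 + name_len + extra_len + comment_len
    return declared, names
```

If the declared count and the length of the returned list disagree, or a name such as “mimetype” appears more than once, you are probably looking at the same kind of file that trips up the container identification.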

So currently the identification of the majority of these FLA formats is on hold until a way is discovered to ignore the error and continue the container identification in DROID.

TIFF

September 15, 2023 by Thor 1 Comment

Let’s talk TIFF, or Tagged Image File Format. It is well documented and accepted by the community. The format has been around since 1986, first developed by Aldus as an image format for scanners. The TIFF format is now used worldwide as a preferred format for scanning and preservation of cultural heritage objects.

As amazing as the format is, there are a few features of the format which can be a preservation risk. I want to focus on three of those risks.

The Tagged Image File Format has a well known header:

A TIFF file begins with an 8-byte image file header, containing the following
information:
Bytes 0-1: The byte order used within the file. Legal values are:
“II” (4949.H) LSB (IBM)
“MM” (4D4D.H) MSB (Mac)
Bytes 2-3 An arbitrary but carefully chosen number (42).
Bytes 4-7 The offset (in bytes) of the first IFD.

Putting this poster of the TIFF structure in your office will impress your co-workers, guaranteed. Thanks Ange!

The three risks I have been pondering lately are:

Multiple IFD’s
Metadata
DNG format

TIFF version 6.0 was released in 1992 and is the most recent version, although vendors are free to add their own private tags. In 1995 Adobe published an addendum with some additions for use with PageMaker.

One of the main features of the TIFF format is its ability to hold multiple pages. In Adobe’s words:

TIFF has always supported what amounts to a singly linked list of IFD’s in a single TIFF file, via the “next IFD pointer,” though most applications currently ignore any IFD beyond the first one. Probably the best use for a linked list of IFD’s is when you want to store multiple different but related images in the same file—a ‘burst’ of images from a camera, for example.
Adobe PageMaker® 6.0 TIFF Technical Notes

Take note of the phrase “most applications currently ignore any IFD beyond the first one”. Software like Adobe Photoshop will ignore any IFD beyond the first one. Even worse, Photoshop won’t even mention there are additional IFD’s. I have used many document scanners which default to multipage TIFF capture and have lost pages because of this. As a result, I have always built my workflows around single page TIFF’s for all scanning and we check against this as a rule.
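
A check like that does not need a full TIFF library. Following the header layout quoted earlier, a rough Python sketch can walk the “next IFD pointer” chain and count the IFDs. This is my own illustration rather than the tool we actually use; it assumes classic TIFF (not BigTIFF) and does not look inside SubIFD tags.

```python
import struct


def count_ifds(data: bytes) -> int:
    """Count the IFDs in the top-level chain of a classic TIFF file."""
    if data[:2] == b"II":
        endian = "<"  # little-endian byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian byte order
    else:
        raise ValueError("not a TIFF byte order mark")
    magic, offset = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF header")
    seen = set()
    while offset != 0:
        if offset in seen:
            raise ValueError("IFD chain loops back on itself")
        seen.add(offset)
        # An IFD is a 2-byte entry count, 12 bytes per entry, then the next pointer.
        (entries,) = struct.unpack_from(endian + "H", data, offset)
        (offset,) = struct.unpack_from(endian + "I", data, offset + 2 + 12 * entries)
    return len(seen)
```

Anything that returns more than 1 deserves a second look, whether the extra IFD turns out to be a lost page or just a thumbnail.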

What also makes this hard is how some capture software uses additional IFD’s. CaptureOne is a popular imaging software used by photographers and cultural heritage institutions. We have used it to connect to our PhaseOne cameras for capture of books and other flat objects. By default the software exports the final TIFF image with a thumbnail.

With the “No Thumbnail” unchecked we get this TIFF structure:

identify _MG_0193.tif 
_MG_0193.tif[0] TIFF 3456x5184 3456x5184+0+0 8-bit sRGB 51.3136MiB 0.030u 0:00.026
_MG_0193.tif[1] TIFF 107x160 107x160+0+0 8-bit sRGB 0.000u 0:00.007

 <IFD0:ImageWidth>3456</IFD0:ImageWidth>
 <IFD0:ImageHeight>5184</IFD0:ImageHeight>
 <IFD1:SubfileType>Reduced-resolution image</IFD1:SubfileType>
 <IFD1:ImageWidth>107</IFD1:ImageWidth>
 <IFD1:ImageHeight>160</IFD1:ImageHeight>
 <IFD1:BitsPerSample>8 8 8</IFD1:BitsPerSample>

So ImageMagick identifies two pages, 0 and 1, with the second at a much smaller resolution than the first. ExifTool reports back IFD0 and IFD1, with IFD1 having a SubfileType of Reduced-resolution image. Makes sense, it is a thumbnail. In looking at the specifications for TIFF 6.0, I can find no mention of the word “thumbnail”, but the specification does make mention of “reduced resolution” images:

If multiple subfiles are written, the first one must be the full-resolution image. Subsequent images, such as reduced-resolution images, may be in any order in the TIFF file.

The specification also gives us this warning:

TIFF readers must be prepared for multiple images (subfiles) per TIFF file, although they are not required to do anything with images after the first one.

Scary to think about how a reader is not required to do anything, not even warn against multiple IFD’s (Subfiles).

The EXIF specifications seem to expand on this through attributes:

Attribute information can be recorded in 2 IFDs (0th IFD, 1st IFD) following the TIFF structure, including the File Header. The 0th IFD records compressed image attributes (the image itself). The 1st IFD may be used for thumbnail images.
Page 97 of EXIF Specification

Take a look at the information and Figure 6 on page 21-22 in the EXIF specification.

Adobe early on decided to use their own tags for thumbnail data. Since Photoshop 5, Adobe has stored the thumbnail in Tag 1036.

 1036 Photoshop Thumbnail             : (Binary data 4625 bytes, use -b option to extract)

There is another TIFF structure sometimes used in older FAX compressed multipage TIFFs and now used in the DNG Specification. The SubIFD tag was writable using the libtiff “thumbnail” tool, but is now deprecated. Originally described in the TIFF/EP specification, DNG files use SubIFD trees.

DNG files are often talked about in the same way TIFF files are, and many tools handle both seamlessly. One of the major differences is that DNG files switch their IFD use. IFD0 is often the reduced-resolution thumbnail and SubIFD the full-resolution image.

<IFD0:SubfileType>Reduced-resolution image</IFD0:SubfileType>
<IFD0:ImageWidth>256</IFD0:ImageWidth>
<IFD0:ImageHeight>171</IFD0:ImageHeight> 

<SubIFD:SubfileType>Full-resolution image</SubIFD:SubfileType>
<SubIFD:ImageWidth>3516</SubIFD:ImageWidth>
<SubIFD:ImageHeight>2328</SubIFD:ImageHeight>

This can cause issues when trying to extract technical metadata from images, as knowing which IFD holds the main image details requires a bit of work. I’ll save DNG for another blog post.

TIFF Metadata is a vital part of preservation. The metadata can provide technical properties of the file along with some descriptive information. It amazes me how much the embedded metadata can vary from a scanner or camera capture device. The digitization lab I worked in for years had scanners from Epson, Fujitsu, Canon and others, along with cameras made by Canon, PhaseOne, and Copibooks. Each one had a vastly different set of metadata using different standards. Even when each workflow produced final uncompressed TIFF images, they all varied in metadata.

The TIFF images with the least amount of metadata were from the Epson scanners. When using the free Epson Scan software, not a single metadata field was embedded: no dates, scanner model or manufacturer. More was embedded when you used the Silverfast professional software included with each Epson, but even then if you didn’t add any IPTC fields, the metadata was limited.

The most metadata came from the camera systems, especially the PhaseOne/CaptureOne systems. Even though it produced the most and had valuable properties, there were some issues. I already discussed the thumbnail issue, but PhaseOne decided they wanted to change how some of the tags were used.

CaptureOne has quite the list of white balance options, which is great for the photographer, but not so great for adhering to the TIFF standard.

According to the EXIF TIFF Specification, there are only two values allowed for White Balance, Auto or Manual. A CaptureOne produced TIFF will have this value if Auto or Manual are not chosen:

41987 White Balance                   : Unknown (5)
37384 Light Source                    : Other

The different lighting situations should be identified using the “Light Source” 37384 tag, but alas they chose to add to white balance instead. When I asked about this, they responded that they had requested this update to the TIFF spec, but the maintainers weren’t willing, so they took matters into their own hands. You can read the conversation on the JHOVE issues page.

The TIFF format is widely accepted in the Cultural Heritage community as a preferred preservation format. The specification is well understood and documented. I just hope we can continue to openly discuss issues that arise in preservation which add risk to our collections. Some issues are minor compared to others. Sometimes it’s the tools we use to validate formats like TIFF which are wrong and need to be corrected. Let’s talk more about these issues and how to manage them.

Apple Package Format

September 1, 2023 by Thor Leave a comment

Let’s talk about Apple’s iWork software. Apple’s Office Suite of applications was first released in 2005 and provided a WordProcessor (Pages), Presentations (Keynote), and a little later, Spreadsheet (Numbers). They are exclusive to the Macintosh and iOS devices.

iWork was released in a few different versions. They get a little confusing as each application has its own version which all seemed to unify and stabilize in 2020. Here is a matrix of major versions.

Version	Package or ZIP
iWork ’05	Package
iWork ’06	Package
iWork ’08	Package
iWork ’09	ZIP
iWork 2013	Package
iWork 2014	ZIP
iWork 2019	ZIP
iWork 2020	ZIP

You may already be aware but MacOS can sometimes be weird. I use the term weird in a loving, sometimes proud way, but I admit, there were some “odd” choices made in regards to how applications and documents are used and stored on a Mac.

On early Macintosh computers Apple used an interesting method of storing resources for applications and some file formats. The Resource Fork for an application contained all the “resources” needed to run in the operating system. It would contain all the icons, warning screens, graphics, sounds, etc. This held true until Mac OS X came along and then Apple started using a bundle or package format. Still in use today, what appears to be a single file or application is actually a folder of all the resources needed to run the application.

By right clicking or control clicking on the icon you can open the folder and see all the contents which make up the Application.

Nifty, right? MacOS knows which extensions to treat as a package. If you were to move the application over to another system it would be a folder with the extension “.app”.

For an application I can see how this makes sense as it will only execute in the MacOS environment. The problem comes in when you use the same package method for the documents the application creates.

Contents of Pages version 1 sample file.

So instead of a single “file” with a bytestream, you get a folder of files which make up the file format. Here is Apple’s description:

Document Packages

If your document file formats are getting too complex to manage because of several disparate types of data, you might consider adopting a package format for your documents. Document packages give the illusion of a single document to users but provide you with flexibility in how you store the document data internally. Especially if you use several different types of standard data formats, such as JPEG, GIF, or XML, document packages make accessing and managing that data much easier.

Apple actually defines two similar methods:

Although bundles and packages are sometimes referred to interchangeably, they actually represent very distinct concepts:

A package is any directory that the Finder presents to the user as if it were a single file.

A bundle is a directory with a standardized hierarchical structure that holds executable code and the resources used by that code.

A couple years ago a processed digital collection made its way down to me. It had been processed by a new digital archivist and when I went to prepare the collection for preservation, I found a folder with the extension .pages and inside the folder a whole directory of files. Many of which they had renamed and arranged. Needless to say, I had to track down the original disk so I could properly preserve the file.

So looking back at the earlier table, iWork switched back and forth between the package format and a ZIP container. For preservation purposes, the ZIP container is easier to maintain outside the MacOS. Let’s look a little closer at each. If you would like to follow along I have copied a few samples onto a hybrid ISO.

iWork ’05 through iWork ’08 used the same package format and structure. Because they are a package format, they are difficult to preserve as original files. I suppose you could zip them up, but probably the best option is to open with a current version of Pages and save to the newer ZIP container format.

tree iWork08/Keynote-06.key 
├── Contents
│   └── PkgInfo
├── QuickLook
│   └── Thumbnail.jpg
├── index.apxl.gz
└── theme-files
    ├── Blue 2.jpg
    ├── Blue 2.tif
    ├── Cool Gray-2.jpg
    ├── Cool Gray.tif
    ├── Green-8.jpg
    ├── Green.tif
    ├── Headlines_bullet.pdf
    ├── Headlines_star.pdf
    ├── Orange 2.tif
    ├── Orange_2.jpg
    ├── Purple-6.jpg
    ├── Purple.tif
    ├── Red.jpg
    ├── Red.tif
    ├── endpoints.pdf
    └── headlines_hi-res.jpg

iWork ’09 changed this practice. The documents saved from Pages, Keynote, and Numbers were contained in a ZIP file and can be identified using the PRONOM registry container signatures.

filename : 'iWork 2013/Pages2013-Sample09.pages'
filesize : 105900
modified : 2019-11-21T20:36:00-07:00
matches  :
  - ns      : 'pronom'
    id      : 'fmt/1439'
    format  : 'Apple iWork Pages'
    version : '09'
    class   : 'Word Processor'
    basis   : 'extension match pages; container name index.xml with byte match at 195, 76'

Sample09.pages
Type = zip
WARNINGS:
Headers Error
Physical Size = 105900

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:36:00 .....       364773        22413  index.xml
2019-11-21 20:36:00 .....         7007         7007  Hardcover_bullet_black.png
2019-11-21 20:36:00 .....        69176        69176  Simple_Noise_2x.jpg
2019-11-21 20:36:00 .....          232          232  buildVersionHistory.plist
2019-11-21 20:36:00 .....         6406         6406  QuickLook/Thumbnail.png
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:36:00             447594       105234  5 files

Then Apple went back to a Package format with iWork 2013, for reasons unknown. But the content and structure changed. It’s a package format with an Index.zip instead of index.xml.

Pages2013-Sample.pages
├── Data
│   └── Hardcover_bullet_black-13.png
├── Index.zip
├── Metadata
│   ├── BuildVersionHistory.plist
│   ├── DocumentIdentifier
│   └── Properties.plist
├── preview-micro.jpg
├── preview-web.jpg
└── preview.jpg

3 directories, 8 files

The ZIP within the package contains a new Apple format, IWA.

Pages2013-Sample.pages/Index.zip
Type = zip
Physical Size = 39361

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:47:14 .....         3860         3860  Index/Document.iwa
2019-11-21 20:47:14 .....           26           26  Index/Tables/DataList.iwa
2019-11-21 20:47:14 .....          336          336  Index/ViewState.iwa
2019-11-21 20:47:14 .....          160          160  Index/CalculationEngine.iwa
2019-11-21 20:47:14 .....          121          121  Index/DocumentStylesheet.iwa
2019-11-21 20:47:14 .....        31931        31931  Index/ThemeStylesheet.iwa
2019-11-21 20:47:14 .....           22           22  Index/AnnotationAuthorStorage.iwa
2019-11-21 20:47:14 .....         1889         1889  Index/Metadata.iwa
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:47:14              38345        38345  8 files

Luckily Apple came to their senses and went back to the ZIP container format for iWork 2014 and later. The container signature looks for the IWA file Apple started using with iWork 2013.

filename : 'iWork 2014/Pages2014-Sample.pages'
filesize : 66256
modified : 2019-11-22T00:03:56-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/1441'
    format  : 'Apple iWork Document'
    version : '14'
    class   : 'Presentation, Spreadsheet, Word Processor'
    basis   : 'extension match pages; container name Index/Document.iwa with byte match at 16, 6; name Metadata/Properties.plist with name only'

Path = iWork 2014/Pages2014-Sample.pages
Type = zip
Physical Size = 66256

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-22 00:03:54 .....         3930         3930  Index/Document.iwa
2019-11-22 00:03:54 .....          364          364  Index/ViewState.iwa
2019-11-22 00:03:54 .....          206          206  Index/CalculationEngine.iwa
2019-11-22 00:03:54 .....        33573        33573  Index/DocumentStylesheet.iwa
2019-11-22 00:03:54 .....           22           22  Index/AnnotationAuthorStorage.iwa
2019-11-22 00:03:54 .....           23           23  Index/DocumentMetadata.iwa
2019-11-22 00:03:54 .....         8761         8761  Index/Metadata.iwa
2019-11-22 00:03:54 .....          322          322  Metadata/Properties.plist
2019-11-22 00:03:54 .....           36           36  Metadata/DocumentIdentifier
2019-11-22 00:03:54 .....          273          273  Metadata/BuildVersionHistory.plist
2019-11-22 00:03:54 .....        14611        14611  preview.jpg
2019-11-22 00:03:54 .....          838          838  preview-micro.jpg
2019-11-22 00:03:54 .....         1571         1571  preview-web.jpg
------------------- ----- ------------ ------------  ------------------------
2019-11-22 00:03:54              64530        64530  13 files
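
Putting the four layouts side by side, a rough triage can be done in a few lines of Python. This is only a sketch based on the samples shown in this post; the function name and the labels it returns are my own, and it looks only at whether the document is a folder or a ZIP and which of the index files described above is present.

```python
import os
import zipfile


def iwork_layout(path: str) -> str:
    """Guess which iWork storage layout a document uses from its structure."""
    if os.path.isdir(path):
        # A package: MacOS shows it as one file, everyone else sees a folder.
        if "Index.zip" in os.listdir(path):
            return "package with Index.zip"
        return "package, older layout"
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            names = set(zf.namelist())
        if "Index/Document.iwa" in names:
            return "zip with IWA"
        if "index.xml" in names:
            return "zip with index.xml"
        return "zip, unrecognised"
    return "unknown"
```

Running this over a collection before processing is one way to spot the folder-style documents early, before anyone is tempted to rename or rearrange their contents.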

Now iWork was not the only Apple software to use the Package/Bundle format for their documents. Be advised the following software may save to the package format.

I remember a few years ago, Trent Reznor (NIN) decided to release a few of his tracks in the Garageband format. A little harder to find these days, but the good old wayback machine kept a copy for us! Grab them here. Be warned, they may be in the package format. Thanks Apple!

GEDCOM

August 4, 2023 by Thor 1 Comment

One of the first PRONOM signatures I submitted was for a format I felt responsible for, considering where I worked. This is the GEDCOM format, which is an acronym for GEnealogical Data COMmunication. At the time I submitted the signature the format hadn’t been updated in years.

Very recently it has seen a renewed interest from those in the Genealogical community. In 2021 the format was renewed with a Version 7 specification with the purpose of simplifying and clarifying the format. In addition a new format was released to handle storing multimedia files in a container called GED-ZIP.

My first attempt at a signature was based on the specification generally, but with the new version released, I thought it might be good to revisit this format and see if we need to make any adjustments. There needs to be a new signature for the GED-ZIP format as well.

The original signature, fmt/851, created for PRONOM is:

302048454144{0-1024}47454443(0D0A|0D|0A)322056455253

It has an offset of 0-3 to account for any Unicode BOM, but starts with “0 HEAD”; this is the required start to a GEDCOM file. The next bits can be a source of the software which created the GEDCOM, using the tag “SOUR” which can also include a version of the software and name and address of the developer. This can take a bit of space so we include 0-1024 bytes for this information. The next tag is the subrecord of HEAD, “GEDC”, then the next subrecord, “VERS”. Most GEDCOM validations will look for HEAD.GEDC.VERS for the version of GEDCOM the file claims to conform with. The hex values, (0D0A|0D|0A), is the hard return accounting for the different systems that could write the GEDCOM.

A minimal GEDCOM version 5.5 would contain the following.

0 HEAD
1 GEDC
2 VERS 5.5
0 TRLR

The end of the file is marked by the tag “TRLR” in reference to a Trailer. I didn’t include this in my initial signature, but probably should have.

GEDCOM files have been around a long time. The first draft was released in 1984, but the GEDCOM structure we see now really didn’t come along until version 3 in 1987, when the format was standardized and made public. The HEAD.GEDC.VERS wasn’t standardized until version 4. You can see the history here.

So moving forward we should probably have a new PUID for Version 3, Version 4, Version 5 and the new Version 7 and leave the existing signature as is.

Version 3 only requires the tags HEAD, SOUR, DEST and the ending TRLR.

BOF 302048454144(0D0A|0D|0A)3120534F5552{0-128}312044455354
EOF 302054524C52

Version 4 requires the HEAD.GEDC.VERS sequence.

BOF 302048454144{0-1024}47454443(0D0A|0D|0A)3220564552532034
EOF 302054524C52

Version 5 is similar.

BOF 302048454144{0-1024}47454443(0D0A|0D|0A)3220564552532035
EOF 302054524C52

Version 7 is also similar.

BOF 302048454144{0-1024}47454443(0D0A|0D|0A)3220564552532037
EOF 302054524C52
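As a sanity check on the four proposed signatures, here is a small Python sketch that applies the same BOF and EOF patterns as regular expressions. It makes a few assumptions that are not part of the signatures as written: it keeps the 0-3 byte BOM allowance from fmt/851, it tolerates a line ending after the final TRLR, and it tries the versioned patterns before the Version 3 pattern because a later file with SOUR and DEST in its header would satisfy both. The function name classify_gedcom and the sample data are made up.

```python
import re
from typing import Optional

NL = rb"(?:\r\n|\r|\n)"   # (0D0A|0D|0A)
BOM = rb"\A.{0,3}"        # assumption: same 0-3 byte offset as fmt/851

# Assumption: allow a trailing line ending after the closing "0 TRLR".
TRLR_AT_EOF = re.compile(rb"0 TRLR\s{0,4}\Z")


def versioned(major: bytes) -> "re.Pattern":
    """BOF pattern: 0 HEAD {0-1024} GEDC <newline> 2 VERS <major>."""
    return re.compile(
        BOM + rb"0 HEAD.{0,1024}?GEDC" + NL + rb"2 VERS " + major, re.DOTALL
    )


# Most specific patterns first; Version 3 has no HEAD.GEDC.VERS to check.
CHECKS = [
    ("GEDCOM 7", versioned(b"7")),
    ("GEDCOM 5", versioned(b"5")),
    ("GEDCOM 4", versioned(b"4")),
    ("GEDCOM 3", re.compile(
        BOM + rb"0 HEAD" + NL + rb"1 SOUR.{0,128}?1 DEST", re.DOTALL)),
]


def classify_gedcom(data: bytes) -> Optional[str]:
    """Return a version label, or None if no pattern matches."""
    if not TRLR_AT_EOF.search(data):
        return None
    for label, pattern in CHECKS:
        if pattern.match(data):
            return label
    return None


print(classify_gedcom(b"0 HEAD\n1 GEDC\n2 VERS 5.5\n0 TRLR\n"))
print(classify_gedcom(b"0 HEAD\n1 SOUR PAF\n1 DEST ANSTFILE\n0 TRLR\n"))
```

In PRONOM itself the overlap between Version 3 and the later versions would be handled with signature priorities rather than the ordering used here.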

For the new GED-ZIP format we need to create a container signature, as the format is a ZIP file with a GEDCOM inside. The GED-ZIP specification states:

A GEDCOM ZIP file should:
• include exactly one GEDCOM file with the name “gedcom.ged”
• include all the multimedia objects referenced by that GEDCOM file
• not include unreferenced multimedia objects

Our Container signature would look like this:

<ContainerSignature Id="1000" ContainerType="ZIP">
  <Description>GEDZIP</Description>
  <Files>
    <File>
      <Path>gedcom.ged</Path>
      <BinarySignatures>
        <InternalSignatureCollection>
          <InternalSignature ID="300">
            <ByteSequence Reference="BOFoffset">
              <SubSequence Position="1" SubSeqMinOffset="0" SubSeqMaxOffset="3">
                <Sequence>30 20 48 45 41 44</Sequence>
              </SubSequence>
            </ByteSequence>
          </InternalSignature>
        </InternalSignatureCollection>
      </BinarySignatures>
    </File>
  </Files>
</ContainerSignature>
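The container signature amounts to two checks: the ZIP must contain a file named gedcom.ged, and that file must start with “0 HEAD” within the first 0-3 bytes. A minimal Python sketch of the same logic using the standard zipfile module follows. The function name looks_like_gedzip is made up, and the test archive is built in memory purely for illustration.

```python
import io
import zipfile


def looks_like_gedzip(source) -> bool:
    """source is a path or a seekable binary file object."""
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source) as archive:
        if "gedcom.ged" not in archive.namelist():
            return False
        with archive.open("gedcom.ged") as member:
            start = member.read(9)  # 6 bytes of "0 HEAD" plus up to 3 of offset
    # Mirror SubSeqMinOffset=0 / SubSeqMaxOffset=3 from the signature.
    return any(start[i:i + 6] == b"0 HEAD" for i in range(4))


# Build a tiny GED-ZIP in memory to try it out.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("gedcom.ged", "0 HEAD\n1 GEDC\n2 VERS 7.0\n0 TRLR\n")

print(looks_like_gedzip(buffer))
```

This only mirrors the signature. It does not check the other two requirements from the specification about multimedia objects.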

I recently learned of a variation on the GEDCOM format which can cause a lot of confusion. The software Family Tree Maker could export to the GEDCOM format, but had a checkbox which, when unchecked, wrote the tags out in full instead of abbreviating them. GEDCOM readers expect the tags exactly as they are defined, which makes me wonder why they would do something so confusing. You can read more about this format here.

I was recently made aware that a few of these rogue “GEDCOM” files were out there, in the wild, causing confusion during identification. My first thought was to loosen the signature a little to fit these variations, but then I discovered they are not GEDCOM files. In fact, later versions of FTM forgot they did this and would error when you tried to import them back into the software. I think it would be wise to identify these FTM GEDCOM variants, just so one is aware of the difference and can then decide how to handle them properly.

The format was named “FTW TEXT”, so we can use that as the name of the new signature. Instead of “0 HEAD”, “0 HEADER” is used; instead of “1 SOUR”, “1 SOURCE” is used; and instead of “0 TRLR” at the end, “0 TRAILER” is used.

BOF 3020484541444552(0D0A|0D|0A)3120534F55524345
EOF 3020545241494C4552
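Decoding the hex shows the signature is just the long-form tags, and a short Python sketch can confirm that it separates these files from ordinary GEDCOM. The sample bytes below are invented for illustration and are not taken from a real FTM export; the function name is_ftw_text is likewise made up, and allowing a line ending after TRAILER is an assumption.

```python
import re

NL = rb"(?:\r\n|\r|\n)"  # (0D0A|0D|0A)

# BOF: 0 HEADER <newline> 1 SOURCE, with no offset, as in the signature above.
FTW_BOF = re.compile(rb"\A0 HEADER" + NL + rb"1 SOURCE")
# EOF: 0 TRAILER; assumption: a trailing line ending may follow it.
FTW_EOF = re.compile(rb"0 TRAILER\s{0,4}\Z")


def is_ftw_text(data: bytes) -> bool:
    return bool(FTW_BOF.match(data) and FTW_EOF.search(data))


# Invented samples: one long-form file, one ordinary GEDCOM.
ftw_sample = b"0 HEADER\r\n1 SOURCE FTW\r\n0 TRAILER\r\n"
gedcom_sample = b"0 HEAD\n1 SOUR FTW\n1 GEDC\n2 VERS 5.5\n0 TRLR\n"

print(bytes.fromhex("3020484541444552"), bytes.fromhex("3120534F55524345"))
print(is_ftw_text(ftw_sample), is_ftw_text(gedcom_sample))
```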

It was fun to look back at this format and try and improve on it a bit. I learned more than I did when I initially wrote the signature and hopefully documented it well enough. The FTM variant was an interesting twist I was not expecting, which I am sure will show up again in the future. Take a look at the signatures and samples I updated and let me know what you think.

RIS Citation

July 28, 2023 by Thor

Up until recently I was working in a corporate archive preserving all sorts of content. Over the years the corporation used many different software packages to produce all sorts of data. When I moved to an academic library I saw much of the same content, but there were some new file formats which I needed to document and manage. Many of those come from scholarly journals, theses, dissertations, and data sets for projects.

One format which I came across often, but which seems to be missing from the standard lists of known file formats, is the RefMan citation format. This is a simple text-based format which serves to standardize citations from scholarly sources. Created by Research Information Systems, the format uses the RIS extension and is used by ProCite and Reference Manager (RefMan). ISI ResearchSoft managed the format for a while in the 1990s, and this is where you can find most of the specifications.

Now that I am a little more familiar with the format I see it everywhere! Find any scholarly journal and there will usually be a “cite” feature to download the citation in a few formats, RIS being one of the most common.

Example: Theory and Craft of Digital Preservation

It can be called by a few names, mostly based on the systems which support it. You might see Ris (Zotero), or EasyBib, Mendeley, ProCite, Reference Manager, and others. But they all follow the same format.

The format is a simple plain text format, with two-character codes that indicate the different field types and tags. The basic structure looks like this:

TY  - BOOK
AU  - Owens, Trevor 
LA  - eng
PB  - Johns Hopkins University Press Baltimore, Maryland
CY  - Baltimore, Maryland
SN  - 9781421426976; 1421426978
PY  - 2018
TI  - The theory and craft of digital preservation
LK  - https://worldcat.org/title/1030899528
ER  -

The first tag always needs to be “TY” and the last tag “ER”. TY stands for Type of reference and ER stands for End of reference.
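Because every record opens with TY and closes with ER, splitting a RIS file into records takes only a few lines. Here is a minimal Python sketch, assuming each field is a two-character tag, two spaces, a hyphen, and then the value, as in the example above. The function name parse_ris is made up, and lines that don't look like a tag (such as header lines) are simply skipped.

```python
import re
from typing import Dict, List

# Two-character tag, two spaces, hyphen, optional space, then the value.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text: str) -> List[Dict[str, List[str]]]:
    """Split RIS text into records; each tag maps to a list of values."""
    records: List[Dict[str, List[str]]] = []
    current = None
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if not match:
            continue  # not a tag line; ignored in this sketch
        tag, value = match.group(1), match.group(2).strip()
        if tag == "TY":
            current = {"TY": [value]}      # TY opens a new record
        elif current is None:
            continue                       # tag outside any record
        elif tag == "ER":
            records.append(current)        # ER closes the record
            current = None
        else:
            current.setdefault(tag, []).append(value)
    return records


sample = """TY  - BOOK
AU  - Owens, Trevor
PY  - 2018
TI  - The theory and craft of digital preservation
ER  -
"""

for record in parse_ris(sample):
    print(record["TY"], record["AU"], record["TI"])
```

Values are kept as lists because tags such as AU can repeat within one record.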

There are actually two versions of the format: the original specification and a later one which added some header information. You can download the full documentation here.

Provider: The name of the information provider (required)
Database: The name of the database (optional)
Tagformat: The name of the tag format used to identify fields (optional)
Content: The media type for the body of the file (required)

Creating a PRONOM signature for this text format is pretty straightforward. Looking for the TY and ER strings should be enough to ensure the format doesn’t clash with other text-based formats. Text formats are notoriously difficult to identify, but when they have expected patterns it makes it a little easier. I had to add a little buffer at the beginning of the signature to allow for the newer header information, but more samples will be needed to see if this is enough to identify the format in all situations. Take a look and see if it works for you!
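The idea behind the signature can be sketched in Python: look for a TY line within a buffer at the start of the file, and an ER line at the end. This is not the submitted PRONOM signature. The 512-byte buffer, the optional UTF-8 BOM, and the name looks_like_ris are all assumptions made for illustration, and the header values in the second sample are invented.

```python
import re

HEADER_BUFFER = 512  # hypothetical allowance for header lines before TY

# "TY  - " at the very start (after an optional UTF-8 BOM) or at a line start.
TY_NEAR_START = re.compile(rb"(?:\A(?:\xef\xbb\xbf)?|\r\n|\r|\n)TY  - ")
# "ER  -" on the last line, allowing trailing spaces and line endings.
ER_AT_END = re.compile(rb"(?:\r\n|\r|\n)ER  -[ \t]*(?:\r\n|\r|\n)*\Z")


def looks_like_ris(data: bytes) -> bool:
    head = data[:HEADER_BUFFER]
    return bool(TY_NEAR_START.search(head) and ER_AT_END.search(data))


# Original layout: the file starts directly with TY.
plain = b"TY  - BOOK\nTI  - The theory and craft of digital preservation\nER  -\n"
# Later layout: header lines first (values invented for this example).
with_header = (
    b"Provider: Example Provider\r\n"
    b"Content: text/plain\r\n\r\n"
    b"TY  - JOUR\r\nER  - \r\n"
)

print(looks_like_ris(plain), looks_like_ris(with_header))
print(looks_like_ris(b"Just an ordinary text file.\n"))
```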