BINHEX

Working with files in todays world is much different than before. Today getting files back and forth from the cloud or through email is relatively easy, unlike the early days when we used FTP sites and needed to encode our data to properly transfer. I remember using an FTP program on my old Mac called Fetch. We had to determine if the content was to be transferred as text or binary.

Picking the right encoding was critical to getting the content transferred correctly, this was even more critical when working with Macintosh files which needed a resource fork and/or finder attributes to work properly. In those cases a MacBinary or BinHex file was required! Fetch would automatically identify those formats and decode them for you.

If you need a refresher on MacBinary and AppleSingle, you can view my iPres 2022 presentation.

One format I didn’t spend much if any time on is the BinHex format. BinHex was a format born out of necessity to move files back and forth across the World Web Web, bulletin boards, AOL, Compuserve, and the like. The FTP program Fetch glossary describes BinHex as:

BinHex (sometimes called BinHex4) is a format for representing a Macintosh file in text form.

The Macintosh file is converted to a series of lines, each made up of letters, numbers, and

punctuation. Because BinHex files are simply text, they can be sent through most electronic mail

systems and stored on most computers. However the conversion to text makes the file larger, so it

takes longer to transmit a file in BinHex format than if the file was represented some other way.

The suffix “.hqx” usually indicates a BinHex format file.

You can still find many of these HQX files floating around the interwebs and on older CDs from the 1990’s. One such CD recently came into my possession. I managed to get a copy of the book “Internet File Formats“, by Tim Kientzie. It came with a CD-ROM with lots of goodies included. Some sample files, specifications, and software. The disc itself is an ISO 9660 partitioned disc, but includes a few Macintosh formats, so the author put many of the software files in the HQX format to maintain the much needed resource fork Macintosh applications need in order to run.

I initially ran the whole disc through DROID to get an idea what was on the disc and if any sample formats were unidentified (something I do regularly), and found majority of the HQX files didn’t identify as they should have to PRONOM PUID x-fmt/416. The signature is an older one, from 2010, but since the format isn’t updated anymore it should be solid. Or so I thought.

Since BINHEX files are encoded as text, lets take a look at a couple of these from the disc which didn’t identify.

The PRONOM signature currently is:

File extension: hqx	
Name	BinHex Binary Text
Description	Header: (This file must be converted with BinHex
Byte sequences	
Position type	Absolute from BOF
Offset	0	 
Value	28546869732066696C65206D75737420626520636F6E76657274656420776974682042696E486578

That “Value” listed in hexadecimal decodes to: “(This file must be converted with BinHex” as listed in the description. We can see this line in the file above, but the signature assumes the value begins at offset 0 from the beginning of the file. So its looking for that value at the start of the file, but this file seems to have some additional text before the value. What does the specs say?

The BinHex 4.0 format was created in 1985 and defined in RFC 1741.

   The whole file is considered as a stream of bits.  This stream will
   be divided in blocks of 6 bits and then converted to one of 64
   characters contained in a table.  The characters in this table have
   been chosen for maximum noise protection.  The format will start
   with a ":" (first character on a line) and end with a ":".
   There will be a maximum of 64 characters on a line.  It must be
   preceded, by this comment, starting in column 1 (it does not start
   in column 1 in this document):

    (This file must be converted with BinHex 4.0)

   Any text before this comment is to be ignored.

   The characters used is:

    !"#$%&'()*+,- 012345689@ABCDEFGHIJKLMNPQRSTUVXYZ[`abcdefhijklmpqr

Ok, so in the specs we can see the “Value” string must be there, but according to the specification, any text before this comment is to be ignored. So adding some instructions and even an email header at the beginning is ok, as long as the value string is there right before the encoded data.

We also learn a couple interesting things. The first character of the first line after the string should be a “:” and the last line should end with a “:” as well. That could help make the signature more solid. We also learn there are a maximum of 64 characters per line. The last line will probably not have full maximum, but the previous lines should…. I wonder if we could use this fixed position from the initial “:” to add even more strength to the signature.

So an updated PRONOM signature might look like:

BOF: {0-4084}28546869732066696C65206D75737420626520636F6E76657274656420776974682042696E486578{6-9}3A

EOF: 3A (Max Offset 64)

Adding the 4,084 bytes at the beginning allow for additional text. This value worked for my samples but there could be others out there with more. The {6-9} bytes in between the string and the colon account for the various way newlines are encoded. Sometimes is one “0A” byte, other times it is “OD”, and others its both. After testing, adding values in the signature to account for the 64 byte line can fail if the file has only one line, so I left it out.

The EOF should just be the colon (3A), but I found many of my samples had various line endings and other random characters. Hence the 64 bytes for max offset.

Also, the current PRONOM entry doesn’t include the Mime-Type, which is: “application/mac-binhex40”

Hopefully this update will add some strength to the signature and follow the specification closer. The new signature even works on files with extra content at the beginning!

This image has an empty alt attribute; its file name is long-binhex-header.png

There are a number of software titles you can use to encode and decode a BinHex file. On a modern Mac, try using The Unarchiver, or Stuffit Expander. From the commandline, you can use the macutil library or the CLI version of Unarchiver. Although the MacOS has a built in utility to decode BinHex files. If you are using a classic version of Macintosh OS, you can find a number of utilities on Macintosh Garden.

Oh, and also, the CD-ROM I mentioned earlier has a few “fun” features. Not sure if they are on purpose or if errors were made during mastering, but a few filenames have some hidden extra characters and one folder puts any tool traversing the directory into a loop, even droid. Have fun!

Gone in a Flash

This week I am at the annual iPres digital preservation conference. It is an amazing week of meeting colleagues and old friends who share the same passion of digital preservation. Outside of this community and my co-workers, talking about file formats and digital preservation usually bores people to death and I can hear some of them mumble under their breath, “nerd”! I term I am happy to accept.

At the conference, which is in lovely Urbana-Champaign Illinois this year, I am trying to soak in all the amazing talks and conversations about the challenges facing our work. There were a couple great workshops on Persistent Identifiers and Digital Object Storage Criteria. The presentations I made were of course on File Formats, documentation, and obsolescence. One talk before my panel conversation was about the ubiquitous Adobe Flash format.

The paper, “Around for Decades, Gone in a Flash: How we dealt with Flash objects and the National Archives of the Netherlands” was presented by Lotte Wijsman and Marin Rappard. They knew they had flash objects in their web archives and wanted to go through the process of how they might be preserved and accessed. They started out looking for any files with “FLA”, “SWF”, and “FLV” as extensions. This proved problematic as there were references to those extensions within other documents and objects. They then used DROID to identify the flash formats. “SWF” has quite a number of format PUID’s.

PUIDFormat NameFormat VersionExtension
fmt/104Macromedia Flash1swf,
fmt/105Macromedia Flash2swf,
fmt/106Macromedia Flash3swf,
fmt/107Macromedia Flash4swf,
fmt/108Macromedia Flash5swf,
fmt/109Macromedia Flash6swf,
fmt/110Macromedia Flash7swf,
fmt/505Adobe Flash8swf,
fmt/506Adobe Flash9swf,
fmt/507Adobe Flash10swf,
fmt/757Adobe Flash11swf,
fmt/758Adobe Flash12swf,
fmt/759Adobe Flash13swf,
fmt/760Adobe Flash14swf,
fmt/761Adobe Flash15swf,
fmt/762Adobe Flash16swf,
fmt/763Adobe Flash17swf,
fmt/764Adobe Flash18swf,
fmt/765Adobe Flash19swf,
fmt/766Adobe Flash20swf,
fmt/767Adobe Flash21swf,
fmt/768Adobe Flash22swf,
fmt/769Adobe Flash23swf,
fmt/770Adobe Flash24swf,
fmt/771Adobe Flash25swf,
fmt/772Adobe Flash26swf,
fmt/773Adobe Flash27swf,
fmt/774Adobe Flash28swf,
fmt/775Adobe Flash29swf,
fmt/776Adobe Flash30swf,

Even the Macromedia/Adobe Flash Video format has a PRONOM PUID, x-fmt/382.

The format missing from PRONOM is the FLA format. FLA is the native format for Macromedia/Adobe Flash for saving the source project of your Flash document. SWF files are compiled from the FLA source. This means the the SWF will be the most common format found on the web and in public places, but the FLA format might be more often found on local drives and working folders.

The Flash format and software was actually created by Future Wave software in 1996 as FutureSplash Animator, but was shortly bought by Macromedia later that year and Flash 1.0 was born. FutureSplash used the extension .SPA, but Macromedia changed it to FLA.

The format was initially based on the Microsoft Compound File Format or the OLE container format.

oledir Flash4-S01.fla 
oledir 0.54 - http://decalage.info/python/oletools
OLE directory entries in file Flash4-S01.fla:
----+------+-------+----------------------+-----+-----+-----+--------+------
id  |Status|Type   |Name                  |Left |Right|Child|1st Sect|Size  
----+------+-------+----------------------+-----+-----+-----+--------+------
0   |<Used>|Root   |Root Entry            |-    |-    |1    |5       |4416  
1   |<Used>|Stream |Contents              |2    |-    |-    |6       |4013  
2   |<Used>|Stream |Page 1                |-    |-    |-    |0       |329   
3   |unused|Empty  |                      |-    |-    |-    |0       |0     
----+----------------------------+------+--------------------------------------
id  |Name                        |Size  |CLSID                                 
----+----------------------------+------+--------------------------------------
0   |Root Entry                  |-     |597CAA70-72AA-11CF-831E-524153480000  
1   |Contents                    |4013  |                                      
2   |Page 1                      |329   |   

The FLA format stayed with OLE until Adobe Flash CS5, which the format changed to use a ZIP container to store all the content.

Flash5.5-S01.fla
Type = zip
Physical Size = 216632

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2022-07-09 11:57:46 .....           25           25  mimetype
2022-07-09 11:57:46 .....            9            9  Flash5.5-S01.xfl
2022-07-09 11:57:46 D....            0            0  LIBRARY
2022-07-09 11:57:46 D....            0            0  META-INF
2022-07-09 11:57:46 .....        49267         3936  DOMDocument.xml
2022-07-09 11:57:48 .....         9735         1103  META-INF/metadata.xml
2022-07-09 11:57:48 .....         8093         2222  PublishSettings.xml
2022-07-09 11:57:48 .....            0            0  MobileSettings.xml
2022-07-09 11:57:48 D....            0            0  LIBRARY/Mouth shape graphic symbols
2022-07-09 11:57:48 D....            0            0  LIBRARY/Voice
2022-07-09 11:57:48 .....       151006       151006  bin/M 1 1252032698.dat
2022-07-09 11:57:48 .....        99707        15311  LIBRARY/mouth.xml
2022-07-09 11:57:48 .....        16510         4534  LIBRARY/Mouth shape graphic symbols/A I.xml
2022-07-09 11:57:48 .....        14334         4086  LIBRARY/Mouth shape graphic symbols/C D G K N R S Th Y Z.xml
2022-07-09 11:57:48 .....        14531         4040  LIBRARY/Mouth shape graphic symbols/E.xml
2022-07-09 11:57:48 .....        15846         4007  LIBRARY/Mouth shape graphic symbols/F V D Th.xml
2022-07-09 11:57:48 .....        13093         3542  LIBRARY/Mouth shape graphic symbols/L D Th.xml
2022-07-09 11:57:48 .....         2106          751  LIBRARY/Mouth shape graphic symbols/M B P Closed.xml
2022-07-09 11:57:48 .....        14130         3949  LIBRARY/Mouth shape graphic symbols/O.xml
2022-07-09 11:57:48 .....        11082         2951  LIBRARY/Mouth shape graphic symbols/Open_Rest.xml
2022-07-09 11:57:48 .....        14847         4066  LIBRARY/Mouth shape graphic symbols/U.xml
2022-07-09 11:57:48 .....         8139         2202  LIBRARY/Mouth shape graphic symbols/W Q.xml
2022-07-09 11:57:48 .....        15768         3914  LIBRARY/panda.xml
2022-07-09 11:57:48 .....        10477         1064  LIBRARY/sample graph.xml
2022-07-09 11:57:48 .....          538          538  bin/SymDepend.cache
------------------- ----- ------------ ------------  ------------------------
2022-07-09 11:57:48             469243       213256  21 files, 4 folders

The move to a ZIP container included a new format, XFL. This XFL file is a simple text file with the text “PROXY-CS5″. In the DOMDocument.xml file we find an XML namespace, xmlns=”http://ns.adobe.com/xfl/2008/” and a version of the XFL structure, xflVersion=”2.1″.

This ZIP compressed FLA file is still being used in the current Adobe Animate software, which no longer uses the flash technology and uses more modern web formats like HTML5 to display the animations.

I took each version and made a PRONOM signature, which you can find here with samples. These container signatures should cover all the major changes for the format, but there is a problem……..

Listing archive: Flash5.5-S01v5.fla

--
Path = Flash5.5-S01v5.fla
Type = zip
ERRORS:
Headers Error
Physical Size = 216581
Embedded Stub Size = 63
Characteristics = Local

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2022-07-09 11:57:46 .....           25           25  mimetype
2022-07-09 11:57:46 D....            0            0  LIBRARY
2022-07-09 11:57:46 D....            0            0  META-INF
2022-07-09 11:57:46 .....        48556         3742  DOMDocument.xml
2022-07-09 11:57:48 .....        10133         1112  META-INF/metadata.xml
2022-07-09 11:57:48 .....         8115         2219  PublishSettings.xml
2022-07-09 11:57:48 .....            0            0  MobileSettings.xml
2022-07-09 11:57:48 D....            0            0  LIBRARY/Mouth shape graphic symbols
2022-07-09 11:57:48 D....            0            0  LIBRARY/Voice
2022-07-09 11:57:48 .....       151006       151006  bin/M 1 1252032698.dat
2022-07-09 11:57:48 .....        99551        15319  LIBRARY/mouth.xml
2022-07-09 11:57:48 .....        16580         4536  LIBRARY/Mouth shape graphic symbols/A I.xml
2022-07-09 11:57:48 .....        14404         4089  LIBRARY/Mouth shape graphic symbols/C D G K N R S Th Y Z.xml
2022-07-09 11:57:48 .....        14531         4044  LIBRARY/Mouth shape graphic symbols/E.xml
2022-07-09 11:57:48 .....        15848         4008  LIBRARY/Mouth shape graphic symbols/F V D Th.xml
2022-07-09 11:57:48 .....        13024         3546  LIBRARY/Mouth shape graphic symbols/L D Th.xml
2022-07-09 11:57:48 .....         2106          752  LIBRARY/Mouth shape graphic symbols/M B P Closed.xml
2022-07-09 11:57:48 .....        14200         3955  LIBRARY/Mouth shape graphic symbols/O.xml
2022-07-09 11:57:48 .....        11152         2963  LIBRARY/Mouth shape graphic symbols/Open_Rest.xml
2022-07-09 11:57:48 .....        14777         4069  LIBRARY/Mouth shape graphic symbols/U.xml
2022-07-09 11:57:48 .....         8287         2228  LIBRARY/Mouth shape graphic symbols/W Q.xml
2022-07-09 11:57:48 .....        15768         3914  LIBRARY/panda.xml
2022-07-09 11:57:48 .....        10477         1064  LIBRARY/sample graph.xml
2022-07-09 11:57:48 .....          538          538  bin/SymDepend.cache
2022-07-09 11:57:46 .....           25           25  mimetype
2022-07-09 11:58:18 .....            9            9  Flash5.5-S01v5.xfl
------------------- ----- ------------ ------------  ------------------------
2022-07-09 11:58:18             469112       213163  22 files, 4 folders

Turns out majority of the samples I have from many versions of Adobe Flash after CS5 have a ZIP Header error. When using the new signatures in DROID, the samples with the header errors will fail in the DROID’s zip library processing. The DROID logs shows this issue:

Could not process the potential container format (ZIP): file:///Flash5.5-S01v5.fla	
Expected 25 more entries in the Central Directory!

The Central Directory header in a ZIP file is quite important to the proper function of the ZIP container. Wikipedia has a great explanation of the header. You may notice in the listing above the file “mimetype” is shown twice which is probably the extra entries the parser wasn’t expecting.

So currently the identification of majority of these FLA formats is on hold until a way is discovered to ignore the error and continue the container identification in DROID.

TIFF

Lets talk TIFF, or Tagged Image File Format. It is well documented and accepted by the community. The format has been around since 1986, first developed by Aldus as a image format for scanners. The TIFF format is now used worldwide as a preferred format for scanning and preservation of cultural heritage objects.

As amazing as the format is, there are a few features of the format which can be a preservation risk. I want to focus on three of those risks.

The Tagged Image File Format has a well known header:

A TIFF file begins with an 8-byte image file header, containing the following
information:
Bytes 0-1: The byte order used within the file. Legal values are:
“II” (4949.H) LSB (IBM)
“MM” (4D4D.H) MSB (Mac)
Bytes 2-3 An arbitrary but carefully chosen number (42).
Bytes 4-7 The offset (in bytes) of the first IFD.

Putting this poster of the TIFF structure in your office will impress your co-workers, guaranteed. Thanks Ange!

The three risks I have been pondering lately are:

  • Multiple IFD’s
  • Metadata
  • DNG format

TIFF version 6.0 was released in 1992 and is the most recent version. Although some vendors are free to add their own private tags. In 1995 Adobe added an addendum which added some additions for use with PageMaker.

One of the main features of the TIFF format is its ability to hold multiple pages. In Adobe’s words:

TIFF has always supported what amounts to a singly linked list of IFD’s in a single TIFF file, via the “next IFD pointer,” though most applications currently ignore any IFD beyond the first one. Probably the best use for a linked list of IFD’s is when you want to store multiple different but related images in the same file—a ‘burst’ of images from a camera, for example.

Adobe PageMaker® 6.0 TIFF Technical Notes

Take note of the highlighted text, software like Adobe Photoshop will ignore any IFD beyond the first one. Even worse, Photoshop won’t even mention there are additional IFD’s. I have used many document scanners which default to multipage TIFF capture and have lost pages because of this. Because of this I have always built my workflows around single page TIFF’s for all scanning and we check against this as a rule.

What also makes this hard is how some capture software uses additional IFD’s. CaptureOne is a popular imaging software used by photographers and cultural heritage institutions. We have used it to connect to our PhaseOne cameras for capture of books and other flat objects. By default the software exports the final TIFF image with a thumbnail.

With the “No Thumbnail” unchecked we get this TIFF structure:

identify _MG_0193.tif 
_MG_0193.tif[0] TIFF 3456x5184 3456x5184+0+0 8-bit sRGB 51.3136MiB 0.030u 0:00.026
_MG_0193.tif[1] TIFF 107x160 107x160+0+0 8-bit sRGB 0.000u 0:00.007

 <IFD0:ImageWidth>3456</IFD0:ImageWidth>
 <IFD0:ImageHeight>5184</IFD0:ImageHeight>
 <IFD1:SubfileType>Reduced-resolution image</IFD1:SubfileType>
 <IFD1:ImageWidth>107</IFD1:ImageWidth>
 <IFD1:ImageHeight>160</IFD1:ImageHeight>
 <IFD1:BitsPerSample>8 8 8</IFD1:BitsPerSample>

So Imagemagick identifies two pages 0 and 1 with the second a much smaller resolution than the first. Exiftool reports back IFD0 and IFD1 with IFD1 having a SubfileType of a Reduced-resolution image. Makes sense, it is a thumbnail. In looking at the specifications for TIFF 6.0, I can find no mention of the word “thumbnail”, but the specification does make mention of “reduced resolution” images:

If multiple subfiles are written, the first one must be the full-resolution image. Subsequent images, such as reduced-resolution images, may be in any order in the TIFF file.

The specification also gives us this warning:

TIFF readers must be prepared for multiple images (subfiles) per TIFF file, although they are not required to do anything with images after the first one.

Scary to think about how a reader is not required to do anything, not even warn against multiple IFD’s (Subfiles).

The EXIF specifications seem to expand on this through attributes:

Attribute information can be recorded in 2 IFDs (0th IFD, 1st IFD) following the TIFF structure, including the File Header. The 0th IFD records compressed image attributes (the image itself). The 1st IFD may be used for thumbnail images. 

Page 97 of EXIF Specification

Take a look at the information and Figure 6 on page 21-22 in the EXIF specification.

Adobe early on decided to use their own tags for thumbnail data. Since Photoshop 5, Adobe has stored the thumbnail in Tag 1036.

 1036 Photoshop Thumbnail             : (Binary data 4625 bytes, use -b option to extract)

There is another TIFF structure sometimes used in older FAX compressed multipage TIFFs and now used in the DNG Specification. The SubIFD tag was writable using the libtiff “thumbnail” tool, but is now depreciated. Originally described in the TIFF/EP specification, DNG files use SubIFD trees.

DNG files are often talked about in the same way TIFF files are, and many tools handle both seamlessly. One of the major differences is that DNG files switch their IFD use. IFD0 is often the reduced-resolution thumbnail and SubIFD the full-resolution image.

<IFD0:SubfileType>Reduced-resolution image</IFD0:SubfileType>
<IFD0:ImageWidth>256</IFD0:ImageWidth>
<IFD0:ImageHeight>171</IFD0:ImageHeight> 

<SubIFD:SubfileType>Full-resolution image</SubIFD:SubfileType>
<SubIFD:ImageWidth>3516</SubIFD:ImageWidth>
<SubIFD:ImageHeight>2328</SubIFD:ImageHeight>

This can cause issues when trying to extract technical metadata from images, knowing which IFD to get the main image details requires a bit of work. I’ll save DNG for another blog post.

TIFF Metadata is a vital part of preservation. The metadata can provide technical properties of the file along with some descriptive information. It amazes me how much the embedded metadata can vary from a scanner or camera capture device. The digitization lab I worked in for years had scanners from Epson, Fujitu, Canon and others. Along with cameras made by Canon, PhaseOne, and Copibooks. Each one with a vastly different set of metadata using different standards. Even when each workflow produced final uncompressed TIFF images, they all varied in metadata.

The TIFF images with the leasT amount of metadata was from the Epson scanners. When using the free Epson Scan software, not a single metadata field was embedded, no dates, scanner model or manufacturer. More was embedded when you used the Silverfast professional software included with each Epson, but even then if you didn’t add any IPTC fields, the metadata was limited.

The most metadata came from the camera systems, especially the PhaseOne/CaptureOne systems. Even though it produced the most and had valuable properties, there were some issues. I already discussed the thumbnail issue, but PhaseOne decided they wanted to change how some of the tags were used.

CaptureOne has quite the list of white balance options. Which is great for the photographer, but not so great for adhering to the TIFF standard.

According to the EXIF TIFF Specification, there are only two values allowed for White Balance, Auto or Manual. A CaptureOne produced TIFF will have this value if Auto or Manual are not chosen:

41987 White Balance                   : Unknown (5)
37384 Light Source                    : Other

The different lighting situations should be identified using the “Light Source” 37384 tag, but alas they chose to add to white balance instead. When I asked about this, they responded that they requested this update to the TIFF spec, but they weren’t willing so they took matters into their own hands. You can read the conversation on the JHOVE issues page.

The TIFF format is very accepted in the Cultural Heritage community as a preferred preservation format. The specification is well understood and documented. I just hope we can continue to openly discuss issues that arise in preservation which add risk to our collections. Some issues are minor compared to others. Sometimes it’s the tools we use to validate formats like TIFF which are wrong and need to be corrected. The talk more about these issues and how to manage them.

HighMAT

Before the days of streaming and devices likeSmart TVs, AppleTV and Fire sticks, a few companies tried their best to come up with ways to make viewing your media on your TV mainstream. In a previous blog post I touched on the Kodak PhotoCD method, but there is one you are probably even less familiar with. HighMAT. HighMAT, or High-Performance Media Access Technology was a technology co-developed by Microsoft and Panasonic. You may have at one point owned a DVD player which had the technology built-in, but may have never used it. It came on the scene around 2002, but was abandoned by 2008.

Panasonic DVD/CD Player with HighMAT playback.

There were quite a few devices stamped with the HighMAT logo. The technology allow you to playback any Audio and Images like a DVD, with a menu and everything.

There was three different types of HighMAT compatible devices, Audio, Audio-Image, and Audio-Image-Video.

Writing data to the HighMAT format could be done with a plugin for Windows which added the functionality to Windows Media Player for burning audio playlists to the HighMAT format or through the standard CD Writing Wizard built-in to Windows XP. An extra screen would come up asking if you would like to make the CD HighMAT compatible. Making video compatible HighMAT CDs could be done through Movie Maker.

When a HighMAT CD-R/CD-RW is authored we get an interesting CD. It appears to be a Mode 2 Form 1 format:

/dev/disk10 (internal, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:        CD_partition_scheme                        *846.4 MB   disk10
   1:       CD_ROM_Mode_2_Form_1 Highmat02               2.7 MB     disk10s0

If you would like to check out a sample disc, you can grab the ISO or BIN/CUE here.

tree /Volumes/Highmat02
/Volumes/Highmat02
├── Audio\ Samples
│   ├── 17\ [no\ artist]\ -\ Speaker\ Identification\ Test.wma
│   └── sine.wma
├── HIGHMAT
│   ├── AUTHOR.XML
│   ├── CONTENTS.HMT
│   ├── IMAGES
│   │   ├── T0.HMT
│   │   ├── T1.HMT
│   │   ├── T10.HMT
│   │   ├── T11.HMT
│   │   ├── T12.HMT
│   │   ├── T13.HMT
│   │   ├── T2.HMT
│   │   ├── T3.HMT
│   │   ├── T4.HMT
│   │   ├── T5.HMT
│   │   ├── T6.HMT
│   │   ├── T7.HMT
│   │   ├── T8.HMT
│   │   └── T9.HMT
│   ├── MENU.HMT
│   ├── PLAYLIST
│   │   ├── 00000001.HMT
│   │   ├── 00000002.HMT
│   │   ├── 00000003.HMT
│   │   ├── 00000004.HMT
│   │   ├── 00000005.HMT
│   │   ├── 00000006.HMT
│   │   └── 00000007.HMT
│   └── TEXT.HMT
└── My\ Pics
    ├── Blue\ hills.jpg
    ├── Sunset.jpg
    ├── Water\ lilies.jpg
    └── Winter.jpg

5 directories, 31 files

There is a lot going on here, lets take a look at a few of the formats we find in this disc structure. The files added to the CD are converted to WMA if you checked the “Convert Files” feature and are accessible like a normal data CD. The HighMAT folder is created to make a compatible HighMAT disc. Except for one XML file the rest of the files in the HighMAT folder all have an HMT extension. The author.xml file contains the root element <HMT> with some filenames indicating some of the HMT files may be thumbnails. If we open one of the HMT thumbnail files in a hex editor we can see:

Just a plain old JPG header. Exiftool tells us it is small 160×120 pixel image, must be a thumbnail. But lets take a look at another HMT file.

Even though the Menu.hmt file has the same extension as the thumbnails, this file is definitely not a JPG file with pixel data. Same goes for the Contents and Text files as well, unique formats.

The files in the playlist folder also have a unique format.

So it seems all the HighMAT folder really does is add compatibility for hardware to provide a menu to access the original data, providing playlists and thumbnails to navigate the data on your TV screen.

I came across one of these discs while processing a collection of CD-R discs donated to our library. Normally I would copy the images and other data off the disc to our preservation system, but this disc made me stop to think about the best way to preserve the data. Is a disc image appropriate or is the HighMAT folder even worth preserving if we have the original files from the disc? Finding hardware or a software player to present the disc as intended is getting harder to do. I am curious what others think of the value of this content.

I chose not to submit any signatures to PRONOM for the moment as we assess. It would be difficult to properly identify each format with all of them having the same extension, especially the JPG thumbnails as HMT is not a valid extension for the format. Take a look at my sample files and if you have come across this format before, let me know.

Apple Package Format

Let’s talk about Apple’s iWork software. Apple’s Office Suite of applications was first released in 2005 and provided a WordProcessor (Pages), Presentations (Keynote), and a little later, Spreadsheet (Numbers). They are exclusive to the Macintosh and iOS devices.

iWork was released in a few different versions. They get a little confusing as each application has its own version which all seemed to unify and stabilize in 2020. Here is a matrix of major versions.

VersionPackage or ZIP
iWork ’05Package
iWork ’06Package
iWork ’08Package
iWork ’09ZIP
iWork 2013Package
iWork 2014ZIP
iWork 2019ZIP
iWork 2020ZIP

You may already be aware but MacOS can sometimes be weird. I use the term weird in a loving, sometimes proud way, but I admit, there was some “odd” choices made in regards to how applications and documents are used and stored files on a Mac.

On early Macintosh computers Apple used an interesting method of storing resources for applications and some file formats. The Resource Fork for an application contained all the “resources” needed to run in the operating system. It would contain all the icons, warning screens, graphics, sounds, etc. This help true until Mac OS X came along and then Apple started using a bundle or package format. Still in use today, what appears to be a single file or application is actually a folder of all the resources needed to run the application.

Show Package Contents

By right clicking or control clicking on the icon you can open the folder and see all the contents which make up the Application.

Directory listing of Pages.app on MacOS

Nifty right? The MacOS which knows which extensions to treat as a package. If you were to move the application over to another system it would be a folder with the extension “.app”.

For an application I can see how this makes sense as it will only execute in the MacOS environment. The problem comes in when you use the same package method for the documents the application creates.

Contents of Pages version 1 sample file.

So instead of a single “file” with a bytestream, you get a folder of files which make up the file format. Here is Apple’s description:

Document Packages

If your document file formats are getting too complex to manage because of several disparate types of data, you might consider adopting a package format for your documents. Document packages give the illusion of a single document to users but provide you with flexibility in how you store the document data internally. Especially if you use several different types of standard data formats, such as JPEG, GIF, or XML, document packages make accessing and managing that data much easier.

Apple actually defines two similar methods:

Although bundles and packages are sometimes referred to interchangeably, they actually represent very distinct concepts:

  • package is any directory that the Finder presents to the user as if it were a single file.
  • bundle is a directory with a standardized hierarchical structure that holds executable code and the resources used by that code.

A couple years ago a processed digital collection made its way down to me. It had been processed by a new digital archivist and when I went to prepare the collection for preservation, I found a folder with the extension .pages and inside the folder a whole directory of files. Many of which they had renamed and arranged. Needless to say, I had to track down the original disk so I could properly preserve the file.

So looking back at the earlier table, iWork switched back and forth between the package format and a ZIP container. For preservation purposes, the ZIP container is easier to maintain outside the MacOS. Lets look a little closer at each. If you would like to follow along I have copied a few samples onto a hybrid ISO.

iWork ’05 through iWork ’08 used the same package format and structure. Because they are a package format, they are difficult to preserve as original files. I suppose you could zip them up, but probably the best option is to open with a current version of Pages and save to the newer ZIP container format.

tree iWork08/Keynote-06.key 
├── Contents
│   └── PkgInfo
├── QuickLook
│   └── Thumbnail.jpg
├── index.apxl.gz
└── theme-files
    ├── Blue 2.jpg
    ├── Blue 2.tif
    ├── Cool Gray-2.jpg
    ├── Cool Gray.tif
    ├── Green-8.jpg
    ├── Green.tif
    ├── Headlines_bullet.pdf
    ├── Headlines_star.pdf
    ├── Orange 2.tif
    ├── Orange_2.jpg
    ├── Purple-6.jpg
    ├── Purple.tif
    ├── Red.jpg
    ├── Red.tif
    ├── endpoints.pdf
    └── headlines_hi-res.jpg

iWork ’09 changed this practice. The documents saved from Pages, Keynote, and Numbers were contained in a ZIP file and can be identified using the PRONOM registry container signatures.

filename : 'iWork 2013/Pages2013-Sample09.pages'
filesize : 105900
modified : 2019-11-21T20:36:00-07:00
matches  :
  - ns      : 'pronom'
    id      : 'fmt/1439'
    format  : 'Apple iWork Pages'
    version : '09'
    class   : 'Word Processor'
    basis   : 'extension match pages; container name index.xml with byte match at 195, 76' 
Sample09.pages
Type = zip
WARNINGS:
Headers Error
Physical Size = 105900

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:36:00 .....       364773        22413  index.xml
2019-11-21 20:36:00 .....         7007         7007  Hardcover_bullet_black.png
2019-11-21 20:36:00 .....        69176        69176  Simple_Noise_2x.jpg
2019-11-21 20:36:00 .....          232          232  buildVersionHistory.plist
2019-11-21 20:36:00 .....         6406         6406  QuickLook/Thumbnail.png
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:36:00             447594       105234  5 files

Then Apple went back to a Package format with iWork 2013. For reasons unknown. But the content and structure changed. Its a package format with a Index.zip instead of index.xml

Pages2013-Sample.pages
├── Data
│   └── Hardcover_bullet_black-13.png
├── Index.zip
├── Metadata
│   ├── BuildVersionHistory.plist
│   ├── DocumentIdentifier
│   └── Properties.plist
├── preview-micro.jpg
├── preview-web.jpg
└── preview.jpg

3 directories, 8 files

The ZIP within the package contains a new Apple format. IWA

Pages2013-Sample.pages/Index.zip
Type = zip
Physical Size = 39361

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:47:14 .....         3860         3860  Index/Document.iwa
2019-11-21 20:47:14 .....           26           26  Index/Tables/DataList.iwa
2019-11-21 20:47:14 .....          336          336  Index/ViewState.iwa
2019-11-21 20:47:14 .....          160          160  Index/CalculationEngine.iwa
2019-11-21 20:47:14 .....          121          121  Index/DocumentStylesheet.iwa
2019-11-21 20:47:14 .....        31931        31931  Index/ThemeStylesheet.iwa
2019-11-21 20:47:14 .....           22           22  Index/AnnotationAuthorStorage.iwa
2019-11-21 20:47:14 .....         1889         1889  Index/Metadata.iwa
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:47:14              38345        38345  8 files

Luckily Apple came to their senses and went back to the ZIP container format for iWork 2014 and later. The container signature looks for the IWA file Apple started using with iWork 2013.

filename : 'iWork 2014/Pages2014-Sample.pages'
filesize : 66256
modified : 2019-11-22T00:03:56-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/1441'
    format  : 'Apple iWork Document'
    version : '14'
    class   : 'Presentation, Spreadsheet, Word Processor'
    basis   : 'extension match pages; container name Index/Document.iwa with byte match at 16, 6; name Metadata/Properties.plist with name only'
Path = iWork 2014/Pages2014-Sample.pages
Type = zip
Physical Size = 66256

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-22 00:03:54 .....         3930         3930  Index/Document.iwa
2019-11-22 00:03:54 .....          364          364  Index/ViewState.iwa
2019-11-22 00:03:54 .....          206          206  Index/CalculationEngine.iwa
2019-11-22 00:03:54 .....        33573        33573  Index/DocumentStylesheet.iwa
2019-11-22 00:03:54 .....           22           22  Index/AnnotationAuthorStorage.iwa
2019-11-22 00:03:54 .....           23           23  Index/DocumentMetadata.iwa
2019-11-22 00:03:54 .....         8761         8761  Index/Metadata.iwa
2019-11-22 00:03:54 .....          322          322  Metadata/Properties.plist
2019-11-22 00:03:54 .....           36           36  Metadata/DocumentIdentifier
2019-11-22 00:03:54 .....          273          273  Metadata/BuildVersionHistory.plist
2019-11-22 00:03:54 .....        14611        14611  preview.jpg
2019-11-22 00:03:54 .....          838          838  preview-micro.jpg
2019-11-22 00:03:54 .....         1571         1571  preview-web.jpg
------------------- ----- ------------ ------------  ------------------------
2019-11-22 00:03:54              64530        64530  13 files

Now iWork was not the only Apple software to use the Package/Bundle format for their documents. Be advised the following software may save to the package format.

I remember a few years ago, Trent Reznor (NIN) decided to release a few of his tracks in the Garageband format. A little harder to find these days, but the good old wayback machine kept a copy for us! Grab them here. Be warned, they may be in the package format. Thanks Apple!