If you haven’t been over to see the posters made by Ange Albertini, head over now. Below is his poster on the JPG image file format. This is the basic JFIF file format, which stands for JPEG File Interchange Format. There are also raw JPEG streams and Exif, the Exchangeable Image File Format.
The basic format is pretty straightforward. There is a start of image marker (FFD8), some format information, then the compressed raster data, then an end of image marker (FFD9). Identification of a JPEG file should therefore be simple: knowing the start and end marker values, and then the type of JPEG based on the application data, can be very specific. That is, until some software engineers start playing fast and loose with the format specifications.
A while back I received a JPG file which didn’t identify using the latest PRONOM signature. It’s happened before: some new phones came out and started using a newer version of the Exif specification, so I submitted an update to PRONOM for JPGs using Exif 2.3 and greater. I may also need to submit another signature soon for the newly released Exif 3.0 specification! But this JPG I received wasn’t a new version; it should have been identified with the current PRONOM signature. It started with FFD8, but when I went to look at the end of the file for the end of image marker FFD9, it wasn’t where I expected it to be.
This JPG file had an additional 9632 bytes after the FFD9 end of image marker. But why? The image rendered just fine in multiple JPG viewers. The only warning from Exiftool was for “Unrecognized MakerNotes”, which is not too uncommon. So I went to the JPG Exif specification.
EOI, Recording this marker is mandatory. It shall be recorded in this position.
But reading a little further we see…
Moreover, Exif/DCF readers should be implemented to operate without interruption even if certain kinds of data have been recorded after EOI of the primary image defined in the Exif standard. Specifically, unknown data after EOI of the primary image should be skipped. (see section 4.7.1)
So the extra data is allowed by the specification, and any reader should ignore or skip data after the EOI (End of Image). Well, that makes identification more difficult. All the PRONOM signatures are based on having the EOI marker at the “End”. Some have an allowance for padding, but not enough for the worst offenders…
The image referenced above was created on a Huawei MHA-L29 camera phone. Since finding this image, I have also found that many Samsung phones do the same thing. Here is one from a Samsung SM-G975U1: much less padding, but enough to throw off identification.
Apple iPhones are not exempt from this “feature” either. When using the macOS Image Capture tool with the HEIC format, a bug can add an excessive amount of empty data at the end of the converted JPG file.
So, when it comes to identification, if your JPG files don’t seem to identify correctly, look closer at the end of the file; it may have some “extra” data.
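If you want to check your own files, here is a minimal sketch of that kind of appraisal, based only on the FFD8/FFD9 markers described above. It does not parse the JPEG stream; it simply reports how many bytes follow the last EOI marker.

# Minimal sketch: report any bytes after the last JPEG End of Image (FFD9) marker.
# This does not parse the JPEG stream; it only looks for the marker bytes,
# which is enough for a quick appraisal of suspicious files.
import sys

def trailing_bytes(path):
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(b"\xff\xd8"):
        return None                      # no SOI marker, not a JPEG
    eoi = data.rfind(b"\xff\xd9")
    if eoi == -1:
        return None                      # no EOI marker found at all
    return len(data) - (eoi + 2)         # bytes remaining after FFD9

if __name__ == "__main__":
    for name in sys.argv[1:]:
        extra = trailing_bytes(name)
        if extra is None:
            print(f"{name}: no SOI/EOI markers found")
        else:
            print(f"{name}: {extra} byte(s) after EOI")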
I am dating myself by using the phrase “What’s the 411?” Back in my day (before the Googles), if you wanted quick information you could pick up the “land line”, a corded phone in your home which could only make phone calls, and dial 4-1-1; you would be connected to an operator who could help you locate businesses, tell you the time, answer simple questions, and was infinitely smarter than Alexa.
Around the same time I was using 4-1-1 to answer all my questions, digital cameras were just coming on the scene. One of those was the Sony Mavica line of digital cameras. They were unique in that they used a floppy disk as the storage media, and they had a small LCD screen for capture and playback of the captured images. In order to quickly preview the images captured on disk, the camera generates a hidden thumbnail file for each image; this file has the extension .411. When I first saw this file after copying a floppy from my Mavica cameras, it reminded me of the old information line. I first assumed it was a metadata file, as the first few Mavica cameras did not use Exif in their files, but it is simply a raster image of a 64×48 pixel thumbnail. Of course Sony did not document this file format and probably hoped no one would notice, as the files are hidden on the FAT12-formatted floppy disk.
One could argue the value of documenting and possibly identifying thumbnail formats, as many in digital preservation have chosen not to keep the Thumbs.db file or other hidden files not meant to be preserved or accessible to the user. I have found that documenting any format encountered during technical appraisal provides value to everyone; an organization may ultimately decide not to keep such formats in their repository, but knowing what they are is vital to that decision. Come listen and chat with me about this topic at iPres 2023!
Usually the first part of documenting a format is looking for specifications online or documented somewhere. Since Sony did not publicly release any specifications for this format, we have to rely on others’ reverse engineering or do it ourselves. There have been a few attempts to document a conversion of the 411 format to a common raster format like BMP: this C code for conversion to BMP, converters to NetPBM formats like PPM, and the Java “Javica” software which makes use of the 411 files. My first step was to see if we could find some common patterns in the many samples I have from my Mavica collection. Running Marco Pontello’s TrIDScan across my 54 samples came up with no common patterns. This was expected, as all the reverse engineering efforts point out the format is probably based on the CCIR.601 specification, the same video encoding that MPEG frames are based on.
With no common patterns among all the samples, creating a PRONOM signature is not possible. In the future, file identification may be based more on dynamic pattern matching instead of the static patterns we look for now. Until then, this may need to be submitted as an extension-only entry. Two things to note: the files created by the camera are all named starting with “MVC”, which could also be used for identification, and every .411 file is exactly 4,608 bytes. The extension .411 is also pretty unique, so I doubt it will clash with any other format for the moment.
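Those reverse-engineering efforts describe the 4,608 bytes as 64×48 pixels of 4:1:1 subsampled YCbCr, six bytes for every four pixels. Here is a minimal decode sketch under that assumption; the byte order within each six-byte group, the BT.601-style conversion coefficients, and the filenames are my own guesses for illustration and may need adjusting against the existing converters.

# Sketch of a .411 thumbnail decoder, assuming 64x48 pixels stored as 4:1:1
# YCbCr, with each six-byte group holding four luma samples followed by one
# Cb and one Cr sample. Writes a plain-text PPM so no libraries are needed.
def decode_411(path, width=64, height=48):
    with open(path, "rb") as f:
        data = f.read()
    pixels = []
    for i in range(0, width * height * 6 // 4, 6):
        y0, y1, y2, y3, u, v = data[i:i + 6]
        cb, cr = u - 128, v - 128
        for y in (y0, y1, y2, y3):
            r = max(0, min(255, round(y + 1.402 * cr)))
            g = max(0, min(255, round(y - 0.344136 * cb - 0.714136 * cr)))
            b = max(0, min(255, round(y + 1.772 * cb)))
            pixels.append((r, g, b))
    return pixels

def write_ppm(pixels, path, width=64, height=48):
    with open(path, "w") as out:
        out.write(f"P3\n{width} {height}\n255\n")
        for r, g, b in pixels:
            out.write(f"{r} {g} {b}\n")

write_ppm(decode_411("MVC001.411"), "MVC001.ppm")  # example filenames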
File extensions are the easiest way to quickly identify a file format, but they can be misleading. This is why, in digital preservation, format identification tools like DROID are important: they look closer at the file structure to more accurately identify formats. The other complication is that some extensions are used for more than one format; extensions like .DOC or .ISO can be used with many formats.
The PRONOM registry which DROID uses will list the extensions associated with each format signature, but some formats have only an extension and no signature. It’s nice to have an official ID to go with a format, but with no signature it can only match on extension.
This caused a problem for me awhile back while working with some files with the extension CDX. According to PRONOM, there are five completely different formats which use that extension, and probably others besides.
My CDX files were related to some indexing software called Cindex. At the time, the only CDX format with a signature was the WARC summary file; another was a CorelDraw Compressed format with no signature. Confusing, right? When I would run format identification on my Cindex files, they would default to the CorelDraw Compressed format, identified by extension alone. It was easy enough to create a signature for the Cindex format as I had enough samples to know the patterns needed for correct identification. But I was curious about the CorelDraw format. Should be easy to find, right?
Wrong. Finding a sample of this format proved very elusive. All I had to go by was the name given to the format by PRONOM and the extension. I scoured every Corel CD and image I could get my hands on. For months I looked and could never find a single CDX file. No version of CorelDraw I was able to run had any ability to save in the CDX format. I scoured clip art discs and other Corel software like Designer, PrintHouse, and Photo-Paint. Nada, nothing. I started to wonder if the format even existed. That’s when I noticed, in the filters included with CorelDraw, a reference to the ability to import a CDX but not write to one.
This led me to a reference on the old Corel FTP site to knowledge base article number 4550.
It mentioned something called ArtShow, whose version 5 supported the CDX file format. ArtShow was a gallery of winning designs released on a CD-ROM and in a book each year. The first was ArtShow 91, then ArtShow 3, 4, 5, 6, and finally 7, which was the last. Each release used a different proprietary compressed format for storing the designs; these formats exist nowhere else. The question remains: why didn’t they use popular Corel formats like CDR, CMX, or CCX, which were used on many other clip art titles?
It took some time, but I was finally able to find copies of a few of the ArtShow CD-ROM discs, specifically numbers 5 and 6, which had the CDX format and the second-generation CPX format.
Each format had an easy-to-recognize header, making a PRONOM signature easy to create. PRONOM already had PUIDs for the two formats, CDX and CPX, so the signatures I sent in were added to the registry and hopefully will help distinguish between all the CDX formats!
A few years ago I became obsessed with creating 3D models from physical objects. There was an app on my iPhone called 123D Catch by Autodesk which allowed you to take a series of photos with your iPhone camera, then combine them to create a 3D model. This led me down a path to eventually take a course on photogrammetry and develop a process for capturing objects in our museum.
Autodesk eventually discontinued the app and built the technology into their paid products. This is when we started seeing lidar introduced in handheld devices. The first setup I tried was my Xbox 360 Kinect sensor with the Skanect software. The quality was horrible, but it was fun to learn about depth sensors and structure from motion. When the iPhone finally came out with a lidar sensor it was like Apple had read my mind. I love having the ability to capture objects I find as 3D models. The quality is pretty good, not as good as taking the time to capture image sets for photogrammetry and using tools like Agisoft Metashape, but apps like Scaniverse do a fantastic job. You can check out some of the models I have captured on my Sketchfab page.
With any new technology comes new file formats, and 3D formats are definitely no exception. It seems every software developer has to come up with their own proprietary format, leaving the digital preservation folks scrambling to keep up. The DPC and Archivematica published a report a couple of years ago which states:
“There are many challenges in preserving 3D data. As well as the complexity of the data itself, there is a lack of interoperability between the different (often proprietary) systems that are used to create and manipulate 3D models. Relationships to other data, software and hardware also need to be captured and managed effectively.”
With my new iPhone in hand I found myself with a new file format I was unfamiliar with. Universal Scene Description (USD) is a framework developed by Pixar for exchanging 3D data between different software. The relationship between Apple and Pixar goes way back, so it was no surprise the Apple iPhone has built-in support for this new format, and I soon found myself capturing and sending 3D models to others with an iPhone. The USDZ format is a ZIP package containing a USD 3D model and is perfect for sharing and preserving.
There are currently no PRONOM signatures for identifying the USD formats, so I wanted to look into creating one. This is where I ran into a problem: the current PRONOM signature syntax has no way of properly identifying the USDZ format. Let me explain.
When DROID or Siegfried is used to identify a container format such as USDZ, it will first identify the format as a ZIP file, which technically it is. This triggers the software to refer to the container signature file to see if any patterns from the files internal to the ZIP match a known format. This is done by pointing to a specific file and a hex pattern or ASCII string within that file. In the case of a USDZ, the internal structure may look like this:
Listing archive: scaniverse-20210928-113055.usdz
--
Path = scaniverse-20210928-113055.usdz
Type = zip
Physical Size = 5702256
Date Time Attr Size Compressed Name
------------------- ----- ------------ ------------ ------------------------
2021-09-28 11:47:36 ..... 297999 297999 scaniverse-20210928-113055.usdc
2021-09-28 11:47:36 ..... 5403849 5403849 0/texgen_0.jpg
------------------- ----- ------------ ------------ ------------------------
2021-09-28 11:47:36 5701848 5701848 2 files
In this sample the internal USDC file has the same name as the USDZ itself. So the name of the USDC is variable, while DROID needs a static name and path to look for patterns. The USDZ specification is clear that the only required file inside a USDZ is a USD model; anything else is ancillary and is not always going to be included. Currently the only format used is USDC, but in the future a plain USD or USDA file may be allowed. In addition, some of my other sample files show a deeply nested USDC file, making identification even more difficult.
Listing archive: Scan.usdz
--
Path = Scan.usdz
Type = zip
Physical Size = 19155195
Characteristics = Minor_Extra_ERROR
Date Time Attr Size Compressed Name
------------------- ----- ------------ ------------ ------------------------
2021-03-09 09:22:36 ..... 19154773 19154773 /private/var/mobile/Containers/Data/Application/EFD09E66-32FB-4B08-8BED-B7E3D78FE1A8/tmp/Scan.usdc
------------------- ----- ------------ ------------ ------------------------
2021-03-09 09:22:36 19154773 19154773 1 files
The USDZ format is not the only file format which makes identification difficult through variable names and non-static patterns. An issue has been raised on GitHub to address this problem. One potential fix is to use glob patterns, as suggested by the amazing Richard Lehane, creator of Siegfried. This way we could use a wildcard to ignore the variable names and find any file with an extension of .usdc, for example. The USDC file format has a nice 8-byte header, “PXR-USDC”, which is perfectly suited for identification, so a glob-based container signature could simply look for that header in any .usdc entry.
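As a rough illustration of the logic such a signature would encode, here is a minimal sketch in Python; this is not DROID or Siegfried signature syntax, just the equivalent check, run against one of my samples.

# Minimal sketch of the glob-style container check: treat the file as a ZIP,
# look for any entry ending in .usdc regardless of its path or name, and
# confirm the entry starts with the 8-byte "PXR-USDC" magic.
import fnmatch
import zipfile

def looks_like_usdz(path):
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as z:
        for name in z.namelist():
            if fnmatch.fnmatch(name.lower(), "*.usdc"):
                with z.open(name) as entry:
                    if entry.read(8) == b"PXR-USDC":
                        return True
    return False

print(looks_like_usdz("scaniverse-20210928-113055.usdz"))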
Update: I was able to get a beta version of Siegfried working with my test signature.
siegfried : 1.11.0
scandate : 2023-06-02T08:54:27-06:00
signature : default.sig
created : 2023-06-02T08:52:33-06:00
identifiers :
- name : 'pronom'
details : 'DROID_SignatureFile_V112.xml; container-signature-20230510.xml; extensions: usdz-signature-file-v1.xml; container extensions: usdz-dev1-signaturefile-20230601.xml'
---
filename : 'scaniverse-20210928-113055.usdz'
filesize : 5702256
modified : 2021-09-28T11:47:37-06:00
errors :
matches :
- ns : 'pronom'
id : 'BYUdev/1'
format : 'USDZ 3D Package'
version :
mime : 'model/vnd.usdz+zip'
class :
basis : 'extension match usdz; container name scaniverse-20210928-113055.usdc with byte match at 0, 8 (signature 1/2)'
warning :
I am still testing beta versions of Siegfried in hopes of getting the glob matching fully working, but there is more to do. Stay tuned!
Digital preservation is all about identifying risks. This is done through a process which includes identification, validation, and metadata extraction. The more you know about the digital data you need to preserve, the more you can do to minimize those risks, with the goal of keeping the data accessible over time.
Many formats are pretty straightforward: they are identifiable through a header and then have some binary bits or plain text readable by certain software. Others are more complicated. A common practice for more complex needs is to use a container. Word processing programs started out with plain text, with maybe some formatting codes mixed in; then many moved to the Microsoft OLE container so additional content could be embedded in a single file. Today file formats such as DOCX use a ZIP container, which houses all the text, images, formatting, and anything else the format supports. Knowing what the format is, and knowing what it may contain, is important to preservation.
I collect older digital cameras, specifically cameras with unique file formats, raw and otherwise. When I picked up an HP (Hewlett-Packard) point-and-shoot camera awhile back, I was initially unimpressed, as it could only capture in JPEG format with three quality settings. While looking at a copy of the manual, I saw the camera was capable of capturing audio clips or voice memos for each photo taken. This can be handy when taking many photos and needing a reminder about the context. This was not unique to HP, as many cameras could do this; normally a JPG was captured and the audio file would have the same name, connecting the two. But when I recorded some audio on my little HP and placed the SD card in my computer, I couldn’t find the additional audio file. I am also not the only one to ask about this.
There are many types of JPG files: raw streams, JPEG File Interchange Format (JFIF), and Exchangeable Image File Format (Exif). Normally these formats have raster image data sprinkled with metadata. I have seen JPEG files embedded into other formats and containers, such as MP3, PDF, etc., but JPEGs are not container formats. Or so I thought…
Let’s take a look at an image I took with my HP Photosmart 433. We’ll start with identification:
siegfried : 1.10.1
scandate : 2023-05-25T12:27:04-06:00
signature : default.sig
created : 2023-05-22T08:43:02-06:00
identifiers :
- name : 'pronom'
details : 'DROID_SignatureFile_V112.xml; container-signature-20230510.xml'
---
filename : 'GitHub/digicam_corpus/HP/Photosmart 433/IM000959.JPG'
filesize : 178922
modified : 2023-05-25T11:23:32-06:00
errors :
matches :
- ns : 'pronom'
id : 'x-fmt/391'
format : 'Exchangeable Image File Format (Compressed)'
version : '2.2'
mime : 'image/jpeg'
class : 'Image (Raster)'
basis : 'extension match jpg; byte match at [[0 16] [366 12] [178907 2]] (signature 2/2)'
warning :
IM000959.JPG was identified as x-fmt/391, which is Exchangeable Image File Format (Compressed), version 2.2. Pretty straightforward. Next let’s look at validation:
I removed a few lines to show the important parts, but we get similar information about the format: a JPEG with Exif version 2.2. We also learn that HP improperly ordered their tags and put tag 41492 out of sequence, but we can ignore that for now. Looking closely at the output does not give us any indication of audio data. There is a clue, though, in the mention of a FlashPix version and additional application segments.
Since this is an image with Exif data, let’s also take a look at the output of ExifTool.
ExifTool Version Number : 12.62
File Name : IM000959.JPG
Directory : .
File Size : 179 kB
File Modification Date/Time : 2023:05:25 11:23:32-06:00
File Access Date/Time : 2023:05:25 11:24:42-06:00
File Inode Change Date/Time : 2023:05:25 11:24:39-06:00
File Permissions : -rwxr-xr-x
File Type : JPEG
File Type Extension : jpg
MIME Type : image/jpeg
Exif Byte Order : Little-endian (Intel, II)
Image Description : IM000959.JPG
Make : Hewlett-Packard
Camera Model Name : hp PhotoSmart 43x series
Orientation : Horizontal (normal)
X Resolution : 72
Y Resolution : 72
Resolution Unit : inches
Software : 1.400
Modify Date : 2021:11:16 09:04:04
Y Cb Cr Positioning : Co-sited
Copyright : Copyright 2002-2003
Exposure Time : 1/29
F Number : 4.0
ISO : 100
Exif Version : 0220
Date/Time Original : 2021:11:16 09:04:04
Create Date : 2021:11:16 09:04:04
Components Configuration : Y, Cb, Cr, -
Compressed Bits Per Pixel : 1.567552083
Shutter Speed Value : 1/30
Aperture Value : 4.0
Exposure Compensation : 0
Max Aperture Value : 4.0
Subject Distance : 1 m
Metering Mode : Average
Light Source : Unknown
Flash : Auto, Did not fire
Focal Length : 5.7 mm
Warning : [minor] Unrecognized MakerNotes
Flashpix Version : 0100
Color Space : sRGB
Exif Image Width : 640
Exif Image Height : 480
Interoperability Index : R98 - DCF basic file (sRGB)
Interoperability Version : 0100
Digital Zoom Ratio : 1
Subject Location : 0
Compression : JPEG (old-style)
Thumbnail Offset : 2046
Thumbnail Length : 7112
Code Page : Unicode UTF-16, little endian
Used Extension Numbers : 1, 31
Extension Name : Audio
Extension Class ID : 10000100-6FC0-11D0-BD01-00609719A180
Extension Persistence : Always Valid
Audio Stream : (Binary data 117820 bytes, use -b option to extract)
Image Width : 640
Image Height : 480
Encoding Process : Baseline DCT, Huffman coding
Bits Per Sample : 8
Color Components : 3
Y Cb Cr Sub Sampling : YCbCr4:2:2 (2 1)
Aperture : 4.0
Image Size : 640x480
Megapixels : 0.307
Shutter Speed : 1/29
Thumbnail Image : (Binary data 7112 bytes, use -b option to extract)
Focal Length : 5.7 mm
Light Value : 8.9
Ohh, what do we have here? ExifTool mentions an audio stream. An audio stream inside the JPEG? How is this possible? The FlashPix format was originally developed by Kodak in collaboration with HP, and it was later added to the Exif specifications. Below is a screenshot from the Exif version 2.2 spec.
ExifTool mentioned FlashPix and additional APP2 segments. Let’s take a look at the raw file in a hex editor.
Ahhh… In one of the APP2 segments we can see something familiar: a RIFF WAVE header! Let’s see if we can extract the WAVE file.
exiftool -b -AudioStream IM000959.JPG > IM000959.WAV
mediainfo IM000959.WAV
General
Complete name : IM000959.WAV
Format : Wave
Format settings : WaveFormatEx
File size : 115 KiB
Duration : 10 s 681 ms
Overall bit rate mode : Constant
Overall bit rate : 88.2 kb/s
Audio
Format : ADPCM
Codec ID : 11
Codec ID/Hint : Intel
Duration : 10 s 681 ms
Bit rate mode : Constant
Bit rate : 88.2 kb/s
Channel(s) : 1 channel
Sampling rate : 22.05 kHz
Bit depth : 4 bits
Stream size : 115 KiB (100%)
MediaInfo can give us details on the embedded WAVE file, which is pretty terrible quality, but it is an ADPCM audio stream.
Embedded audio inside a raster image is not common. Most software which can render a JPEG image will most likely ignore the embedded WAVE and not even give a warning that it exists. IM000959.JPG opens fine in Adobe Photoshop, but saving to a new format or making any edits will delete the WAVE file. ImageMagick will also remove the WAVE during any editing, with no warning.
In order to ensure the embedded audio stream is preserved, we first need to know it is there. This is where tools like ExifTool can be used to extract metadata from the file, so the image can be flagged as having an audio stream and handled differently than any other JPEG file. More work is needed: ExifTool may mention an Audio Stream, but it currently does not have the ability to pull any data from the stream.
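For a preservation workflow, one option is a quick pre-check that flags JPEGs carrying embedded audio before they are normalized. Here is a minimal sketch; it only detects a RIFF/WAVE signature inside an APP2 segment, and the actual extraction is still better left to exiftool -b -AudioStream as shown above.

# Minimal sketch: walk the JPEG marker segments and flag any APP2 segment
# whose payload contains a RIFF/WAVE header, as a hint that an audio clip is
# embedded. The audio itself can span several APP2 segments, so this only
# detects it; reassembly/extraction is left to a tool like ExifTool.
import struct
import sys

def app2_has_wave(path):
    with open(path, "rb") as f:
        data = f.read()
    pos = 2                                    # skip the SOI marker (FFD8)
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker == 0xDA:                     # start of scan, no more headers
            break
        length = struct.unpack(">H", data[pos + 2:pos + 4])[0]
        segment = data[pos + 4:pos + 2 + length]
        if marker == 0xE2 and b"RIFF" in segment and b"WAVE" in segment:
            return True
        pos += 2 + length
    return False

print(app2_has_wave(sys.argv[1]))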
There are some file formats out there which are confusing. One such file came across my desk awhile back. This file was not identifiable with any tools I threw at it. At first I believed it to be a TIFF file variant.
You can see the TIFF header, but the file would not open as one, even when the extension was changed from PSC to TIF. The other hint was the phrase “3M Printscape”. I had never heard of it, and there wasn’t much information available about it. It seems it was a creative product made by 3M in the early 2000s: you could buy a package of printable cards, gift bags, etc. The problem was, there was no software to be found. I searched the Internet Archive, the Wayback Machine, and many other abandoned software sites. For months I searched; it wasn’t until a year later that I came across one of the creative packages at a thrift store. I was thrilled. That is, until I got the software installed.
After I installed the software in a virtual machine running Windows 98, I tried to open the PSC file, but the software was looking for files with the extension STD, which is an unfortunate acronym. It turns out it stands for SureThing Document; SureThing is a software company which develops label software. After many months of searching I thought I had found the software to render my file, but it was not meant to be.
Many months later I decided to do some more searches. That is when another copy of 3M Printscape showed up in the Internet Archive. 3M Printscape 2.0! It appears 3M decided to design their own software for version 2.0.
The preservation value of the above image is not lost on me. What took me over a year to figure out ended up being a simple pixelated image of a cardinal. It’s the journey, not the destination, right?
From this little adventure I was able to submit two file formats to PRONOM: fmt/1275 and fmt/1276. I also documented the formats and linked to the software on the File Format Wiki. 3M Printscape version 2 was also released for the Macintosh, so the signature had to account for endianness, just like a TIFF file would. With the string “3M Printscape” in the header, it made for an easy signature.
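As a rough sketch of that kind of check, assuming the string sits somewhere near the start of the file (the offsets and filename here are guesses for illustration; the actual PRONOM signatures may differ):

# Rough sketch: confirm a TIFF-style byte-order marker ("II" or "MM") and look
# for the "3M Printscape" string within the first kilobyte. The offsets used
# by the real PRONOM signatures may differ from this guess.
def looks_like_printscape(path):
    with open(path, "rb") as f:
        header = f.read(1024)
    if header[:2] not in (b"II", b"MM"):
        return False
    return b"3M Printscape" in header

print(looks_like_printscape("card.psc"))  # example filename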
Hopefully, I will be the last to spend this much time on an image of a bird.
During the 1980s and 90s, there was an explosion of software created for the PC and Macintosh. When it came to graphic design, Aldus, Adobe, Quark, Serif, and a few others were clearly the best. That didn’t stop other software developers from trying their hand at publishing design software, and if you were on a budget, there were plenty of options to choose from. One of them, Timeworks Publisher, was very popular. It was released in 1987 for the IBM PC and Atari, with later releases for the Apple II and Macintosh. The name was later changed to Pressworks. It was published by an interesting software company out of the UK called GST Software, also operating under the GSP name. They really enjoyed licensing their software.
Desktop Publishing software
Timeworks Publisher may have been the first, but it was definitely not the last. Pressworks was very popular, so the software was sold and rebranded for many companies. In 2001 GST merged with eGames Europe to form a new company, Greenstreet Software, which continued to support the software. Some of the rebranded titles are:
FUJI Publisher
Global Software Publishing (in Europe) Pressworks, Power Publisher
GST Pressworks
1st Press
IMSI TurboPublisher
Media Graphics Publishers Paradise Page Express
MicroVision Vision Publisher 4
NEBS PageMagic
PersonalSoft Publications (Français)
Pushbutton Publish
Softkey Publisher DOS
Sybex Page (Deutsch)
Timeworks Publisher, Publish-it, Publisher Lite, Publish-it Lite
VCI Pro Publisher
Wizardworks CompuWorks Publisher
Instant Home Publisher
Greenstreet Publisher
Canon Publishing Suite
All of the software listed above could open and save to the same file format with the extension .DTP, with full compatibility; TPL was used for templates. Originally the DTP file format was a single proprietary binary format which had an ASCII header of “DTPI”, and all samples seemed to end with the ASCII string “EODF”. Later the software was enhanced to be OLE compatible and the binary format was wrapped inside an OLE container. This made it work well for moving objects in and out of other OLE-compatible software like Word, but it is confusing to format identification tools, as the header is the same as a Word file. I have added the two versions of the DTP format to PRONOM to help identify them better: they are fmt/1415 and fmt/1416.
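For the early, pre-OLE variant, the two markers described above are enough for a quick check. A minimal sketch (the OLE-wrapped variant is covered below with the ART format, and the filename is just an example):

# Quick sketch for the early DTP variant: "DTPI" as an ASCII header at the
# start of the file, and the ASCII string "EODF" observed at the very end of
# the samples I have seen.
def is_raw_gst_dtp(path):
    with open(path, "rb") as f:
        data = f.read()
    return data.startswith(b"DTPI") and data.endswith(b"EODF")

print(is_raw_gst_dtp("newsletter.dtp"))  # example filename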
Drawing Software
In addition to the popular Desktop Publishing software, there was a companion Drawing software licensed as well. It also had many titles:
BHV COLOURDRAW!
FUJI Designer
Global Software Publishing (in Europe) Designworks, Power Publisher
GST (in North America) PressworksDraw
1st Design
IMSI TurboDraw
Media Graphics Publishers Paradise Design Studio
MicroVision Vision Draw
NEBS DesignMagic
PersonalSoft Création Graphique
Pushbutton Design
VCI Pro Design
Wizardworks CompuWorks Designer, CompuWorks Draw
Canon Publishing Suite
The Draw/Design software all used the same file format as well, with the extension .ART, and with full compatibility between all the titles. The TEM extension was used for templates. This is not to be confused with the AOL image format, the Asymetrix Compel image format, or a number of other formats using the ART extension. This format also began as a single proprietary binary format, with the ASCII header “GST:ART” starting at offset 16. Just like the DTP format, it was later wrapped in an OLE container to be more compatible. In fact, a DTP file may have embedded ART objects! This format is not in PRONOM, so let’s take a closer look.
You can see in the 1stdgn.art file here the ASCII “GST:ART” string starting at byte 16. This is consistent with all the samples I have. The first 16 bytes seem to vary in each sample and probably relate to the size of the file and the dimensions of the artwork. “GST:ART” is unique enough and should work well for a signature.
The ART file from a later version of Draw is in the OLE file format. This container format was designed by Microsoft as a universal container to increase compatibility among software. You can see from the hex view above that the file looks very similar to the DOC format used by Word. There were many software titles which used this container format, many of them documented here. One of the easiest ways to look inside an OLE container is to use 7-Zip. A quick listing of the file shows it is Type = Compound and includes three files. The SummaryInformation stream is common among many OLE formats and can contain some metadata, but the Contents stream is what we are looking for. Examining the Contents stream, we find it looks identical to the earlier version of the ART format: the same “GST:ART” string starting at byte 16.
A note about the Preview.dib file. It appears to be a Device-Independent Bitmap, similar to a Bitmap file, probably for a thumbnail preview.
Writing a signature for an OLE container format is a bit trickier. It requires a separate signature file to go along with the regular signature XML. Basically, DROID is set up to “trigger” once it discovers either a ZIP file or an OLE container. If it detects one of those formats, it then looks in the container signature XML for additional patterns. If it finds a match it identifies the format; if not, it reports back a generic ZIP or OLE identification.
As it turns out, there were two variants of the OLE files: one used “Contents” for the internal stream and another used “CONTENTS”. Since container signatures are case sensitive, two signatures are required, both mapped to the same PUID.
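For anyone wanting to replicate the check outside of DROID, here is a minimal sketch mirroring the logic above, using the third-party olefile Python library and one of my sample filenames; this is not the container signature itself.

# Sketch: an early ART file has "GST:ART" at byte 16; a later ART file is an
# OLE container whose "Contents" (or "CONTENTS") stream carries the same
# string at byte 16.
import olefile  # third-party: pip install olefile

def identify_gst_art(path):
    with open(path, "rb") as f:
        head = f.read(23)
    if head[16:23] == b"GST:ART":
        return "GST ART (early, raw binary)"
    if olefile.isOleFile(path):
        ole = olefile.OleFileIO(path)
        try:
            for stream in ("Contents", "CONTENTS"):   # both cases are out there
                if ole.exists(stream):
                    data = ole.openstream(stream).read(23)
                    if data[16:23] == b"GST:ART":
                        return "GST ART (OLE wrapped)"
        finally:
            ole.close()
    return None

print(identify_gst_art("1stdgn.art"))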
These two formats were used with quite a few software titles. Hopefully these signatures cover most of them! You can find a couple of samples and my signatures on my GitHub.
One of the earliest hypermedia systems, predating the World Wide Web, was HyperCard on the Macintosh. Within minutes you could have a small application to do just about anything: a calendar, an address book, interactive books, games, etc. The Internet Archive has collected many HyperCard stacks and emulates them directly in the browser.
Riding on the success of HyperCard was another hypermedia tool called MetaCard, which later became Runtime Revolution. Today it is known as LiveCode, a cross-platform application development system. LiveCode is often used to quickly create applications which can run on many platforms, including iOS, and it is popular with students and in higher education. The LiveCode source was opened for a time, prompted by a successful Kickstarter campaign, but was closed again in 2021 as the company struggled to keep paying customers.
Each major version of LiveCode produced its own unique file format. Currently none of them can be identified using preservation tools. Luckily, because the code was open source for a time, we have details which help us identify the formats. Let’s take a look:
I took LiveCode up on their 10-day trial and was able to install version 9.6.9 of the software to save some samples. The software has a “Save as” option which allows you to save your code to older versions, although one must be careful, as saving to older versions may cause some data loss.
The samples I was able to save had headers matching those in the source code. The REVO string starts right at the beginning of the file, making identification easy. Take a look at my GitHub page for samples and a signature, and check out the File Format Wiki page for more information and more samples!
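A quick check based only on that header; my assumption is that the bytes following “REVO” carry version information, which a proper signature would pin down per major release (the filename is just an example).

# Quick check based on the header described above: a LiveCode stack starts
# with the ASCII string "REVO". The bytes that follow appear to carry the
# version and would be pinned down per release in a proper signature.
def is_livecode_stack(path):
    with open(path, "rb") as f:
        header = f.read(8)
    return header.startswith(b"REVO"), header[4:8]

matched, version_bytes = is_livecode_stack("sample.livecode")  # example filename
print(matched, version_bytes)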
Awhile back I was asked to look at a file in our repository which had the extension OMF. It was not identified by DROID and didn’t appear to be in PRONOM. It didn’t take long to find quite a bit of information on the file format, as it was used by many important software titles, or at least it used to be. Exploring the details of this file format led me down quite the rabbit hole. You see, the OMFI format is based on a container format that was once heralded as a better, more open choice than the increasingly popular Microsoft OLE container format.
OpenDoc
This all started with OpenDoc, a multi-platform approach to an open document format begun by Apple Computer in the early 1990s. It was originally an alliance between Apple, IBM, and Motorola. The idea was to have a framework any developer could use to build software or components that would all work seamlessly together. Many developers were on board initially, with many promised software titles in development, but with much confusion surrounding the framework and Steve Jobs’ return to Apple in 1997, the project was ultimately scrapped.
Bento
The storage format to be used with OpenDoc was called Bento, in reference to the Japanese style of a compartmentalized container tray. Specifications were released in 1993.
There are four key ideas in the Bento format:
everything in the container is an object,
objects have persistent IDs,
all the metadata lives in the TOC (Table of Contents),
objects consist entirely of values, and each value knows its own property, type, and data location.
The idea of a data model with such an organized structure was so appealing that the digital preservation community was excited to push for a Universal Preservation Format, specifically for multimedia, based on Bento. The idea was presented to AMIA in 1996!
Open Media Framework (OMF) Interchange
Avid Technology, a leader in audio/video editing systems, used the Bento specification to design a container format for multimedia. This allowed easy interchange of projects between many different software titles. The original specification was published in 1994, and the 2.1 specification was released in 1997. Software titles such as Pro Tools, Cubase, Adobe Audition, Adobe Premiere, Apple Logic Pro, Apple Final Cut Pro, and many others supported the OMF format, at least for a while. In the late 1990s OMFI was migrated to Microsoft’s Structured Storage container format to form the core of the Advanced Authoring Format (AAF).
Identification
In order to identify an OMF file we first need to understand what is part of the OMF specifications and what is part of the Bento format. OpenDoc may not have lived very long but the Bento format held on long enough to be the structure used by a few different file formats. I am aware of the following, but there was other software being developed at the time.
Samples from each of these formats show some similar patterns. In the Bento specifications we can see:
The only version of the specification I can find is version 1.0d5, released in 1993, but we know there was also a version 2 released later. The magic bytes are not defined in the 1.0d5 spec, but looking at the code in the OpenDoc Developer Release from 1996, we can find a reference to the magic bytes in “Containr.h”.
The Bento specification also says of this header information: “Our solution to this is to define the standard Bento format to have the label at the end of the container.” This means the byte sequence will frequently be found at the end of the file. The “CM” refers to “Container Manager” and “Hdr” to “Header”.
Now that we have the magic bytes for the Bento container, we can look at what makes an OMF file unique from the others. We can find the answer in the OMF specifications.
We know that every Bento container must have an object, and in version 1.0 of the specification, on page 65, we find:
Each object must have the property OMFI:ObjID. The value of OMFI:ObjID is required and is listed in the property description for each object.
The OMFI:ObjID can also be found in version 2.0 of the specification, but in addition it defines:
The OMFI:ObjID property has been renamed the OMFI:OOBJ:ObjClass property, which eliminates the concept of generic properties and makes the class model easier to understand. The name ObjClass is more descriptive because the property identifies the class of the object rather than containing an ID number for the object.
Since both are required, it seems appropriate to use those strings for identification in a PRONOM signature. You can check out the proposed signature and samples on my GitHub page.
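Here is a rough sketch of that identification logic. The ASCII fragments of the Bento label and the OMFI property names come straight from the text above, but the exact surrounding magic bytes, the label length, and the filename are left out or guessed, so treat it as an approximation of what the real signature encodes.

# Rough sketch of the identification logic described above. The Bento label
# sits at the end of the container and includes the ASCII fragments "CM" and
# "Hdr" (the non-ASCII magic bytes around them are omitted here). An OMF file
# additionally contains the required OMFI:ObjID (version 1) or
# OMFI:OOBJ:ObjClass (version 2) property names.
def looks_like_omf(path):
    with open(path, "rb") as f:
        data = f.read()
    label = data[-64:]                     # Bento label is at the end of file
    has_bento_label = b"CM" in label and b"Hdr" in label
    if not has_bento_label:
        return None
    if b"OMFI:OOBJ:ObjClass" in data:
        return "OMF version 2"
    if b"OMFI:ObjID" in data:
        return "OMF version 1"
    return None

print(looks_like_omf("project.omf"))  # example filename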
There is so much history wrapped up in these formats and the potential they had to change how we preserve files in our archives. Luckily we have the Internet Archive Wayback Machine to help us discover or remember ideas that once existed, some of which may find their way back to inspire future file formats.
Sony’s IC Recorders have been popular small digital voice recorders for many consumers. The current models all use common recording formats like linear PCM WAVE files or MP3, but it wasn’t always so. One of the first models, the ICD-R100, would record to the ICS audio format, which was Sony’s original sound format for the IC Recorders. I am still looking for samples of this format. If you do have a need to convert it, Sony has free converter software.
The next generation of IC Recorders used a Memory Stick and therefore recorded audio to the MSV (Memory Stick Voice) format. There were actually two different types of MSV files: the first used the ADPCM codec and the next used the LPEC codec. Later IC Recorders would record to DVF (Digital Voice Format), which also had a couple of versions, one using the LPEC codec and the other the older TRC codec.
AFAIK, none of the codecs used in these file formats have been made public, and the formats are not readable by tools such as MediaInfo. The only way to know the details of a file, or to play or convert it, is to use Sony software which has been discontinued; the replacement, Sound Organizer, can only recognize the LPEC versions of MSV and DVF. There is also a plugin for Windows Media Player available here, which is required even for Switch to work.
PRONOM currently has one signature covering the LPEC versions of MSV and DVF, so let’s look closer at the formats and see if we can determine what they are from the header.
The current software for managing audio files from an IC Recorder is Sound Organizer. The software does open and convert some MSV/DVF files, as long as they use the LPEC codec; see the list of Sound Organizer compatible formats.
Also note, Sony made one ICD-CX series recorder which could also capture photos. It requires the Visual & Voice Player software. Audio is recorded in the DVF format.
Test Data Set
In order to explore the different formats I first needed to gather some samples. There are a few out there, but with the Digital Voice Editor 3 software I was able to take a sample file and convert it to the many options available. You can see in the screenshot below the different samples, their extensions, and the codec used. You can find my samples on GitHub here.
All MSV and DVF files have a similar pattern: the first 32 bytes contain the text string “MS_VOICE SONY CORPORATION”. In between MS_VOICE and SONY there are 4 bytes which vary slightly between the different formats. Here is a table of samples and those 4 bytes so we can see the differences.
Model           CODEC    EXTENSION  Hex Values
ICD-Px0         TRC      DVF        01020000
ICD-Px8         TRC      DVF        01020000
ICD-Px7         TRC      DVF        01020000
ICD-SXxx0       LPEC     MSV        01030000
ICD-SXx8        LPEC     MSV        01030000
ICD-SXx7        LPEC     MSV        01030000
ICD-SXx6        LPEC     DVF        01020000
ICD-SXx5        LPEC     DVF        01020000
ICD-SXx0        LPEC     DVF        01020000
ICD-MX          LPEC     MSV        01020000
ICD-BM          LPEC     MSV        01020000
ICD-ST          LPEC     DVF        01020000
ICD-MS5xx       LPEC     MSV        01010000
ICD-S           LPEC     MSV        01010000
ICD-BPx50       LPEC     DVF        01010000
ICD-BP100/x20   LPEC     DVF        01010000
ICD-MS1/MS2     ADPCM    MSV        01000000
ICD-R100/R200   Unknown  ICS
There is an obvious pattern to the hex values as they increment: 0100, 0101, 0102, and 0103. But there is some overlap between extension and codec, so this is probably more of a version number than something specific to the codec. Currently the PRONOM signature for this format, fmt/472, has the pattern for the 0102 version but none of the others. We could simply add a variable in the signature for the different values and update the PRONOM signature so more samples would be identified. This would work well if there were a secondary characterization process to get technical metadata such as the codec and quality, but I am unaware of any tool that gathers this information from the format, so I wondered if we could find hints in the file to identify the codec, giving us multiple PRONOM signatures to choose from. Also, you can see from the screenshot above that some of the LPEC formats have specific model numbers in the codec column, which could mean they are not exactly the same. Each IC Recorder model has different quality settings, and it appears some settings may not be compatible with other models.
Looking beyond the first 16 bytes, there are a lot of hex values whose meaning is unknown. A close comparison of all the samples led me to the 4 bytes at offset 60, which seem to be the same for files with the same settings. Below is a chart of those values.
Extension  CODEC                              Quality  Offset 60
DVF        TRC                                HQ       00300001
DVF        TRC                                SP       00350001
DVF        TRC                                LP       00370001
DVF        LPEC (ICD-BP-100/x20)              SP       00150001
DVF        LPEC (ICD-BP-100/x20)              LP       00190001
DVF        LPEC                               SP       002A0001
DVF        LPEC                               LP       002C0001
MSV        LPEC (ICD-BM/MX/SXx7/SXx8/SXxx0)   SP       004A0001
MSV        LPEC (ICD-BM/MX/SXx7/SXx8/SXxx0)   LP       004C0001
MSV/DVF    LPEC (ICD-SXx7/SXx8/SXxx0)         STHQ     00200002
MSV/DVF    LPEC (ICD-SXx7/SXx8/SXxx0)         ST       00240002
MSV        ADPCM                              SP       00050001
MSV        ADPCM                              LP       00090001
Just to be sure this value at offset 60 was indeed an indication of codec and quality, I manually swapped the 4 bytes from an LPEC ST file with those of a TRC HQ file. Sure enough, the software then saw the file as a TRC HQ audio file, even though the original is a stereo file.
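Pulling the two tables together, a rough characterization sketch might look like the following. It only covers the recorders and settings I was able to sample, the offsets are taken from the observations above, and the filename is just an example.

# Sketch of a characterization check built from the two tables above.
# The values cover only the recorders and settings sampled here;
# other combinations almost certainly exist.
VERSION_BYTES = {
    b"\x01\x00\x00\x00": "ADPCM era (ICD-MS1/MS2)",
    b"\x01\x01\x00\x00": "early LPEC (ICD-MS5xx/S/BP)",
    b"\x01\x02\x00\x00": "LPEC or TRC",
    b"\x01\x03\x00\x00": "later LPEC (ICD-SXx7/SXx8/SXxx0)",
}
CODEC_QUALITY = {
    b"\x00\x30\x00\x01": ("TRC", "HQ"),
    b"\x00\x35\x00\x01": ("TRC", "SP"),
    b"\x00\x37\x00\x01": ("TRC", "LP"),
    b"\x00\x15\x00\x01": ("LPEC (ICD-BP-100/x20)", "SP"),
    b"\x00\x19\x00\x01": ("LPEC (ICD-BP-100/x20)", "LP"),
    b"\x00\x2A\x00\x01": ("LPEC", "SP"),
    b"\x00\x2C\x00\x01": ("LPEC", "LP"),
    b"\x00\x4A\x00\x01": ("LPEC (ICD-BM/MX/SXx7/SXx8/SXxx0)", "SP"),
    b"\x00\x4C\x00\x01": ("LPEC (ICD-BM/MX/SXx7/SXx8/SXxx0)", "LP"),
    b"\x00\x20\x00\x02": ("LPEC (ICD-SXx7/SXx8/SXxx0)", "STHQ"),
    b"\x00\x24\x00\x02": ("LPEC (ICD-SXx7/SXx8/SXxx0)", "ST"),
    b"\x00\x05\x00\x01": ("ADPCM", "SP"),
    b"\x00\x09\x00\x01": ("ADPCM", "LP"),
}

def describe_msv_dvf(path):
    with open(path, "rb") as f:
        header = f.read(64)
    if not header.startswith(b"MS_VOICE"):
        return None
    version = header[8:12]      # the 4 bytes between MS_VOICE and SONY
    codec, quality = CODEC_QUALITY.get(header[60:64], ("unknown", "unknown"))
    return VERSION_BYTES.get(version, "unknown version"), codec, quality

print(describe_msv_dvf("sample.msv"))  # example filename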
There is a very good chance these are not all the options; I only have one physical recorder, which only records in mono. But this gives us a really good idea of how to tell the difference between the files. Below are the patterns I am submitting to PRONOM.
This is one example of a file format with a proprietary component that was never released by the vendor. When the vendor stopped supporting the software needed to open and read these formats, the long-term preservation risk increased. It would be really nice if, when a vendor discontinues a technology that was used by consumers, they would make the documentation for the format openly available. If you know more about the format, or have samples which don’t match the patterns mentioned here, please reach out.