One of the earliest hypermedia systems, predating the World Wide Web, was HyperCard on the Macintosh. Within minutes you could have a small application to do just about anything: a calendar, an address book, interactive books, games, and more. The Internet Archive has collected many HyperCard stacks and emulates them directly in the browser.
Riding on the success of HyperCard was another hypermedia tool called MetaCard, which later became Runtime Revolution. Today it is known as LiveCode, a cross-platform application development system. LiveCode is often used to quickly create applications which can run on many platforms, including iOS. It is popular with students and in higher education. The LiveCode source was opened for a time, prompted by a successful Kickstarter campaign, but was closed again in 2021 as the company struggled to retain paying customers.
Each major version of LiveCode produced its own file format. Currently none of the formats can be identified using preservation tools. Luckily, because the code was open source for a time, we have details which help us identify the formats. Let’s take a look:
I took LiveCode up on their 10-day trial and was able to install software version 9.6.9 to save some samples. The software has a “Save as” option which allows you to save your code in older formats. Be careful, though, as saving to an older version may cause some data loss.
The samples I was able to save had matching headers just like in the source code. The REVO string starts right at the beginning of the file making identification easy. Take a look at my GitHub page for samples and signature. Also check out the File Format Wiki Page for more information and more samples!
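As a rough illustration of that check, here is a minimal Python sketch. It assumes only what the samples show, namely the ASCII string REVO at offset 0. The function names are my own, and the bytes that follow REVO (which differ between versions) are not modeled here.

```python
def looks_like_livecode_stack(data: bytes) -> bool:
    """Return True if the data begins with the ASCII string 'REVO'."""
    return data.startswith(b"REVO")


def check_file(path: str) -> bool:
    """Read just enough of a file to test for the REVO header."""
    with open(path, "rb") as handle:
        # Only the first four bytes are needed for this check.
        return looks_like_livecode_stack(handle.read(4))
```

A real signature would go on to use the bytes after REVO to tell the versions apart.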
A while back I was asked to look at a file in our repository which had the extension OMF. It was not identified by DROID and didn’t appear to be in PRONOM. It didn’t take long to find quite a bit of information on the file format, as it was used by many important software titles, or at least it used to be. Exploring the details of this file format led me down quite the rabbit hole. You see, the OMFI format is based on a container format that was once heralded as a better, open alternative to the Microsoft OLE container format, which was growing in popularity.
OpenDoc
This all started with OpenDoc, a multi-platform approach to an open document format begun by Apple Computer in the early 1990s. It was originally an alliance between Apple, IBM, and Motorola. The idea was to have a framework any developer could use to develop software or components that would all work seamlessly together. Many developers were on board initially, with many promised software titles in development, but ultimately, with much confusion surrounding the framework and Steve Jobs’s return to Apple in 1997, the project was scrapped.
Bento
The storage format to be used with OpenDoc was called Bento, in reference to the Japanese style of a compartmentalized container tray. Specifications were released in 1993.
There are five key ideas in the Bento format:
everything in the container is an object,
objects have persistent IDs,
all the metadata lives in the TOC (Table of Contents),
objects consist entirely of values, and
each value knows its own property, type, and data location.
The idea of a data model with such an organized structure was so appealing that the digital preservation community was excited to push for a Universal Preservation Format, specifically for multimedia, based on Bento. The idea was presented to AMIA in 1996!
Open Media Framework (OMF) Interchange
Avid Technology, a leader in audio/video editing systems, used the Bento specification to design a container format for multimedia. This allowed easy interchange of projects between many different software titles. The original specifications were published in 1994, and the 2.1 specifications were released in 1997. Software titles such as Pro Tools, Cubase, Adobe Audition, Adobe Premiere, Apple Logic Pro, Apple Final Cut Pro, and many others supported the OMF format, at least for a while. OMFI was migrated to Microsoft’s Structured Storage container format to form the core of the Advanced Authoring Format (AAF) in the late 1990s.
Identification
In order to identify an OMF file we first need to understand what is part of the OMF specifications and what is part of the Bento format. OpenDoc may not have lived very long but the Bento format held on long enough to be the structure used by a few different file formats. I am aware of the following, but there was other software being developed at the time.
Samples from each of these formats show some similar patterns. In the Bento specifications we can see:
The only version of the specifications I can find is version 1.0d5, released in 1993, but we know there was also a version 2 released later. The magic bytes are not defined in the 1.0d5 spec, but looking at the code in the OpenDoc Developer Release from 1996, we can find a reference to the magic bytes used in “Containr.h”.
The Bento specification also explains where this header information lives: “Our solution to this is to define the standard Bento format to have the label at the end of the container.” This means the byte sequence will frequently be found at the end of the file. The “CM” refers to “Container Manager” and “Hdr” refers to “Header”.
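Because the label sits at the end of the container, a check for it has to look at the tail of the file and not the start. Here is a minimal Python sketch of that idea. The function name and the 64-byte window are my own assumptions, and the label is passed in by the caller; it should be the magic byte sequence from “Containr.h”, which I do not reproduce here.

```python
def has_trailing_label(data: bytes, label: bytes, tail_size: int = 64) -> bool:
    """Return True if `label` occurs within the last `tail_size` bytes.

    `tail_size` is an assumed search window, not a value from the
    Bento specification.
    """
    if tail_size <= 0:
        raise ValueError("tail_size must be positive")
    # Slicing with a negative start returns the whole buffer when the
    # data is shorter than the window, which is what we want here.
    return label in data[-tail_size:]
```

Searching a window, and not one exact offset, is a deliberate choice for this sketch, since the position of the magic bytes within the label is not stated above.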
Now that we have the magic bytes for the Bento container, we can look at what makes an OMF file different from the others. We can find the answer in the OMF Interchange specifications.
We know that every Bento container must have an object, and in version 1.0 of the specifications, on page 65, we find:
Each object must have the property OMFI:ObjID. The value of OMFI:ObjID is required and is listed in the property description for each object.
The OMFI:ObjID can also be found in version 2.0 of the specification, but in addition it defines:
The OMFI:ObjID property has been renamed the OMFI:OOBJ:ObjClass property, which eliminates the concept of generic properties and makes the class model easier to understand. The name ObjClass is more descriptive because the property identifies the class of the object rather than containing an ID number for the object.
Since both are required it seems appropriate to use those strings for identification in a PRONOM signature. You can check out the proposed signature and samples on my GitHub page.
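To show how those two strings could be used, here is a small Python sketch that reports which of them appear in a file. The function name is hypothetical, and scanning the whole file is an assumption made for simplicity; an actual PRONOM signature would combine these strings with the Bento magic bytes and sensible offsets.

```python
# Property names quoted from the OMF Interchange specifications.
OMF_MARKERS = (b"OMFI:ObjID", b"OMFI:OOBJ:ObjClass")


def find_omf_markers(data: bytes) -> list[str]:
    """Return the OMF property names that occur anywhere in the data."""
    return [marker.decode("ascii") for marker in OMF_MARKERS if marker in data]
```

An empty list means neither property name was found, so the file is probably some other Bento-based format.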
There is so much history wrapped up in these formats and the potential they had to change how we preserve files in our archives. Luckily we have the Internet Archive Wayback Machine to help us discover or remember ideas that once existed, some of which may find their way back to inspire future file formats.
Sony’s IC Recorders have been popular small digital voice recorders for many consumers. The current models all use common recording formats like Linear PCM WAVE files or MP3, but it wasn’t always so. One of the first models, the ICD-R100, recorded to the ICS audio format, which was Sony’s original sound format for the IC Recorders. I am still looking for samples of this format. If you do have a need to convert this format, Sony has free converter software.
The next generation of IC Recorders used a Memory Stick and therefore recorded audio to the MSV (Memory Stick Voice) format. There were actually two different types of MSV files: the first used the ADPCM codec and the next used the LPEC codec. Later IC Recorders would record to the DVF (Digital Voice Format), which also had a couple of versions, one using the LPEC codec and the other the older TRC codec.
AFAIK, none of the codecs used in these file formats has been made public, and these formats are not readable by tools such as MediaInfo. The only way to learn the details of a file, or to play or convert it, is to use Sony software. The original software has been discontinued, and its replacement, Sound Organizer, can only recognize the LPEC codec versions of MSV and DVF. There is also a plugin for Windows Media Player available here, which is required even for Switch to work.
PRONOM currently has one signature for the LPEC versions of MSV and DVF, so let’s look closer at the formats and see if we can determine what they are from the header.
The current software for managing audio files from IC Recorders is Sound Organizer. The software does open and convert some MSV/DVF files, as long as they use the LPEC codec (see Sound Organizer Compatible formats).
Also note, Sony made one ICD-CX series recorder which could also capture photos. It requires the Visual & Voice Player software. Audio is recorded in the DVF format.
Test Data Set
In order to explore the different formats I first needed to gather some samples. There are a few out there, but with the Digital Voice Editor 3 software, I was able to take a sample file and convert it to the many options available. You can see in the screenshot below, the different samples, their extension and the codec used. You can find my samples in GitHub here.
All MSV and DVF files have a similar pattern. The first 32 bytes contain the text string “MS_VOICE SONY CORPORATION”. In between MS_VOICE and SONY, there are 4 bytes which vary slightly between the different formats. Here is a table of samples and the 4 bytes so we can see the differences.
| Model | Codec | Extension | Hex Values |
|---|---|---|---|
| ICD-Px0 | TRC | DVF | 01020000 |
| ICD-Px8 | TRC | DVF | 01020000 |
| ICD-Px7 | TRC | DVF | 01020000 |
| ICD-SXxx0 | LPEC | MSV | 01030000 |
| ICD-SXx8 | LPEC | MSV | 01030000 |
| ICD-SXx7 | LPEC | MSV | 01030000 |
| ICD-SXx6 | LPEC | DVF | 01020000 |
| ICD-SXx5 | LPEC | DVF | 01020000 |
| ICD-SXx0 | LPEC | DVF | 01020000 |
| ICD-MX | LPEC | MSV | 01020000 |
| ICD-BM | LPEC | MSV | 01020000 |
| ICD-ST | LPEC | DVF | 01020000 |
| ICD-MS5xx | LPEC | MSV | 01010000 |
| ICD-S | LPEC | MSV | 01010000 |
| ICD-BPx50 | LPEC | DVF | 01010000 |
| ICD-BP100/x20 | LPEC | DVF | 01010000 |
| ICD-MS1/MS2 | ADPCM | MSV | 01000000 |
| ICD-R100/R200 | Unknown | ICS | |
There is an obvious pattern to the hex values as they increment: 0100, 0101, 0102, and 0103. But there is some overlap between extension and codec, so the value is probably more of a version number than something specific to the codec. Currently the PRONOM signature for this format, fmt/472, has the pattern for the 0102 version, but none of the others. We could simply add a variable to the signature for the different values and update the PRONOM signature so more samples would be identified. This would work well if there were a secondary characterization process to get technical metadata such as the codec and quality, but I am unaware of any tool that gathers this information from the format. So I wonder if we can find any hints in the file to identify the codec, giving us multiple PRONOM signatures to choose from. Also, you can see from the screenshot above that some of the LPEC formats have specific model numbers in the codec column, which could mean they are not exactly the same. Each IC Recorder model has different quality settings, and it appears some settings may not be compatible with other models.
Looking beyond the first 16 bytes, there are a lot of hex values which are unknown. A close comparison of all the samples leads me to the 4 bytes at offset 60. They seem to be the same for files with the same settings. Below is a chart of those values.
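To make the header layout concrete, here is a minimal Python sketch that pulls out those 4 bytes. It assumes MS_VOICE starts at offset 0, the 4 varying bytes follow it directly at offsets 8 to 11, and SONY CORPORATION appears somewhere in the rest of the first 32 bytes. The function name is hypothetical, and the exact offsets are my reading of the samples, not a published specification.

```python
from typing import Optional

MAGIC = b"MS_VOICE"
VENDOR = b"SONY CORPORATION"


def read_ms_voice_version(data: bytes) -> Optional[str]:
    """Return the 4 bytes after MS_VOICE as hex, or None if no match."""
    header = data[:32]
    if not header.startswith(MAGIC):
        return None
    # Assumption: the vendor string follows the 4 varying bytes.
    if VENDOR not in header[len(MAGIC) + 4:]:
        return None
    return header[len(MAGIC):len(MAGIC) + 4].hex().upper()
```

For a DVF file from an ICD-Px0 recorder this would return the string 01020000, matching the table above.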
| Extension | Codec | Quality | Offset 60 |
|---|---|---|---|
| DVF | TRC | HQ | 00300001 |
| DVF | TRC | SP | 00350001 |
| DVF | TRC | LP | 00370001 |
| DVF | LPEC (ICD-BP-100/x20) | SP | 00150001 |
| DVF | LPEC (ICD-BP-100/x20) | LP | 00190001 |
| DVF | LPEC | SP | 002A0001 |
| DVF | LPEC | LP | 002C0001 |
| MSV | LPEC (ICD-BM/MX/SXx7/SXx8/SXxx0) | SP | 004A0001 |
| MSV | LPEC (ICD-BM/MX/SXx7/SXx8/SXxx0) | LP | 004C0001 |
| MSV/DVF | LPEC (ICD-SXx7/SXx8/SXxx0) | STHQ | 00200002 |
| MSV/DVF | LPEC (ICD-SXx7/SXx8/SXxx0) | ST | 00240002 |
| MSV | ADPCM | SP | 00050001 |
| MSV | ADPCM | LP | 00090001 |
Just to be sure this value at offset 60 was indeed an indication of codec and quality, I manually swapped the 4 bytes in an LPEC ST file for those from a TRC HQ file. Sure enough, the software now saw the file as a TRC HQ audio file, even though the original is a stereo file.
There is a very good chance these are not all the options. I only have one physical recorder, which only records in mono. But this gives us a really good idea of how to tell the difference between files. Below are the patterns I am submitting to PRONOM.
This is one example of a file format with a proprietary component which the vendor never released. When the vendor stopped supporting the software needed to open and read these formats, the risk to long-term preservation increased. It would be really nice if, when a vendor discontinues a technology used by consumers, they made the documentation for the format openly available. If you know more about the format, or if you have samples which don’t match the patterns mentioned here, please reach out.
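The chart of offset 60 values can be turned into a simple lookup. Here is a Python sketch of that, using only the values I observed in my samples. The function and table names are hypothetical, and a value missing from the table simply means I have not seen it, not that the file is invalid.

```python
from typing import Optional, Tuple

# Observed values at offset 60, mapped to (codec, quality).
OFFSET_60_SETTINGS = {
    "00300001": ("TRC", "HQ"),
    "00350001": ("TRC", "SP"),
    "00370001": ("TRC", "LP"),
    "00150001": ("LPEC (ICD-BP-100/x20)", "SP"),
    "00190001": ("LPEC (ICD-BP-100/x20)", "LP"),
    "002A0001": ("LPEC", "SP"),
    "002C0001": ("LPEC", "LP"),
    "004A0001": ("LPEC (ICD-BM/MX/SXx7/SXx8/SXxx0)", "SP"),
    "004C0001": ("LPEC (ICD-BM/MX/SXx7/SXx8/SXxx0)", "LP"),
    "00200002": ("LPEC (ICD-SXx7/SXx8/SXxx0)", "STHQ"),
    "00240002": ("LPEC (ICD-SXx7/SXx8/SXxx0)", "ST"),
    "00050001": ("ADPCM", "SP"),
    "00090001": ("ADPCM", "LP"),
}


def describe_recording(data: bytes) -> Optional[Tuple[str, str]]:
    """Look up codec and quality from the 4 bytes at offset 60."""
    if len(data) < 64:
        return None
    key = data[60:64].hex().upper()
    return OFFSET_60_SETTINGS.get(key)
```

Swapping the 4 bytes, as in my experiment, changes what this lookup reports in the same way it changed what the Sony software reported.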
The first version of Microsoft Excel was released on Macintosh in 1985. Before that there was MultiPlan.
MultiPlan version 4 and Excel version 2 used the well-known and documented BIFF format. Before BIFF2 the formats are a bit of a mystery. AFAIK, Microsoft never released any documentation on the file format used for Excel version 1 and MultiPlan 1-3; they emphasized using the SYLK format for interchange. To make matters worse, there were upwards of 100 different versions of the early MultiPlan, ported to dozens of different systems. Some of them are discussed on the TRS-80 website.
Or you can take MultiPlan 1.06 for a spin over at PCjs!
Needless to say, documenting the early versions of MultiPlan and Excel 1 and finding a pattern which could be used to identify them is difficult. These versions are missing from the PRONOM registry, but hopefully with enough samples, some patterns can be found to confidently identify formats from the early days of spreadsheets!
Marco Pontello’s TrID identifier software has signatures for the early MultiPlan and Excel formats. His software scans for patterns in samples and finds commonalities between them, so the more samples he can scan, the more accurate the identification can be.
There seem to be some patterns between versions, but also some major differences. Without a specification or an understanding of the system the samples were created on, it is hard to identify these formats with certainty. There could be hex values which are the same for the samples we have but different for others. Headers often hold values indicating dates or the length of the file, so finding variations between files is key to a good signature.
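Comparing samples byte by byte is easy to sketch in Python. The function below is my own simplified illustration of the general idea and is not how TrID is actually implemented. It reports the offsets, within an assumed 64-byte window, where every sample has the same value.

```python
def common_byte_positions(samples: list[bytes], limit: int = 64) -> dict[int, int]:
    """Map each offset to its byte value where all samples agree.

    `limit` is an assumed window size; only the first `limit` bytes
    (or the shortest sample, if smaller) are compared.
    """
    if not samples:
        return {}
    length = min(limit, min(len(sample) for sample in samples))
    common = {}
    for offset in range(length):
        value = samples[0][offset]
        if all(sample[offset] == value for sample in samples):
            common[offset] = value
    return common
```

Offsets that drop out of the result as more samples are added are the ones likely to hold dates, lengths, or other variable data, which is why a larger and more varied sample set gives a more trustworthy signature.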
Keep an eye on my GitHub PRONOM Research folder as I add more samples and prepare a signature for PRONOM.
Originally the Adobe Illustrator format (AI) was based on PostScript, with each file having a PostScript header. This all changed with Illustrator version 9, released in the year 2000, which moved to PDF as its core.
Even though AI files begin with a PDF header, there is much more to them which makes them a unique file format. So as Dov Isaacs put it, “PDF files are not Adobe Illustrator files and vice versa”.
In digital preservation, identifying a file format is vital to the process. It is also important to identify when the format changes over time in order to properly maintain that file. Adobe Illustrator files created in version 8 or earlier are substantially different from those created in version 9 and later, and will need different software to render properly.
This is where identification tools come in handy.
Using the “file” command in a CLI we get:
Illustrator9v8-s04.ai: PostScript document text conforming DSC level 3.0
Illustrator9-s04.ai: PDF document, version 1.4, 1 pages
While partly true, we need more specific identification if we want to properly preserve these files in the long term. Enter PRONOM, a file format registry with signatures used to identify file formats. Using a tool like DROID or Siegfried with the PRONOM registry, we can get better identification.
siegfried : 1.10.0
scandate : 2023-04-07T11:59:18-06:00
signature : default.sig
created : 2023-03-23T15:09:43Z
identifiers :
  - name : 'pronom'
    details : 'DROID_SignatureFile_V111.xml; container-signature-20230307.xml'
---
filename : 'Illustrator9-s04.ai'
filesize : 77829
modified : 2023-04-07T11:04:53-06:00
errors :
matches :
  - ns : 'pronom'
    id : 'fmt/558'
    format : 'Adobe Illustrator'
    version : '9.0'
    mime : 'application/postscript'
    class : 'Image (Vector)'
    basis : 'extension match ai; byte match at [[0 8] [1536 557]]'
    warning :
---
filename : 'Illustrator9v8-s04.ai'
filesize : 323748
modified : 2023-04-07T11:05:11-06:00
errors :
matches :
  - ns : 'pronom'
    id : 'fmt/557'
    format : 'Adobe Illustrator'
    version : '8.0'
    mime : 'application/postscript'
    class : 'Image (Vector)'
    basis : 'extension match ai; byte match at 0, 673'
    warning :
This identification is possible because of signatures built for the file format specific to each version. The file format wiki has a list of the current signatures for the Illustrator format. The problem is, the last signature added to PRONOM was for version 16 (CS6). Since then there have been more changes to the format.
If we attempt to identify an Illustrator file created with the current 2023 software, we get this result.
filename : 'Illustrator2023-s01.ai'
filesize : 1195445
modified : 2023-02-16T12:29:16-07:00
errors :
matches :
  - ns : 'pronom'
    id : 'fmt/20'
    format : 'Acrobat PDF 1.6 - Portable Document Format'
    version : '1.6'
    mime : 'application/pdf'
    class : 'Page Description'
    basis : 'byte match at [[0 8] [1195439 5]]'
    warning : 'extension mismatch'
Although this is technically correct, as the Illustrator file has a PDF 1.6 header, the identification needs to recognize this as an Illustrator file. If we create a new signature using the following hexadecimal pattern, we get this result:
filename : 'Illustrator2023-s01.ai'
filesize : 1195445
modified : 2023-02-16T12:29:16-07:00
errors :
matches :
  - ns : 'pronom'
    id : 'BYUDev/3'
    format : 'Adobe Illustrator CC 2020'
    version : '24.2+'
    mime : 'application/postscript'
    class :
    basis : 'extension match ai; byte match at [[0 8] [8766 45] [45347 348]]'
    warning :
Let’s break down the hexadecimal pattern. The “*” is a wildcard indicating there can be zero or more bytes in between.
255044462D312E36 translates to: %PDF-1.6
3C696C6C7573747261746F723A547970653E446F63756D656E743C2F696C6C7573747261746F723A547970653E translates to: <illustrator:Type>Document</illustrator:Type>
252150532D41646F62652D332E30 translates to: %!PS-Adobe-3.0
254149355F46696C65466F726D6174203134 translates to: %AI5_FileFormat 14
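Here is a minimal Python sketch of how a pattern with “*” wildcards between segments behaves. It is an illustration only, not how DROID or Siegfried implement matching. It assumes the first segment is anchored at the start of the file and that each later segment may appear anywhere after the previous one. The names are hypothetical.

```python
# The four segments of the pattern, decoded from their hex form.
SEGMENTS = [
    bytes.fromhex("255044462D312E36"),
    bytes.fromhex(
        "3C696C6C7573747261746F723A547970653E446F63756D656E74"
        "3C2F696C6C7573747261746F723A547970653E"
    ),
    bytes.fromhex("252150532D41646F62652D332E30"),
    bytes.fromhex("254149355F46696C65466F726D6174203134"),
]


def matches_in_order(data: bytes, segments: list[bytes]) -> bool:
    """Check that the segments occur in order, with any gap between them."""
    if not segments or not data.startswith(segments[0]):
        return False
    position = len(segments[0])
    for segment in segments[1:]:
        found = data.find(segment, position)
        if found == -1:
            return False
        # Continue searching after the end of this segment.
        position = found + len(segment)
    return True
```

Because each search starts where the previous segment ended, the segments must appear in the stated order, but the gaps between them can be any size.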
Identification is based first on the PDF header, then on some XMP metadata indicating this is an Illustrator document, then on the PostScript header, and finally on the version identifier. Every Illustrator release since version 5 has had a file format version. When Adobe switched from the CS labels to CC, they stuck with format version 13 until 2020, when the format was changed to version 14. There is one catch: when Illustrator version 24 (2020) was first released, it used format version 14 but still had the PDF 1.5 header. This was changed in version 24.2 to a PDF 1.6 header, which added a bigger canvas size.
In the current PRONOM signatures going back to version 9, some offsets were assumed for the space between the PDF header, PostScript header, and version number. Looking through many samples, I have found quite a few which fall outside those offsets, especially as the size of the AI file gets larger. Therefore I am suggesting the “*” wildcard between all segments.
One area that still needs a bit more research is Illustrator versions 9-12 (CS2). These do not include the XMP metadata indicating they are Illustrator documents, so they will more often get misidentified as PDF. I did find, however, that AI files have the string “/AIPrivateData”, while files saved as PDF have “/AIPDFPrivateData”. So the signature will have this string added to distinguish the two.
Another anomaly is some samples I found on the Illustrator 9 CD-ROM. Illustrator 9 was released in June of 2000, but many of these files were created in February of 2000. They have a PDF 1.4 header but a format version of 4, which is what version 8 uses. So these files were probably created with an early build of Illustrator 9, and the format was incremented to 5 in the public release.
You can see my submission suggestion on my GitHub page along with the PRONOM signature and sample files. There are still a couple of tweaks I need to make, but let me know what you think.
Note: All Illustrator files, and PDFs saved with Illustrator compatibility checked, include a section of the file called “AI Private Data”; this is where all the Illustrator data lives. It includes a “creator” version and a “container” version which could also be used to identify an Illustrator file’s version.
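The distinction between the two strings can be sketched in a few lines of Python. The function name and the returned labels are hypothetical. I have not checked whether a single file can contain both strings, so the sketch reports that case separately and makes no guess about it.

```python
def classify_by_private_data(data: bytes) -> str:
    """Classify a PDF-based file by which private-data key it contains."""
    has_ai = b"/AIPrivateData" in data
    has_pdf = b"/AIPDFPrivateData" in data
    if has_ai and has_pdf:
        # Not covered by my samples; flag it for a closer look.
        return "ambiguous"
    if has_ai:
        return "Adobe Illustrator"
    if has_pdf:
        return "PDF saved from Illustrator"
    return "unknown"
```

Note that neither string contains the other, so the two checks cannot trigger on the same bytes by accident.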