Daisy

October 4, 2024 by Thor Leave a comment

A single file can often be self contained, having all that is needed to render itself with the correct software, but more and more often files need other files to function properly. Sometimes these groups of dependent files are within a container, such as a DOCX or ePub, but can also be found all sitting nicely in a folder. I say nicely, partly because the structure works, that is until they are treated as individual files and renamed or moved around breaking that interdependence on each other.

In the case of many Apple bundle files, they appear to be a single file when using on the MacOS, but as a folder on Windows or Linux. This can be very confusing. In other cases such as the DAISY Digital Talking Book format, it is simply a folder or disc with a few or many files within.

Current tools used to identify file formats, such as DROID, look at individual files, not groups of files to determine format. Each file within a folder may have a unique format, but when grouped with other specific formats they become something more. We will have to work on enhancing current tools if we want to avoid breaking these format types and losing their ability to render properly.

DAISY, or Digital Accessible Information System, is a type of Digital Book. The format was originally conceived in 1988 as a method to create a talking book, designed for the purpose of giving those who are visually impaired the ability to listen to books. It wasn’t until 1996, the DAISY Consortium was created in order to take the technology to those who needed it. The original version of the the DAISY format in 1994 was proprietary, but once they formed the consortium, they decided to adopt open standards for the format and in 1998, the DAISY 2.0 standard was released. You can read more on the Library of Congress Format Description page.

Lets take a look at a folder containing a DAISY 2.0 book.

ls -la "DAISY 2.02 export"
total 536
drwx------  1 tyler  staff   16384 Sep 25 22:06 .
drwx------  1 tyler  staff   16384 Sep 25 22:06 ..
-rwx------@ 1 tyler  staff    1090 Sep 25 22:05 0002.smil
-rwx------  1 tyler  staff  228413 Sep 25 22:05 aud0001.mp3
-rwx------@ 1 tyler  staff     672 Sep 25 22:05 master.smil
-rwx------  1 tyler  staff    1703 Sep 25 22:05 ncc.html

We can see three different formats in this folder. The obvious well known MP3 files and an HTML file. We also see two files with the extension SMIL.

“Synchronized Multimedia Integration Language” or SMIL is a W3C XML standard used to describe multimedia presentations. It is used in the DAISY DTB as well as other applications, but we will focus on DAISY, and it is in its third version. A SMIL file has this structure:

<?xml version="1.0"?>
<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 1.0//EN" "http://www.w3.org/TR/REC-smil/SMIL10.dtd">
<smil>
  <head>
    <meta name="dc:title" content="Obi Project" />
    <meta name="dc:identifier" content="589c550e-303b-4c0d-9921-ae76d782fd53" />
    <meta name="ncc:generator" content="Obi v5.0.0.0 with toolkit: UrakawaSDK.core v2.0.0.0 (http://urakawa.sf.net/obi)" />
    <meta name="dc:format" content="Daisy 2.02" />
    <meta name="ncc:timeInThisSmil" content="00:00:28" />
    <layout>
      <region id="textView" />
    </layout>
  </head>
  <body>
    <ref title="Testing" src="0002.smil" id="ms_0002" />
  </body>
</smil>

A standard XML file with a link to a SMIL DTD and a root tag of <smil>. This format is recognized by PRONOM as fmt/205, although is often identified as a standard XML file. It seems the signature was created with a small offset which works with some SMIL files, but the gap between the end of the XML declaration and the start of the <smil> tag is only 20-86 bytes, not enough to allow for different character sets and full DTD URL’s. We will have to increase this gap in order to get all the SMIL files identified correctly.

With this update all the files in a DAISY 2.0 files should be identified individually, but as a set of files they make up the DAISY 2.0 format. This format requires the ncc.html file be present at the root of the folder or CD, so this file will aid in the manual identification of this format.

DAISY 3 was released in 2002 and standardized using the ANSI/NISO Z39.86 2002 name. It has been revised a couple times with the current revision being 2012. This update adds more functionality to the format with many new optional and required formats/files included in the folder. Here is a simple example:

ls -la "DAISY3 Export"
total 784
drwx------  1 tyler  staff   16384 Sep 25 22:06 .
drwx------  1 tyler  staff   16384 Sep 25 22:06 ..
-rwx------@ 1 tyler  staff     979 Sep 25 22:05 0001.smil
-rwx------  1 tyler  staff  228413 Sep 25 22:05 aud0001.mp3
-rwx------  1 tyler  staff    1014 Sep 25 22:05 navigation.ncx
-rwx------  1 tyler  staff    1881 Sep 25 22:05 package.opf
-rwx------  1 tyler  staff    7838 Nov  2  2020 tpbnarrator.res
-rwx------  1 tyler  staff  117656 Nov  2  2020 tpbnarrator_res.mp3

The SMIL format is still included, along with MP3’s, but we have some addition formats. The NCX or “Navigation Control File”, the OPF or “Package file”, and the RES or “Resource file” are a few of them. The NCX file is the first file accessed as it lays out the navigation for the whole DTB. It is also XML:

cat DAISY3 Export/navigation.ncx 
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE ncx PUBLIC "-//NISO//DTD ncx 2005-1//EN" "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">
<ncx
	version="2005-1"
	xml:lang="en-US" xmlns="http://www.daisy.org/z3986/2005/ncx/">

This file is only recognized by DROID as a standard XML file. It probably should have unique identification like SMIL and with a root tag of <ncx>, that should be fairly easy to add.

The Package file with the extension OPF, is actually a format used by the openebook group, not to be confused by a format used by the Open Preservation Foundation 🤣. The Open Packaging Format is used and a DTB conforming to this standard must include exactly one Package File which must be a valid XML 1.0 document conforming to the OEBF Publication Structure 1.2 package.

cat DAISY3 Export/package.opf   
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE package PUBLIC "+//ISBN 0-9673008-1-9//DTD OEB 1.2 Package//EN" "http://openebook.org/dtds/oeb-1.2/oebpkg12.dtd">
<package
	unique-identifier="uid" xmlns="http://openebook.org/namespaces/oeb-package/1.0/">
	<metadata>
		<dc-metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oebpackage="http://openebook.org/namespaces/oeb-package/1.0/">
			<dc:Identifier
				id="uid">589c550e-303b-4c0d-9921-ae76d782fd53</dc:Identifier>
			<dc:Format>ANSI/NISO Z39.86-2005</dc:Format>
			<dc:Title>Obi Project</dc:Title>
			<dc:Publisher>N/A</dc:Publisher>
			<dc:Language>en-US</dc:Language>
			<dc:Creator>Creator name</dc:Creator>
			<dc:Date>2024-09-25</dc:Date>
		</dc-metadata>

The OPF format is also unknown to PRONOM and they identify as standard XML files as well. The root tag of “<package>” could be used elsewhere so the signature may need to reference the OEB package information.

The RES Resource file is also a standard XML and can be identified through its root tag of “<resources>” and resources DOCTYPE.

cat DAISY3 Export/tpbnarrator.res 
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE resources PUBLIC "-//NISO//DTD resource 2005-1//EN" "http://www.daisy.org/z3986/2005/resource-2005-1.dtd" []>
<resources xmlns="http://www.daisy.org/z3986/2005/resource/" version="2005-1">
  
  <!-- SKIPPABLE NCX -->
  
  <scope nsuri="http://www.daisy.org/z3986/2005/ncx/">
    <nodeSet id="ns001" select="//smilCustomTest[@bookStruct='LINE_NUMBER']">
      <resource xml:lang="en" id="r001">
        <text>Row</text>
        <audio src="tpbnarrator_res.mp3" clipBegin="0:00:02.379" clipEnd="0:00:03.416" />
      </resource>
    </nodeSet>

Now, adding these DAISY 3.0 formats will greatly increase the identification of this complex format. But we run into a problem with some of the software out there which generates these DAISY files, some of them include files not required by the format, but are included to be used by the different software. This can include some CSS files for formatting, additional XML, XSL files, DTD’s, and for DAISY files created by the PlexTalk software, additional project files.

ls -la MasterCD/AfterBuild 
total 7520
drwx------@ 1 tyler  staff    16384 Sep 24 19:34 .
drwx------@ 1 tyler  staff    16384 Sep 25 22:11 ..
-rwx------@ 1 tyler  staff     6688 Sep 25 01:32 ImdPhrInfo.imph
-rwx------@ 1 tyler  staff     3773 Sep 25 01:32 ImdTxtTabl.imtt
-rwx------@ 1 tyler  staff     1276 Sep 25 01:32 Ncc.imdn
-rwx------@ 1 tyler  staff  3716618 Sep 25 01:32 a000001.mp3
-rwx------@ 1 tyler  staff     4352 Sep 25 01:32 ncc.html
-rwx------@ 1 tyler  staff     1015 Sep 25 01:32 ptk000001.smil
-rwx------@ 1 tyler  staff      938 Sep 25 01:32 ptk000002.smil

The ncc.html file is here, indicating a DAISY 2.0 format, along with an MP3 and SMIL files, but including some additional formats.

In addition, when creating a project, four files with the extensions Ncc.imdn, ImdPhrInfo.imph, ImdTxtTabl.imtt, and METADATA.ini are automatically created. These files are called “Plextalk project files.” They store table of contents information, etc. (Plextalk project files generated by older versions of this product do not have METADATA.ini.)
http://www.plextalk.com/jp/dw_data/PRSStd/PLEX_RS_UM.html

These four files may not be crucial to the playing of the Daisy format, but they are important to the PlexTalk software.

hexdump -C ImdPhrInfo.imph | head
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000020  ff ff ff ff ff ff ff ff  00 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  f0 a3 0d 00 00 00 00 00  |................|
00000040  a3 06 00 00 a4 06 00 00  00 00 00 00 53 00 00 00  |............S...|
00000050  ff ff ff ff 01 00 00 00  03 00 00 00 00 00 00 00  |................|
00000060  00 00 00 00 00 00 00 00  c5 11 00 00 20 1a 00 00  |............ ...|
00000070  e5 2b 00 00 00 00 00 00  63 00 00 00 ff ff ff ff  |.+......c.......|
00000080  02 00 00 00 04 00 00 00  00 00 00 00 00 00 00 00  |................|
00000090  00 00 00 00 e5 2b 00 00  d6 0b 00 00 bb 37 00 00  |.....+.......7..|

hexdump -C ImdTxtTabl.imtt | head 
00000000  17 00 00 00 32 30 30 34  2f 30 35 2f 33 31 2f 31  |....2004/05/31/1|
00000010  36 3a 36 3a 34 37 2e 30  30 30 00 03 00 00 00 65  |6:6:47.000.....e|
00000020  6e 00 0b 00 00 00 69 73  6f 2d 38 38 35 39 2d 31  |n.....iso-8859-1|
00000030  00 0d 00 00 00 5a 3a 2f  42 6f 6f 6b 44 69 72 34  |.....Z:/BookDir4|
00000040  2f 00 0d 00 00 00 5a 3a  2f 42 6f 6f 6b 44 69 72  |/.....Z:/BookDir|
00000050  34 2f 00 0c 00 00 00 61  30 30 30 30 30 31 2e 6d  |4/.....a000001.m|
00000060  70 33 00 0c 00 00 00 61  30 30 30 30 30 31 2e 6d  |p3.....a000001.m|
*
00000980  70 33 00 08 00 00 00 48  65 61 64 69 6e 67 00 01  |p3.....Heading..|
00000990  00 00 00 00 08 00 00 00  48 65 61 64 69 6e 67 00  |........Heading.|

hexdump -C Ncc.imdn | head       
00000000  01 ff 00 ff c4 00 00 00  3c 00 00 00 2c 00 00 00  |........<...,...|
00000010  14 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  00 00 00 00 49 6d 64 54  78 74 54 61 62 6c 2e 69  |....ImdTxtTabl.i|
00000030  6d 74 74 00 00 00 00 00  00 00 00 00 00 00 00 00  |mtt.............|
00000040  00 00 00 00 49 6d 64 50  68 72 49 6e 66 6f 2e 69  |....ImdPhrInfo.i|
00000050  6d 70 68 00 00 00 00 00  00 00 00 00 00 00 00 00  |mph.............|
00000060  00 00 00 00 04 00 00 00  00 fa 00 00 44 ac 00 00  |............D...|
00000070  01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000080  00 00 00 00 01 00 00 00  08 00 00 00 12 00 00 00  |................|
00000090  03 00 00 00 00 00 00 00  01 00 00 00 ff ff ff ff  |................|

I don’t have a METADATA.ini file to research, but I will be honest, these PlexTalk files will be hard to identify from their contents.

Looking at the IMPH file, there isn’t a lot of bytes which might indicate a format magic bytes. But I do see some patterns. The first 40 bytes all seem to be the same.

00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 FFFFFFFF FFFFFFFF

But making a signature from only 00 and FF might clash with other formats. It does appear that the 4 bytes FFFFFFFF occur every 40 bytes. This precision might be good enough if we repeat it a couple times.

The IMTT file is different. It appears to have information on the name, character set and all the files in the Daisy package. The first 4 bytes in my 14 samples either start with 17000000 or 18000000. Not knowing what the 17 or 18 refers to, I am hesitant to use it for identification. In between some of the data there is some consistent bytes, but at different offsets.


hexdump -C ImdTxtTabl.imtt | head
00000000  18 00 00 00 54 69 74 6c  65 00 35 39 2d 31 00 31  |....Title.59-1.1|
00000010  35 3a 35 34 3a 35 39 2e  32 36 30 00 03 00 00 00  |5:54:59.260.....|
00000020  65 6e 00 0b 00 00 00 69  73 6f 2d 38 38 35 39 2d  |en.....iso-8859-|
00000030  31 00 01 00 00 00 00 01  00 00 00 00 01 00 00 00  |1...............|
00000040  00 01 00 00 00 00 01 00  00 00 00 01 00 00 00 00  |................|
00000050  01 00 00 00 00 01 00 00  00 00 0c 00 00 00 4d 61  |..............Ma|
00000060  72 69 6f 6e 20 53 79 6d  65 00 28 00 00 00 4d 69  |rion Syme.(...Mi|
00000070  6e 75 74 65 73 20 6f 66  20 74 68 65 20 43 6f 6d  |nutes of the Com|
00000080  6d 69 74 74 65 65 20 4d  65 65 74 69 6e 67 20 32  |mittee Meeting 2|
00000090  34 30 35 30 34 00 08 00  00 00 48 65 61 64 69 6e  |40504.....Headin|

hexdump -C ImdTxtTabl.imtt | head
00000000  17 00 00 00 32 30 30 34  2f 30 35 2f 33 31 2f 31  |....2004/05/31/1|
00000010  36 3a 36 3a 34 37 2e 30  30 30 00 03 00 00 00 65  |6:6:47.000.....e|
00000020  6e 00 0b 00 00 00 69 73  6f 2d 38 38 35 39 2d 31  |n.....iso-8859-1|
00000030  00 0d 00 00 00 5a 3a 2f  42 6f 6f 6b 44 69 72 34  |.....Z:/BookDir4|
00000040  2f 00 0d 00 00 00 5a 3a  2f 42 6f 6f 6b 44 69 72  |/.....Z:/BookDir|
00000050  34 2f 00 0c 00 00 00 61  30 30 30 30 30 31 2e 6d  |4/.....a000001.m|
00000060  70 33 00 0c 00 00 00 61  30 30 30 30 30 31 2e 6d  |p3.....a000001.m|

Not sure what any of it means, but might be good enough for a signature.

Now the IMDN files might be a little easier:

hexdump -C Ncc.imdn | head
00000000  01 ff 00 ff d4 00 00 00  3c 00 00 00 2c 00 00 00  |........<...,...|
00000010  14 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  00 00 00 00 49 6d 64 54  78 74 54 61 62 6c 2e 69  |....ImdTxtTabl.i|
00000030  6d 74 74 00 00 00 00 00  00 00 00 00 00 00 00 00  |mtt.............|
00000040  00 00 00 00 49 6d 64 50  68 72 49 6e 66 6f 2e 69  |....ImdPhrInfo.i|
00000050  6d 70 68 00 00 00 00 00  00 00 00 00 00 00 00 00  |mph.............|
00000060  00 00 00 00 04 00 00 00  00 7d 00 00 22 56 00 00  |.........}.."V..|
00000070  01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000080  00 00 00 00 01 00 00 00  28 00 00 00 28 00 00 00  |........(...(...|
00000090  00 00 00 00 00 00 00 00  28 00 00 00 ff ff ff ff  |........(.......|

This format directly names the two other formats. Should be easy to look for the two file names in the header. The NCC html file in Daisy 2.0 and the NCX xml file in Daisy 3.0 are directory files so it makes sense this file would do the same.

Not sure if these signatures will hold up over time, but they are a start. It would be nice if all the files we are given to preserve would have convenient static magic bytes, but alas, many do not and we have to guess.

These Daisy formats illustrate a problem in preservation that doesn’t quite have a good solution. Each of these files are individually unique and can be identified, but as a whole they represent another unique format. Tying formats together to link their interdependence on each other will be no small task, but will be necessary not only to understanding the format, but to avoid separating the files, renaming, or rearranging breaking that interdependence.

I have added the update to SMIL and new signatures for the other formats to my GitHub repository. Feel free to test and change if you find additional samples or information.

FASTA & FASTQ

June 21, 2024 by Thor 2 Comments

There seems to be a never ending source of file formats out there. Documenting past obsolete formats, one would assume a point at which there are no more to find, but in reality more are re-discovered everyday by the Digital Preservation community. When it comes to more modern formats, it seems more are invented everyday, too many to keep up with identification. Document one, 10 more pop up, it seems never-ending. Such is the case for scientific formats, including sequencing formats.

I was speaking with a colleague from another institution the other day and a file format was mentioned I hadn’t heard of before. It seems many of their scientific data was stored in a format called FASTA “Fast A” (“fast-aye”). This format specifically stores DNA sequence data and is used quite a bit, it seems. I was even more surprised the next day when I went to process some new submissions for our repository only to find one submission contained three FASTA files. I love researching file formats, but sometimes in order to understand the format structure you have to know something about the content as well. Let’s explore the FASTA and FASTQ file formats. If you would like to take a peek at the Human Genome in FASTA, go here.

Both the FASTA and FASTQ formats are text based and have a simple structure. Identification of each of these should be pretty simple, but to avoid conflicts with other formats, the signature might have to be more complex.

The FASTA format is well documented as many in the scientific community use it. Basically the format starts with the greater than “>” character followed by a description, a new line character, then the sequence. For example:

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

Pretty straight forward, but so much of the format can be variable, a simple signature would clash with too many other formats. There are some rules with what characters can be used in the sequence so it might be possible to limit the signature to only allow certain characters. At first I thought it might only be able to contain the standard characters representative of adenine (A), cytosine (C), guanine (G), and thymine (T), but as it turns out the FASTA format can contain Nucleic Acid Code’s and Amino Acid Code’s. These codes allow more than the four I was expecting, but do limit what can be represented.

Take the NCBI Sequence Viewer for a spin and download some data as FASTA.

The FASTQ format adds more structure and is more limiting, but also presents some challenges. Here is a sample of its structure:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Instead of a greater than symbol, the FASTQ format uses an “@” symbol followed by an identifier. The identifier can be basically anything and as long as needed. There is a newline character followed by the DNA sequence, which is only the four characters I have heard before. It can contain an A, C, G, T, or N. The “N” can represent an unidentified nucleotide or indicate that the software was unable to make a basecall. A newline character again and the “+” symbol. This is place before the fourth line with is a quality score and is the same number of characters as the sequence.

See what I mean when you have to learn about the context of a format in order to make a proper signature!

One of the problems I am left with is how to determine how many of the sequence characters to use in the signature to not have any conflicts. Too few and it might conflict with another format or simple text file. Too many and the signature gets complicated and may exclude a short sequence file. As far as I can tell there is no set minimum or maximum for the sequence. Not sure what the genome for Pinus Taeda would look like in FASTA with 22.18 billion base pairs. The other problem is often times these formats are compressed into a GZIP file, so they need to be extracted before identification.

These two formats are just a couple of the many sequencing formats being used in the bioinformatics community. I am sure others will pop up in the future. Until then, I have with the help of others put together a signature which seems to work well for the samples and data sets we have access to. Take a look at my GitHub for the signature proposal. If you find any issues, let me know!

Final Cut Pro

December 15, 2023 by Thor Leave a comment

When it comes to Digital Preservation, the easiest types of file formats to preserve are often single self contained formats with lots of documentation. There are plenty of formats which break this norm, but a file format like a simple TIFF file is well understood and can stand on its own. The hardest file formats to preserve, I have found, are the complex under documented formats which often show up when you don’t expect them. There is a file format type which indeed makes things difficult. The project format.

There are many software tools out there which generate a “Project”, this is often proprietary and can only be used by the software which created it. Project files are also interdependent, meaning they require other files in known locations in order to be used. This interdependence is often links to images, audio, video, fonts, and other multimedia. The file format itself is just a reference to all the project settings and the paths to the files included in the project. This makes things very difficult to preserve and maintain the complex structure required. Any renaming, removing, or moving the files out of their original order can render the project useless. Many project formats are human readable in XML, or other human readable text, but others are not. I have made a recent attempt to document more Project formats on the File Format Wiki, including many Label and Optical disc project formats, along with updates to Adobe InDesign, QuarkXPress and other desktop publishing project formats. There is still plenty of work needed in other Video and Audio project formats.

Apple computers over the years has created some very powerful software for content creators to use, especially in Video editing. iMovie was used by many home movie editors and iDVD to burn those movies to DVD to share with family and friends, but Apple also sold a professional Video Editing suite which included Final Cut Pro.

Final Cut Pro started life as a Macromedia software tool called KeyGrip which never was released and later bought by Apple. Final Cut Pro was well used and loved by video editors and was given a major upgrade in 2011 to Final Cut Pro X, which was full re-written to be 64-bit. This change included a change to the Project file format. So for version 1 through version 7, Final Cut Pro used a project format with the extension .FCP. Lets take a closer look at the this project format.

hexdump -C Swing.fcp | head
00000000  a2 4b 65 79 47 0a 0d 0a  00 00 00 00 20 fc c5 5b  |.KeyG....... ..[|
00000010  00 de b3 11 d0 93 19 00  05 02 18 66 07 00 00 00  |...........f....|
00000020  03 00 00 00 00 00 00 00  00 01 00 00 00 00 01 00  |................|
00000030  00 00 11 07 73 75 62 74  79 70 65 00 00 00 01 01  |....subtype.....|
00000040  00 00 00 03 00 06 4e 4f  55 4e 44 4f 00 00 00 00  |......NOUNDO....|
00000050  01 01 00 00 00 00 00 00  00 00 00 00 00 07 52 55  |..............RU|
00000060  4e 54 49 4d 45 00 00 00  00 01 01 00 00 00 00 00  |NTIME...........|
00000070  00 00 00 01 07 76 69 65  77 65 72 73 00 00 00 00  |.....viewers....|
00000080  01 01 00 00 00 00 00 00  00 00 00 00 00 00 00 08  |................|
00000090  63 68 69 6c 64 72 65 6e  00 00 00 00 01 01 00 00  |children........|
*
00000e30  00 00 00 00 00 00 00 00  00 00 00 00 00 00 07 8c  |................|
00000e40  b3 2e 56 40 4d 6f 6f 56  54 56 4f 44 00 02 00 02  |..V@MooVTVOD....|
00000e50  00 00 00 11 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000e60  00 00 00 0b 44 61 6e 63  65 20 53 68 6f 74 73 00  |....Dance Shots.|
00000e70  00 01 00 08 00 00 07 8a  00 00 07 84 00 02 00 2f  |.............../|
00000e80  41 54 54 4f 20 52 41 49  44 30 20 47 72 6f 75 70  |ATTO RAID0 Group|
00000e90  3a 54 55 54 4f 52 49 41  4c 3a 44 61 6e 63 65 20  |:TUTORIAL:Dance |
00000ea0  53 68 6f 74 73 3a 49 6e  74 72 6f 2e 6d 6f 76 00  |Shots:Intro.mov.|
00000eb0  00 09 00 a8 00 a8 61 66  70 6d 00 00 00 00 00 03  |......afpm......|
00000ec0  00 18 00 39 00 59 00 75  00 95 00 9e 07 49 4c 31  |...9.Y.u.....IL1|
00000ed0  20 33 72 64 00 00 00 00  00 00 00 00 00 00 00 00  | 3rd............|
00000ee0  00 00 00 00 00 00 00 00  00 00 00 00 00 0f 77 61  |..............wa|
00000ef0  6c 74 d5 73 20 43 6f 6d  70 75 74 65 72 00 00 00  |lt.s Computer...|
00000f00  00 00 00 00 00 00 00 00  00 00 00 00 00 10 41 54  |..............AT|
00000f10  54 4f 20 52 41 49 44 30  20 47 72 6f 75 70 00 00  |TO RAID0 Group..|
00000f20  00 00 00 00 00 00 00 00  00 07 77 73 68 69 72 65  |..........wshire|
00000f30  73 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |s...............|
00000f40  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000f50  00 00 00 00 00 00 00 00  00 00 00 00 ff ff 00 00  |................|
00000f60  00 00 00 00 00 00 00 10  41 54 54 4f 20 52 41 49  |........ATTO RAI|
00000f70  44 30 20 47 72 6f 75 70  00 00 00 00 00 00 00 2b  |D0 Group.......+|
00000f80  00 00 00 01 00 00 00 03  00 00 00 03 54 55 54 4f  |............TUTO|
00000f90  52 49 41 4c 00 44 61 6e  63 65 20 53 68 6f 74 73  |RIAL.Dance Shots|
00000fa0  00 49 6e 74 72 6f 2e 6d  6f 76 00 00 00 00 00 00  |.Intro.mov......|

From the header we can see a remnant of the original KeyGrip software, but later in the file we find some references to files in the Mac HFS path format which includes a colon instead of a slash. These are the paths to the each of the MOV files used in the Project. This file is from the tutorial disk of Final Cut Pro version 1.2, so lets take a look at the last version released, version 7.

hexdump -C Lesson 1 Project.fcp | head
00000000  a2 4b 65 79 47 0a 0d 0a  01 de 00 00 00 20 08 92  |.KeyG........ ..|
00000010  66 c4 28 d7 11 8a e5 00  30 65 ec fe 98 03 00 00  |f.(.....0e......|
00000020  00 00 00 00 00 00 00 00  00 01 00 00 00 00 01 15  |................|
00000030  00 00 00 07 73 75 62 74  79 70 65 01 00 00 00 01  |....subtype.....|
00000040  03 00 00 00 00 06 4e 4f  55 4e 44 4f 00 00 00 00  |......NOUNDO....|
00000050  01 01 00 00 00 00 00 00  00 00 00 00 00 07 52 55  |..............RU|
00000060  4e 54 49 4d 45 00 00 00  00 01 01 00 00 00 00 00  |NTIME...........|
00000070  01 00 00 00 07 76 69 65  77 65 72 73 00 00 00 00  |.....viewers....|
00000080  01 01 00 00 00 00 00 00  00 00 00 00 00 00 00 08  |................|
00000090  63 68 69 6c 64 72 65 6e  00 00 00 00 01 01 01 00  |children........|

Almost identical to the first version, which is helpful for identification, but if we need to identify based on version, it might prove a little more difficult. It appears all the samples I have and have seen reference to all begin with the same 5 hex values, A24B657947, 0xA2 KeyG. It’s hard to know what other hex values might have something to do with versions of the file format. More samples could tell us, but from what I have the 20 bytes starting from offset 12 seems to be consistent among the different version samples. But for now the 5 bytes at the beginning of the file should suffice for identification.

When Final Cut Pro went through a complete re-write in 2011, the FCP format was abandoned. Not only made obsolete, but completely unsupported. The new Final Cut Pro X software was not able to support this now obsolete format. The new format followed the pattern of many other Apple formats of using a folder identified through an extension as a single file. Called a bundle format, Final Cut Pro X used the extension, .FCPBUNDLE. This bundle could include the media assets along with project settings/thumbnails and clips. Because of this “bundle” format, identification would have to be done at the individual file level inside the bundle. This would include formats with extensions such as .flexolibrary and .fcpevent, which appear to be SQLite databases. This complex format makes preservation of this type of object difficult with current methods and practices.

Luckily Apple didn’t leave Final Cut Pro users completely unable to migrate their content. Final Cut Pro could export the project as an XML file. This format is called Final Cut Pro XML Interchange Format and was well documented. The format was not made to bridge the gap from Final Cut Pro to Final Cut Pro X, but rather make the project file more useful outside of Final Cut Pro. Final Cut Pro X actually can’t open these files either, which is why a third party developer came in and developed 7toX (SendtoX) to allow for projects to be converted to a newer XML format.

Lets take a look at the basic Final Cut Pro XML Interchange Format which has a standard XML extension:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xmeml>
<xmeml version="5">
<sequence id="Sequence 1 ">...</sequence>
</xmeml>

Standard XML with a Doctype/root of xmeml. Clever. A little ways into the XML we also see:

<appspecificdata>
	<appname>Final Cut Pro</appname>
	<appmanufacturer>Apple Inc.</appmanufacturer>
	<appversion>7.0</appversion>
</appspecificdata>

Final Cut Pro X also has an XML format which is different than XMEML and has an extension FCPXML:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE fcpxml>

<fcpxml version="1.8">
    <resources>
        <format id="r1" name="FFVideoFormatDV720x480i5994" frameDuration="2002/60000s" fieldOrder="lower first" width="720" height="480" paspH="10" paspV="11" colorSpace="6-1-6 (Rec. 601 (NTSC))"/>
    </resources>
    <library location="file:///Untitled.fcpbundle/">...</library>
</fcpxml>

A different Doctype/root and structure but should be easy to identify.

The preservation of projects files, according to some, is not necessary since they are not the finalized product. Preserving the finalized output would be preferable as it can be managed easier and represent the final render of a project. But identification of the Final Cut Pro project and all the assets gives the option to access a collection more accurately. I was able to create a signature for the FCP, XML, and FCPXML formats. Take a look on my GitHub for the signatures and some test files.