DaVinci Resolve

May 2, 2025 by Thor

A previous post was about LUTs, the little files needed to color grade your photos and videos. One of the best systems for color grading video in use by professionals today is DaVinci Resolve. The system was originally all hardware based, but in 2004, as computers became able to process higher quality video, da Vinci Systems released new digital systems.

Like most professional multimedia editing software, DaVinci Resolve uses projects to manage work. Projects are generally where all the settings are stored, but they don’t generally store the actual media used in the project. Project files are often XML with unique schemas, but others pack a little more into the project file.

hexdump -C project.drp | head
00000000  50 4b 03 04 14 00 08 00  08 00 f2 54 90 5a ef 18  |PK.........T.Z..|
00000010  b0 25 47 0c 00 00 db 1b  00 00 0b 00 00 00 70 72  |.%G...........pr|
00000020  6f 6a 65 63 74 2e 78 6d  6c 9d 58 d9 72 5b 37 12  |oject.xml.X.r[7.|
00000030  7d cf 57 68 f4 7e 4d ec  4b 8a 51 ca b1 92 89 aa  |}.Wh.~M.K.Q.....|
00000040  2c db 65 29 79 9d 6a 00  0d 85 09 45 aa 48 4a 71  |,.e)y.j....E.HJq|
00000050  fe 7e 0e ee 42 51 94 9c  68 c6 29 85 17 0d a0 d1  |.~..BQ..h.).....|
00000060  e8 3e bd 61 fe fd 97 db  e5 c9 03 6f b6 8b f5 ea  |.>.a.......o....|
00000070  bb 53 f9 46 9c 9e f0 2a  af cb 62 75 f3 dd e9 2f  |.S.F...*..bu.../|
00000080  d7 3f 75 e1 f4 fb b3 6f  e6 ff ea ba f3 f4 f6 ee  |.?u....o........|
00000090  ee 57 de 60 55 7c 23 df  98 37 42 48 79 7a 72 9e  |.W.`U|#..7BHyzr.|

DaVinci Resolve keeps all projects in a database, but you can export them to a project file. A DaVinci Resolve Project file uses a ZIP container to store all the project settings in one file. Let’s see what else might be inside.

Path = project.drp
Type = zip
Physical Size = 543860

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2018-02-27 20:25:08 .....      1010030       287793  project.xml
2018-02-27 20:25:08 .....        21173         6856  MediaPool/Master/000_Timelines/MpFolder.xml
2018-02-27 20:25:08 .....       492690        28067  MediaPool/Master/001_Audio/MpFolder.xml
2018-02-27 20:25:08 .....        20177         3588  MediaPool/Master/002_gfx/MpFolder.xml
2018-02-27 20:25:08 .....        11025         2611  MediaPool/Master/003_VO/MpFolder.xml
2018-02-27 20:25:08 .....        98309         7042  MediaPool/Master/004_ScreenCaptures Consolidated/MpFolder.xml
2018-02-27 20:25:08 .....      1278493        66424  MediaPool/Master/005_Video H264/MpFolder.xml
2018-02-27 20:25:08 .....         1995          748  MediaPool/Master/MpFolder.xml
2018-02-27 20:25:08 .....      1638204       137086  SeqContainer/909a0a2c-4183-4310-9f78-6e15c3c59cb4.xml
2018-02-27 20:25:08 .....         8806         1169  Gallery.xml
2018-02-27 20:25:08 .....        12697          696  media.dat
------------------- ----- ------------ ------------  ------------------------
2018-02-27 20:25:08            4593599       542080  11 files

Looks like a lot of XML! The XML files consistently present in all the DRP files are the aptly named “project.xml” along with “Gallery.xml”.

cat project.xml | head
<?xml version="1.0" encoding="UTF-8"?>
<!--DbAppVer="19.1.4.0011" DbPrjVer="14"-->
<SM_Project DbId="db65f2ee-2bff-41cd-b478-f96c26e9609f">
 <FieldsBlob>000000010000000700000026005400650078007400520065006e006400650072004900740065006d005600650063004200410000000c00ffffffff0000002400520065006e0064006500720043006100630068006500560065007200730069006f006e0000000200000000010000001e00500072006f006a00650063007400460065006100740075007200650073000000050000000000000000010000002e00500072006f006a00650063007400440062004d006900670072006100740069006f006e00530074006100740065000000040000000000000000030000002e0049007300500072006f006a0065006300740041006700650049006e004d006900630072006f00530065006300730000000100010000001400470061006c006c0065007200790052006500660000000a000000004800330033003400320034003300380036002d0034006400330030002d0034003600610035002d0061006100340033002d006100330035003200620066006500370038003200640063000000260046007500730069006f006e00530069007a0069006e006700560065007200730069006f006e000000020000000002</FieldsBlob>
 <LockId/>
 <User>86f03abc-9354-47d9-9006-a55b6b1d49cf</User>
 <Folder/>
 <UserId>-1</UserId>
 <SysId>6CB133A11B81</SysId>
 <ProjectId>0</ProjectId>

It appears the version of DaVinci Resolve is pretty important. If you try to open a DRP file without using the most up-to-date software you might run into problems. From what I can see, every time a new major version is released, the updates to the XML cause the project to error when imported. So the version of the DRP file can be a critical piece of metadata for understanding the format. There are some helpful apps created by DaVinci Resolve users you can try, or you can try a little Python script to report back the version used in a DRP or a whole folder of DRP files.
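As a rough illustration of that idea (a minimal sketch, not the script mentioned above), the version can be read with Python’s standard zipfile module. The function names drp_version and report and the 4 KB read size are illustrative choices, and the sketch assumes the DbAppVer value sits near the top of project.xml, as it does in the samples in this post.

```python
import re
import zipfile
from pathlib import Path

# Matches DbAppVer="19.1.4.0011" in the header comment of newer exports and
# DbAppVer="8.2 (#153)" as a root-tag attribute in the version 8 sample.
DBAPPVER = re.compile(rb'DbAppVer="([^"]+)"')


def drp_version(path):
    """Return the DbAppVer string from a DRP/DRT file, or None if not found."""
    try:
        with zipfile.ZipFile(path) as archive:
            with archive.open("project.xml") as member:
                head = member.read(4096)  # assumption: version is near the top
    except (zipfile.BadZipFile, KeyError, OSError):
        return None
    match = DBAPPVER.search(head)
    return match.group(1).decode("utf-8", "replace") if match else None


def report(folder):
    """Map every .drp file under a folder to its reported version."""
    return {str(p): drp_version(p) for p in sorted(Path(folder).rglob("*.drp"))}
```

Calling report("/path/to/projects") returns a dictionary of file paths and version strings, with None for anything that is not a readable ZIP or has no project.xml.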

There is one other file used by the DaVinci Resolve software. It uses the DRT extension and is for exporting and importing single timelines to the software. Like a DRP it is a simple project file that only points to the media used in the project and only stores the settings needed.

Path = timeline.drt
Type = zip
Physical Size = 215159

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2021-04-21 21:16:42 .....        45726         8888  project.xml
2021-04-21 21:16:42 .....       670306       198698  MediaPool/Master/MpFolder.xml
2021-04-21 21:16:42 .....        98268         7089  SeqContainer/7eb849f3-41cb-4e3f-baa8-d5b134b57aa7.xml
------------------- ----- ------------ ------------  ------------------------
2021-04-21 21:16:42             814300       214675  3 files

This DRT file also has a project.xml file, but doesn’t have the Gallery.xml file we normally find in a DRP file. Because the project.xml has the same structure in both formats, the absence of Gallery.xml is what lets us tell the two apart.

cat project.xml |head
<?xml version="1.0" encoding="UTF-8"?>
<!--DbAppVer="17.1.1.0009" DbPrjVer="10"-->
<SM_Project DbId="ec6cb2e2-0b3c-43b8-8f90-a5fcb973af3b">
 <FieldsBlob>00000001000000040000002e00500072006f006a00650063007400440062004d006900670072006100740069006f006e00530074006100740065000000040000000000000000020000002e0049007300500072006f006a0065006300740041006700650049006e004d006900630072006f00530065006300730000000100010000001400470061006c006c0065007200790052006500660000000a000000004800660030003800380038003300390038002d0066006400620037002d0034006300320036002d0061003700310032002d003300360038006200300036003300300065003400330031000000260046007500730069006f006e00530069007a0069006e006700560065007200730069006f006e000000020000000002</FieldsBlob>
 <LockId/>
 <User>04d71873-a504-40c6-bde5-41709691a2c9</User>
 <Folder/>
 <UserId>-1</UserId>
 <SysId>94F6D6F3F60F</SysId>
 <ProjectId>0</ProjectId>

Both formats use the XML root tag “SM_Project”. This can also be used to define a signature for the two formats, since a file named “project.xml” could appear in a different format and we don’t want a false identification.
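The two checks can be combined in a few lines of Python. This is only a sketch under the assumptions drawn from the samples above: the function name classify_resolve_export is hypothetical, the Gallery.xml rule comes from the handful of files examined here, and it has not been tried against version 8 exports.

```python
import zipfile


def classify_resolve_export(path):
    """Return "DRP", "DRT", or None for a candidate DaVinci Resolve export."""
    try:
        with zipfile.ZipFile(path) as archive:
            names = set(archive.namelist())
            if "project.xml" not in names:
                return None
            with archive.open("project.xml") as member:
                head = member.read(4096)
    except (zipfile.BadZipFile, OSError):
        return None
    # The root tag guards against unrelated ZIP files that happen to
    # contain a project.xml.
    if b"<SM_Project" not in head:
        return None
    # Assumption from the samples: only full project exports carry Gallery.xml.
    return "DRP" if "Gallery.xml" in names else "DRT"
```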

I was able to trace the use of the DRP format back to DaVinci Resolve version 9. In version 8, it appears projects are exported using the name and extension “Default Project.resolve.zip”. From what I could find, DaVinci Resolve version 9 was a big rewrite, so it makes sense they settled on a more useful extension. The project.xml file in a version 8 format is slightly different.

cat project.xml | head
<SM_Project DbId="9ba0c4dc-d99c-4b7f-b0da-d254d91e34e2" DbAppVer="8.2 (#153)">
 <LockId></LockId>
 <User>159415b8-7515-43bf-b5f5-00d98949434b</User>
 <UserId>-1</UserId>
 <SysId>7cd1c388ea29</SysId>
 <ProjectId>0</ProjectId>
 <RevivalTaskSetID>-1</RevivalTaskSetID>
 <PlayHeadsSplitDisplay>false</PlayHeadsSplitDisplay>
 <pGallery>
  <Gallery::GyGallery DbId="9884d8ff-096e-4df0-b833-0e75e6e07e15">

It still uses the “SM_Project” root tag, but displays the DbAppVer information differently. It would be good to find more examples from version 8 and earlier to see how this format has evolved over time. For now, I have created a signature you can test if you happen to have any DRP files in your archive.

Scrivener

March 21, 2025 by Thor

Word Processors are everywhere and have some of the most recognizable file formats. Some are very simple in that they just contain plain text, others are more complex. There are formats which allow for images and others which can handle different languages and writing directions.

A writing platform I recently learned about is called Scrivener. It was first released in 2007 by a company called Literature & Latte Ltd, and has a Macintosh and a Windows version. The software is marketed toward writers, as there are some features that help with note taking, research and much more. It also allows for adding multimedia and even full webpages.

This is accomplished by a file format which uses a non-traditional method for storing all the data needed to render the format.

tree Scrivener3-s01.scriv
Scrivener3-s01.scriv
├── Files
│   ├── Data
│   │   ├── 921B4A08-54C0-4B69-94FD-428F56FDAB89
│   │   │   └── content.rtf
│   │   └── docs.checksum
│   ├── binder.autosave
│   ├── binder.backup
│   ├── search.indexes
│   ├── styles.xml
│   ├── version.txt
│   └── writing.history
├── Scrivener3-s01.scrivx
└── Settings
    ├── recents.txt
    ├── ui-common.xml
    └── ui.ini

Scrivener uses a folder structure to store all the data used in the format. The folder has an extension, .scriv. The format includes some rich text, backups, indexes, version history and more. One unique format within the folder is an XML file with the extension .scrivx. This makes the format proprietary, and it can only be rendered using the Scrivener software.

cat Scrivener3-s01.scrivx | head
<?xml version="1.0" encoding="UTF-8"?>
<ScrivenerProject Template="No" Version="2.0" Identifier="DF5DA7F0-27DB-4815-A050-B4D6F23CABA7" Creator="SCRWIN-3.1.5.1" Device="DESKTOP-JMM4K7M" Modified="2025-03-14 22:15:28 -0600" ModID="B4A944C3-FF79-49F6-A737-158BEB4E58BB">
    <Binder>
        <BinderItem UUID="17807D28-117A-409E-B12D-B34922B6CC6F" Type="DraftFolder" Created="2025-03-14 22:15:17 -0600" Modified="2025-03-14 22:15:17 -0600">
            <Title>Draft</Title>
            <MetaData>
                <IncludeInCompile>Yes</IncludeInCompile>
            </MetaData>
            <Children>
                <BinderItem UUID="921B4A08-54C0-4B69-94FD-428F56FDAB89" Type="Text" Created="2025-03-14 22:15:17 -0600" Modified="2025-03-14 22:15:23 -0600">

The XML has enough to tell these files apart from other XML files, so the signature would be straightforward. Earlier versions of Scrivener sometimes have the SCRIVX file, but sometimes have a file with a .scrivproj extension instead. On a Macintosh this file is in a binary plist format, which is different from earlier Windows versions. It seems they may have unified them under version 2 or 3, where versions 1 and 2 for Windows use project version 1 and version 3 uses project version 2.

hexdump -C Scrivener1-s01.scriv/binder.scrivproj | head
00000000  62 70 6c 69 73 74 30 30  d4 00 01 00 02 00 03 00  |bplist00........|
00000010  04 00 05 00 1d 01 d8 01  d9 54 24 74 6f 70 58 24  |.........T$topX$|
00000020  6f 62 6a 65 63 74 73 58  24 76 65 72 73 69 6f 6e  |objectsX$version|
00000030  59 24 61 72 63 68 69 76  65 72 dc 00 06 00 07 00  |Y$archiver......|
00000040  08 00 09 00 0a 00 0b 00  0c 00 0d 00 0e 00 0f 00  |................|
00000050  10 00 11 00 12 00 13 00  14 00 15 00 16 00 17 00  |................|
00000060  18 00 19 00 1a 00 15 00  1b 00 1c 5a 4c 61 62 65  |...........ZLabe|
00000070  6c 54 69 74 6c 65 59 4c  61 62 65 6c 4c 69 73 74  |lTitleYLabelList|
00000080  5e 42 69 6e 64 65 72 43  6f 6e 74 65 6e 74 73 5f  |^BinderContents_|
00000090  10 0f 44 65 66 61 75 6c  74 4c 61 62 65 6c 54 61  |..DefaultLabelTa|
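To make the two variants concrete, here is a small Python sketch that looks inside a .scriv folder and reports which kind of project file it finds. The function name describe_scriv and the returned dictionary keys are hypothetical; the root tag, attributes and “bplist00” header are the ones visible in the samples above.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def describe_scriv(folder):
    """Report what kind of project file sits inside a .scriv folder."""
    folder = Path(folder)
    for candidate in sorted(folder.glob("*.scrivx")):
        try:
            root = ET.parse(str(candidate)).getroot()
        except ET.ParseError:
            continue  # not well-formed XML, keep looking
        if root.tag == "ScrivenerProject":
            return {
                "file": candidate.name,
                "kind": "scrivx",
                "version": root.get("Version"),
                "creator": root.get("Creator"),
            }
    for candidate in sorted(folder.glob("*.scrivproj")):
        with open(candidate, "rb") as handle:
            # Binary property lists start with the ASCII text "bplist00".
            if handle.read(8) == b"bplist00":
                return {"file": candidate.name, "kind": "binary plist"}
    return None
```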

Since the developers of Scrivener decided to make the SCRIV format simply a folder with different content within, something special happens on the MacOS. The Scrivener software registers all the extensions it uses with the MacOS launch services. This process then changes the way the SCRIV folder is displayed in the MacOS Finder. It now appears as a single file and is given a file type. This is called a Document Package format.

By right-clicking on the “file” you can then browse the package contents. There is nothing in the folder itself or hidden in any attributes which causes this to happen; it is all controlled by what extensions have been registered with the launch services database. We can however ask the MacOS to give us some extended metadata details about the package, as long as the file is on an Apple filesystem like HFS or APFS.

mdls Scrivener3-s01.scriv 
_kMDItemDisplayNameWithExtensions      = "Scrivener3-s01.scriv"
kMDItemContentCreationDate             = 2025-03-15 04:15:17 +0000
kMDItemContentCreationDate_Ranking     = 2025-03-15 00:00:00 +0000
kMDItemContentModificationDate         = 2025-03-15 04:15:18 +0000
kMDItemContentModificationDate_Ranking = 2025-03-15 00:00:00 +0000
kMDItemContentType                     = "com.literatureandlatte.scrivener3.scriv"
kMDItemContentTypeTree                 = (
    "com.literatureandlatte.scrivener3.scriv",
    "public.directory",
    "public.item",
    "com.apple.package",
    "public.content",
    "public.composite-content"
)
kMDItemDateAdded                       = 2025-03-21 04:38:48 +0000
kMDItemDateAdded_Ranking               = 2025-03-21 00:00:00 +0000
kMDItemDisplayName                     = "Scrivener3-s01.scriv"
kMDItemDocumentIdentifier              = 0
kMDItemFSContentChangeDate             = 2025-03-15 04:15:18 +0000
kMDItemFSCreationDate                  = 2025-03-15 04:15:17 +0000
kMDItemFSCreatorCode                   = ""
kMDItemFSFinderFlags                   = 0
kMDItemFSHasCustomIcon                 = (null)
kMDItemFSInvisible                     = 0
kMDItemFSIsExtensionHidden             = 0
kMDItemFSIsStationery                  = (null)
kMDItemFSLabel                         = 0
kMDItemFSName                          = "Scrivener3-s01.scriv"
kMDItemFSNodeCount                     = 3
kMDItemFSOwnerGroupID                  = 20
kMDItemFSOwnerUserID                   = 501
kMDItemFSSize                          = 31155
kMDItemFSTypeCode                      = ""
kMDItemInterestingDate_Ranking         = 2025-03-15 00:00:00 +0000
kMDItemKind                            = "Scrivener Project"
kMDItemLogicalSize                     = 31155
kMDItemPhysicalSize                    = 69632

There are a lot of additional details available using the MDLS command, including the content type of “com.apple.package”. This tool works with any file in MacOS and can be very useful for getting all the information you may need for preservation.
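If you want to check many folders at once, the same output can be read from Python. The sketch below is an illustration only: the function names content_type_tree and is_package are made up for this example, the parser assumes the text layout shown above, and is_package will only run on a Mac because it calls mdls.

```python
import subprocess


def content_type_tree(mdls_output):
    """Extract the kMDItemContentTypeTree entries from mdls text output."""
    entries, inside = [], False
    for line in mdls_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("kMDItemContentTypeTree"):
            inside = True
            continue
        if inside:
            if stripped.startswith(")"):
                break
            entries.append(stripped.rstrip(",").strip('"'))
    return entries


def is_package(path):
    """Run mdls (MacOS only) and check for the com.apple.package type."""
    result = subprocess.run(
        ["mdls", "-name", "kMDItemContentTypeTree", str(path)],
        capture_output=True, text=True, check=True,
    )
    return "com.apple.package" in content_type_tree(result.stdout)
```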

Until the tools we use for format identification can recognize package formats, tools like this may be needed to gather the necessary metadata for preservation. But in the meantime, identification of the package content is the best we can hope for. Creating a signature for the XML based SCRIVX format is the first step.

Stay tuned for more on the package format, as I will be bringing it up more in the Digital Preservation community.

Daisy

October 4, 2024 by Thor

A single file can often be self contained, having all that is needed to render itself with the correct software, but more and more often files need other files to function properly. Sometimes these groups of dependent files are within a container, such as a DOCX or ePub, but can also be found all sitting nicely in a folder. I say nicely, partly because the structure works, that is until they are treated as individual files and renamed or moved around breaking that interdependence on each other.

In the case of many Apple bundle files, they appear to be a single file when used on the MacOS, but a folder on Windows or Linux. This can be very confusing. In other cases, such as the DAISY Digital Talking Book format, it is simply a folder or disc with a few or many files within.

Current tools used to identify file formats, such as DROID, look at individual files, not groups of files to determine format. Each file within a folder may have a unique format, but when grouped with other specific formats they become something more. We will have to work on enhancing current tools if we want to avoid breaking these format types and losing their ability to render properly.

DAISY, or Digital Accessible Information System, is a type of Digital Book. The format was originally conceived in 1988 as a method to create a talking book, designed to give those who are visually impaired the ability to listen to books. It wasn’t until 1996 that the DAISY Consortium was created in order to take the technology to those who needed it. The original version of the DAISY format in 1994 was proprietary, but once they formed the consortium, they decided to adopt open standards for the format, and in 1998 the DAISY 2.0 standard was released. You can read more on the Library of Congress Format Description page.

Let’s take a look at a folder containing a DAISY 2.0 book.

ls -la "DAISY 2.02 export"
total 536
drwx------  1 tyler  staff   16384 Sep 25 22:06 .
drwx------  1 tyler  staff   16384 Sep 25 22:06 ..
-rwx------@ 1 tyler  staff    1090 Sep 25 22:05 0002.smil
-rwx------  1 tyler  staff  228413 Sep 25 22:05 aud0001.mp3
-rwx------@ 1 tyler  staff     672 Sep 25 22:05 master.smil
-rwx------  1 tyler  staff    1703 Sep 25 22:05 ncc.html

We can see three different formats in this folder. The obvious well known MP3 files and an HTML file. We also see two files with the extension SMIL.

“Synchronized Multimedia Integration Language” or SMIL is a W3C XML standard used to describe multimedia presentations. Now in its third version, it is used in the DAISY DTB as well as other applications, but we will focus on DAISY. A SMIL file has this structure:

<?xml version="1.0"?>
<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 1.0//EN" "http://www.w3.org/TR/REC-smil/SMIL10.dtd">
<smil>
  <head>
    <meta name="dc:title" content="Obi Project" />
    <meta name="dc:identifier" content="589c550e-303b-4c0d-9921-ae76d782fd53" />
    <meta name="ncc:generator" content="Obi v5.0.0.0 with toolkit: UrakawaSDK.core v2.0.0.0 (http://urakawa.sf.net/obi)" />
    <meta name="dc:format" content="Daisy 2.02" />
    <meta name="ncc:timeInThisSmil" content="00:00:28" />
    <layout>
      <region id="textView" />
    </layout>
  </head>
  <body>
    <ref title="Testing" src="0002.smil" id="ms_0002" />
  </body>
</smil>

A standard XML file with a link to a SMIL DTD and a root tag of <smil>. This format is recognized by PRONOM as fmt/205, although it is often identified as a standard XML file. It seems the signature was created with a small offset which works with some SMIL files, but the gap between the end of the XML declaration and the start of the <smil> tag is only 20-86 bytes, not enough to allow for different character sets and full DTD URLs. We will have to increase this gap in order to get all the SMIL files identified correctly.
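The effect of a wider gap can be tried out in Python before touching the real signature. This is a sketch, not the PRONOM signature itself: the 1024-byte window and the name looks_like_smil are illustrative assumptions, and the pattern only handles ASCII-compatible encodings such as UTF-8.

```python
import re

# Assumption: allow up to 1024 bytes between the XML declaration and the
# root tag. The value is illustrative, not a tested PRONOM offset.
SMIL_ROOT = re.compile(
    rb"\A(?:\xef\xbb\xbf)?\s*<\?xml[^>]*\?>.{0,1024}?<smil[\s>]",
    re.DOTALL,
)


def looks_like_smil(data):
    """True if the bytes start with an XML declaration followed by a <smil> root."""
    return SMIL_ROOT.search(data[:2048]) is not None
```

A long DOCTYPE line, like the one in the sample above, fits comfortably inside the wider window, while an XML file with a different root tag does not match.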

With this update all the files in a DAISY 2.0 book should be identified individually, but as a set of files they make up the DAISY 2.0 format. This format requires the ncc.html file be present at the root of the folder or CD, so this file will aid in the manual identification of this format.

DAISY 3 was released in 2002 and standardized as ANSI/NISO Z39.86-2002. It has been revised a couple of times, with the current revision being 2012. This update adds more functionality to the format, with many new optional and required formats/files included in the folder. Here is a simple example:

ls -la "DAISY3 Export"
total 784
drwx------  1 tyler  staff   16384 Sep 25 22:06 .
drwx------  1 tyler  staff   16384 Sep 25 22:06 ..
-rwx------@ 1 tyler  staff     979 Sep 25 22:05 0001.smil
-rwx------  1 tyler  staff  228413 Sep 25 22:05 aud0001.mp3
-rwx------  1 tyler  staff    1014 Sep 25 22:05 navigation.ncx
-rwx------  1 tyler  staff    1881 Sep 25 22:05 package.opf
-rwx------  1 tyler  staff    7838 Nov  2  2020 tpbnarrator.res
-rwx------  1 tyler  staff  117656 Nov  2  2020 tpbnarrator_res.mp3

The SMIL format is still included, along with MP3s, but we have some additional formats. The NCX or “Navigation Control File”, the OPF or “Package file”, and the RES or “Resource file” are a few of them. The NCX file is the first file accessed, as it lays out the navigation for the whole DTB. It is also XML:

cat "DAISY3 Export/navigation.ncx"
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE ncx PUBLIC "-//NISO//DTD ncx 2005-1//EN" "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">
<ncx
	version="2005-1"
	xml:lang="en-US" xmlns="http://www.daisy.org/z3986/2005/ncx/">

This file is only recognized by DROID as a standard XML file. It probably should have a unique identification like SMIL, and with a root tag of <ncx>, that should be fairly easy to add.

The Package file, with the extension OPF, is actually a format used by the openebook group, not to be confused with a format used by the Open Preservation Foundation 🤣. The Open Packaging Format is used, and a DTB conforming to this standard must include exactly one Package File, which must be a valid XML 1.0 document conforming to the OEBF Publication Structure 1.2 package.

cat "DAISY3 Export/package.opf"
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE package PUBLIC "+//ISBN 0-9673008-1-9//DTD OEB 1.2 Package//EN" "http://openebook.org/dtds/oeb-1.2/oebpkg12.dtd">
<package
	unique-identifier="uid" xmlns="http://openebook.org/namespaces/oeb-package/1.0/">
	<metadata>
		<dc-metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oebpackage="http://openebook.org/namespaces/oeb-package/1.0/">
			<dc:Identifier
				id="uid">589c550e-303b-4c0d-9921-ae76d782fd53</dc:Identifier>
			<dc:Format>ANSI/NISO Z39.86-2005</dc:Format>
			<dc:Title>Obi Project</dc:Title>
			<dc:Publisher>N/A</dc:Publisher>
			<dc:Language>en-US</dc:Language>
			<dc:Creator>Creator name</dc:Creator>
			<dc:Date>2024-09-25</dc:Date>
		</dc-metadata>

The OPF format is also unknown to PRONOM, so these files identify as standard XML as well. The root tag of “<package>” could be used elsewhere, so the signature may need to reference the OEB package information.

The RES Resource file is also a standard XML and can be identified through its root tag of “<resources>” and resources DOCTYPE.

cat "DAISY3 Export/tpbnarrator.res"
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE resources PUBLIC "-//NISO//DTD resource 2005-1//EN" "http://www.daisy.org/z3986/2005/resource-2005-1.dtd" []>
<resources xmlns="http://www.daisy.org/z3986/2005/resource/" version="2005-1">
  
  <!-- SKIPPABLE NCX -->
  
  <scope nsuri="http://www.daisy.org/z3986/2005/ncx/">
    <nodeSet id="ns001" select="//smilCustomTest[@bookStruct='LINE_NUMBER']">
      <resource xml:lang="en" id="r001">
        <text>Row</text>
        <audio src="tpbnarrator_res.mp3" clipBegin="0:00:02.379" clipEnd="0:00:03.416" />
      </resource>
    </nodeSet>
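Since all of these XML formats carry a DOCTYPE, one way to prototype their identification is to match on the public identifier rather than the root tag alone. The sketch below is a rough illustration: the function name identify_by_doctype and the labels are hypothetical, and the marker strings are simply copied from the samples in this post, not from any published signature.

```python
import re

# Public identifiers copied from the samples in this post. Treating them as
# signatures is a working assumption, not a published PRONOM entry.
DOCTYPE_MARKERS = {
    b"-//W3C//DTD SMIL": "SMIL",
    b"-//NISO//DTD ncx": "DAISY NCX",
    b"//DTD OEB 1.2 Package//": "OEB Package (OPF)",
    b"-//NISO//DTD resource": "DAISY Resource (RES)",
}

DOCTYPE = re.compile(rb"<!DOCTYPE\s+(\w+)\s+PUBLIC\s+[\"']([^\"']+)[\"']")


def identify_by_doctype(data):
    """Return (label, root element name) from the DOCTYPE, or None."""
    match = DOCTYPE.search(data[:4096])
    if not match:
        return None
    root, public_id = match.group(1), match.group(2)
    for marker, label in DOCTYPE_MARKERS.items():
        if marker in public_id:
            return label, root.decode("ascii")
    return None
```

Matching on the public identifier avoids the problem of a generic root tag like “<package>” turning up in unrelated XML.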

Now, adding these DAISY 3.0 formats will greatly improve the identification of this complex format. But we run into a problem with some of the software that generates these DAISY files: some of it includes files that are not required by the format but are used by the software itself. This can include CSS files for formatting, additional XML, XSL files, DTDs, and, for DAISY files created by the PlexTalk software, additional project files.

ls -la MasterCD/AfterBuild 
total 7520
drwx------@ 1 tyler  staff    16384 Sep 24 19:34 .
drwx------@ 1 tyler  staff    16384 Sep 25 22:11 ..
-rwx------@ 1 tyler  staff     6688 Sep 25 01:32 ImdPhrInfo.imph
-rwx------@ 1 tyler  staff     3773 Sep 25 01:32 ImdTxtTabl.imtt
-rwx------@ 1 tyler  staff     1276 Sep 25 01:32 Ncc.imdn
-rwx------@ 1 tyler  staff  3716618 Sep 25 01:32 a000001.mp3
-rwx------@ 1 tyler  staff     4352 Sep 25 01:32 ncc.html
-rwx------@ 1 tyler  staff     1015 Sep 25 01:32 ptk000001.smil
-rwx------@ 1 tyler  staff      938 Sep 25 01:32 ptk000002.smil

The ncc.html file is here, indicating a DAISY 2.0 format, along with an MP3 and SMIL files, but including some additional formats.

In addition, when creating a project, four files with the extensions Ncc.imdn, ImdPhrInfo.imph, ImdTxtTabl.imtt, and METADATA.ini are automatically created. These files are called “Plextalk project files.” They store table of contents information, etc. (Plextalk project files generated by older versions of this product do not have METADATA.ini.)
http://www.plextalk.com/jp/dw_data/PRSStd/PLEX_RS_UM.html

These four files may not be crucial to the playing of the Daisy format, but they are important to the PlexTalk software.

hexdump -C ImdPhrInfo.imph | head
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000020  ff ff ff ff ff ff ff ff  00 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  f0 a3 0d 00 00 00 00 00  |................|
00000040  a3 06 00 00 a4 06 00 00  00 00 00 00 53 00 00 00  |............S...|
00000050  ff ff ff ff 01 00 00 00  03 00 00 00 00 00 00 00  |................|
00000060  00 00 00 00 00 00 00 00  c5 11 00 00 20 1a 00 00  |............ ...|
00000070  e5 2b 00 00 00 00 00 00  63 00 00 00 ff ff ff ff  |.+......c.......|
00000080  02 00 00 00 04 00 00 00  00 00 00 00 00 00 00 00  |................|
00000090  00 00 00 00 e5 2b 00 00  d6 0b 00 00 bb 37 00 00  |.....+.......7..|

hexdump -C ImdTxtTabl.imtt | head 
00000000  17 00 00 00 32 30 30 34  2f 30 35 2f 33 31 2f 31  |....2004/05/31/1|
00000010  36 3a 36 3a 34 37 2e 30  30 30 00 03 00 00 00 65  |6:6:47.000.....e|
00000020  6e 00 0b 00 00 00 69 73  6f 2d 38 38 35 39 2d 31  |n.....iso-8859-1|
00000030  00 0d 00 00 00 5a 3a 2f  42 6f 6f 6b 44 69 72 34  |.....Z:/BookDir4|
00000040  2f 00 0d 00 00 00 5a 3a  2f 42 6f 6f 6b 44 69 72  |/.....Z:/BookDir|
00000050  34 2f 00 0c 00 00 00 61  30 30 30 30 30 31 2e 6d  |4/.....a000001.m|
00000060  70 33 00 0c 00 00 00 61  30 30 30 30 30 31 2e 6d  |p3.....a000001.m|
*
00000980  70 33 00 08 00 00 00 48  65 61 64 69 6e 67 00 01  |p3.....Heading..|
00000990  00 00 00 00 08 00 00 00  48 65 61 64 69 6e 67 00  |........Heading.|

hexdump -C Ncc.imdn | head       
00000000  01 ff 00 ff c4 00 00 00  3c 00 00 00 2c 00 00 00  |........<...,...|
00000010  14 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  00 00 00 00 49 6d 64 54  78 74 54 61 62 6c 2e 69  |....ImdTxtTabl.i|
00000030  6d 74 74 00 00 00 00 00  00 00 00 00 00 00 00 00  |mtt.............|
00000040  00 00 00 00 49 6d 64 50  68 72 49 6e 66 6f 2e 69  |....ImdPhrInfo.i|
00000050  6d 70 68 00 00 00 00 00  00 00 00 00 00 00 00 00  |mph.............|
00000060  00 00 00 00 04 00 00 00  00 fa 00 00 44 ac 00 00  |............D...|
00000070  01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000080  00 00 00 00 01 00 00 00  08 00 00 00 12 00 00 00  |................|
00000090  03 00 00 00 00 00 00 00  01 00 00 00 ff ff ff ff  |................|

I don’t have a METADATA.ini file to research, but I will be honest, these PlexTalk files will be hard to identify from their contents.

Looking at the IMPH file, there aren’t a lot of bytes which might serve as magic bytes for the format. But I do see some patterns. The first 40 bytes all seem to be the same.

00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 FFFFFFFF FFFFFFFF

But making a signature from only 00 and FF might clash with other formats. It does appear that the 4 bytes FFFFFFFF then recur with 40 bytes between each occurrence. This precision might be good enough if we repeat it a couple of times.
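Here is how that pattern might look as a quick Python check. It is a sketch based only on the single dump shown above, where the 40-byte header is followed by FFFFFFFF at offsets 0x50 and 0x7C (40 bytes of other data before each one). The function name looks_like_imph and the default of two repeats are illustrative choices.

```python
HEADER = b"\x00" * 32 + b"\xff" * 8
MARKER = b"\xff" * 4
STRIDE = 44  # 40 bytes of data plus the 4-byte marker, per the one dump above


def looks_like_imph(data, repeats=2):
    """Heuristic check for a PlexTalk ImdPhrInfo.imph file."""
    if not data.startswith(HEADER):
        return False
    offset = len(HEADER) + 40  # first marker expected at 0x50
    for _ in range(repeats):
        if data[offset:offset + 4] != MARKER:
            return False
        offset += STRIDE
    return True
```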

The IMTT file is different. It appears to have information on the name, character set and all the files in the Daisy package. The first 4 bytes in my 14 samples are either 17000000 or 18000000. Not knowing what the 17 or 18 refers to, I am hesitant to use it for identification. In between some of the data there are some consistent bytes, but at different offsets.


hexdump -C ImdTxtTabl.imtt | head
00000000  18 00 00 00 54 69 74 6c  65 00 35 39 2d 31 00 31  |....Title.59-1.1|
00000010  35 3a 35 34 3a 35 39 2e  32 36 30 00 03 00 00 00  |5:54:59.260.....|
00000020  65 6e 00 0b 00 00 00 69  73 6f 2d 38 38 35 39 2d  |en.....iso-8859-|
00000030  31 00 01 00 00 00 00 01  00 00 00 00 01 00 00 00  |1...............|
00000040  00 01 00 00 00 00 01 00  00 00 00 01 00 00 00 00  |................|
00000050  01 00 00 00 00 01 00 00  00 00 0c 00 00 00 4d 61  |..............Ma|
00000060  72 69 6f 6e 20 53 79 6d  65 00 28 00 00 00 4d 69  |rion Syme.(...Mi|
00000070  6e 75 74 65 73 20 6f 66  20 74 68 65 20 43 6f 6d  |nutes of the Com|
00000080  6d 69 74 74 65 65 20 4d  65 65 74 69 6e 67 20 32  |mittee Meeting 2|
00000090  34 30 35 30 34 00 08 00  00 00 48 65 61 64 69 6e  |40504.....Headin|

hexdump -C ImdTxtTabl.imtt | head
00000000  17 00 00 00 32 30 30 34  2f 30 35 2f 33 31 2f 31  |....2004/05/31/1|
00000010  36 3a 36 3a 34 37 2e 30  30 30 00 03 00 00 00 65  |6:6:47.000.....e|
00000020  6e 00 0b 00 00 00 69 73  6f 2d 38 38 35 39 2d 31  |n.....iso-8859-1|
00000030  00 0d 00 00 00 5a 3a 2f  42 6f 6f 6b 44 69 72 34  |.....Z:/BookDir4|
00000040  2f 00 0d 00 00 00 5a 3a  2f 42 6f 6f 6b 44 69 72  |/.....Z:/BookDir|
00000050  34 2f 00 0c 00 00 00 61  30 30 30 30 30 31 2e 6d  |4/.....a000001.m|
00000060  70 33 00 0c 00 00 00 61  30 30 30 30 30 31 2e 6d  |p3.....a000001.m|

Not sure what any of it means, but might be good enough for a signature.
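One possible reading of the two dumps, offered as an untested guess rather than anything from PlexTalk documentation: the leading 17000000 or 18000000 may be a little-endian 32-bit length for the string that follows. In the second dump, 0x17 is 23, which is exactly “2004/05/31/16:6:47.000” plus its terminating 00, and the same pattern seems to repeat for “en” (03000000) and “iso-8859-1” (0B000000). The Python sketch below reads the file on that assumption; the function name read_prefixed_strings is hypothetical.

```python
import struct


def read_prefixed_strings(data, limit=8):
    """Read up to `limit` length-prefixed strings from the start of the data."""
    strings, offset = [], 0
    while len(strings) < limit and offset + 4 <= len(data):
        (length,) = struct.unpack_from("<I", data, offset)
        offset += 4
        if length == 0 or offset + length > len(data):
            break
        field = data[offset:offset + length]
        # Keep only the text before the first NUL; the first sample shows
        # what looks like leftover bytes after it.
        strings.append(field.split(b"\x00", 1)[0].decode("latin-1"))
        offset += length
    return strings
```

If the guess holds, a signature could test that the first length is plausible and that the third string is a known character set name, instead of relying on the 17 or 18 alone.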

Now the IMDN files might be a little easier:

hexdump -C Ncc.imdn | head
00000000  01 ff 00 ff d4 00 00 00  3c 00 00 00 2c 00 00 00  |........<...,...|
00000010  14 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  00 00 00 00 49 6d 64 54  78 74 54 61 62 6c 2e 69  |....ImdTxtTabl.i|
00000030  6d 74 74 00 00 00 00 00  00 00 00 00 00 00 00 00  |mtt.............|
00000040  00 00 00 00 49 6d 64 50  68 72 49 6e 66 6f 2e 69  |....ImdPhrInfo.i|
00000050  6d 70 68 00 00 00 00 00  00 00 00 00 00 00 00 00  |mph.............|
00000060  00 00 00 00 04 00 00 00  00 7d 00 00 22 56 00 00  |.........}.."V..|
00000070  01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000080  00 00 00 00 01 00 00 00  28 00 00 00 28 00 00 00  |........(...(...|
00000090  00 00 00 00 00 00 00 00  28 00 00 00 ff ff ff ff  |........(.......|

This format directly names the two other formats. Should be easy to look for the two file names in the header. The NCC html file in Daisy 2.0 and the NCX xml file in Daisy 3.0 are directory files so it makes sense this file would do the same.
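A check along those lines is short enough to sketch in Python. It assumes, based on the two samples shown, that both file names appear as NUL-terminated text within the first 0x60 bytes; the function name looks_like_imdn is made up for this example.

```python
def looks_like_imdn(data):
    """Heuristic check for a PlexTalk Ncc.imdn file."""
    header = bytes(data[:0x60])
    return (
        b"ImdTxtTabl.imtt\x00" in header
        and b"ImdPhrInfo.imph\x00" in header
    )
```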

Not sure if these signatures will hold up over time, but they are a start. It would be nice if all the files we are given to preserve would have convenient static magic bytes, but alas, many do not and we have to guess.

These Daisy formats illustrate a problem in preservation that doesn’t quite have a good solution. Each of these files is individually unique and can be identified, but as a whole they represent another unique format. Tying formats together to capture their interdependence will be no small task, but it will be necessary, not only for understanding the format but also to keep the files from being separated, renamed, or rearranged in ways that break that interdependence.
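Until identification tools can do this, a folder-level check can at least flag which files belong together. The Python sketch below is a rough heuristic and not a validator: the function name classify_daisy_folder is hypothetical, and the rules (ncc.html at the root for DAISY 2, an OPF plus an NCX for DAISY 3, matching names case-insensitively) are assumptions drawn from the samples in this post.

```python
from pathlib import Path

PLEXTALK_SUFFIXES = {".imdn", ".imph", ".imtt"}


def classify_daisy_folder(folder):
    """Guess the DAISY flavour of a folder from the files at its root."""
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    lower_names = {p.name.lower() for p in files}
    suffixes = {p.suffix.lower() for p in files}
    if "ncc.html" in lower_names:
        label = "DAISY 2.x"
    elif ".opf" in suffixes and ".ncx" in suffixes:
        label = "DAISY 3"
    else:
        return None
    # Flag PlexTalk project files so they are kept with the book, not split off.
    extras = sorted(p.name for p in files if p.suffix.lower() in PLEXTALK_SUFFIXES)
    return {
        "format": label,
        "members": sorted(p.name for p in files),
        "plextalk_files": extras,
    }
```

The returned member list could be stored alongside the individual file identifications, so the set is documented even if the files are later moved.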

I have added the update to SMIL and new signatures for the other formats to my GitHub repository. Feel free to test and change if you find additional samples or information.

Final Cut Pro

December 15, 2023 by Thor

When it comes to Digital Preservation, the easiest types of file formats to preserve are often single self contained formats with lots of documentation. There are plenty of formats which break this norm, but a file format like a simple TIFF file is well understood and can stand on its own. The hardest file formats to preserve, I have found, are the complex under documented formats which often show up when you don’t expect them. There is a file format type which indeed makes things difficult. The project format.

There are many software tools out there which generate a “Project”. The format is often proprietary and can only be used by the software which created it. Project files are also interdependent, meaning they require other files in known locations in order to be used. This interdependence usually takes the form of links to images, audio, video, fonts, and other multimedia. The file format itself is just a record of the project settings and the paths to the files included in the project. This makes the complex structure very difficult to preserve and maintain. Renaming, removing, or moving the files out of their original locations can render the project useless. Many project formats are human readable in XML, or other human readable text, but others are not. I have made a recent attempt to document more Project formats on the File Format Wiki, including many Label and Optical disc project formats, along with updates to Adobe InDesign, QuarkXPress and other desktop publishing project formats. There is still plenty of work needed in other Video and Audio project formats.

Over the years Apple has created some very powerful software for content creators to use, especially in Video editing. iMovie was used by many home movie editors, and iDVD was used to burn those movies to DVD to share with family and friends, but Apple also sold a professional Video Editing suite which included Final Cut Pro.

Final Cut Pro started life as a Macromedia software tool called KeyGrip, which was never released and was later bought by Apple. Final Cut Pro was well used and loved by video editors and was given a major upgrade in 2011 to Final Cut Pro X, which was fully re-written to be 64-bit. This change included a change to the Project file format. So for version 1 through version 7, Final Cut Pro used a project format with the extension .FCP. Let’s take a closer look at this project format.

hexdump -C Swing.fcp | head
00000000  a2 4b 65 79 47 0a 0d 0a  00 00 00 00 20 fc c5 5b  |.KeyG....... ..[|
00000010  00 de b3 11 d0 93 19 00  05 02 18 66 07 00 00 00  |...........f....|
00000020  03 00 00 00 00 00 00 00  00 01 00 00 00 00 01 00  |................|
00000030  00 00 11 07 73 75 62 74  79 70 65 00 00 00 01 01  |....subtype.....|
00000040  00 00 00 03 00 06 4e 4f  55 4e 44 4f 00 00 00 00  |......NOUNDO....|
00000050  01 01 00 00 00 00 00 00  00 00 00 00 00 07 52 55  |..............RU|
00000060  4e 54 49 4d 45 00 00 00  00 01 01 00 00 00 00 00  |NTIME...........|
00000070  00 00 00 01 07 76 69 65  77 65 72 73 00 00 00 00  |.....viewers....|
00000080  01 01 00 00 00 00 00 00  00 00 00 00 00 00 00 08  |................|
00000090  63 68 69 6c 64 72 65 6e  00 00 00 00 01 01 00 00  |children........|
*
00000e30  00 00 00 00 00 00 00 00  00 00 00 00 00 00 07 8c  |................|
00000e40  b3 2e 56 40 4d 6f 6f 56  54 56 4f 44 00 02 00 02  |..V@MooVTVOD....|
00000e50  00 00 00 11 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000e60  00 00 00 0b 44 61 6e 63  65 20 53 68 6f 74 73 00  |....Dance Shots.|
00000e70  00 01 00 08 00 00 07 8a  00 00 07 84 00 02 00 2f  |.............../|
00000e80  41 54 54 4f 20 52 41 49  44 30 20 47 72 6f 75 70  |ATTO RAID0 Group|
00000e90  3a 54 55 54 4f 52 49 41  4c 3a 44 61 6e 63 65 20  |:TUTORIAL:Dance |
00000ea0  53 68 6f 74 73 3a 49 6e  74 72 6f 2e 6d 6f 76 00  |Shots:Intro.mov.|
00000eb0  00 09 00 a8 00 a8 61 66  70 6d 00 00 00 00 00 03  |......afpm......|
00000ec0  00 18 00 39 00 59 00 75  00 95 00 9e 07 49 4c 31  |...9.Y.u.....IL1|
00000ed0  20 33 72 64 00 00 00 00  00 00 00 00 00 00 00 00  | 3rd............|
00000ee0  00 00 00 00 00 00 00 00  00 00 00 00 00 0f 77 61  |..............wa|
00000ef0  6c 74 d5 73 20 43 6f 6d  70 75 74 65 72 00 00 00  |lt.s Computer...|
00000f00  00 00 00 00 00 00 00 00  00 00 00 00 00 10 41 54  |..............AT|
00000f10  54 4f 20 52 41 49 44 30  20 47 72 6f 75 70 00 00  |TO RAID0 Group..|
00000f20  00 00 00 00 00 00 00 00  00 07 77 73 68 69 72 65  |..........wshire|
00000f30  73 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |s...............|
00000f40  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000f50  00 00 00 00 00 00 00 00  00 00 00 00 ff ff 00 00  |................|
00000f60  00 00 00 00 00 00 00 10  41 54 54 4f 20 52 41 49  |........ATTO RAI|
00000f70  44 30 20 47 72 6f 75 70  00 00 00 00 00 00 00 2b  |D0 Group.......+|
00000f80  00 00 00 01 00 00 00 03  00 00 00 03 54 55 54 4f  |............TUTO|
00000f90  52 49 41 4c 00 44 61 6e  63 65 20 53 68 6f 74 73  |RIAL.Dance Shots|
00000fa0  00 49 6e 74 72 6f 2e 6d  6f 76 00 00 00 00 00 00  |.Intro.mov......|

From the header we can see a remnant of the original KeyGrip software, but later in the file we find some references to files in the Mac HFS path format, which uses a colon instead of a slash as the separator. These are the paths to each of the MOV files used in the Project. This file is from the tutorial disk of Final Cut Pro version 1.2, so let’s take a look at the last version released, version 7.
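As a small illustration of the colon-separated path seen in the dump, here is a minimal Python sketch that splits an HFS-style path into its components and rebuilds it in POSIX form. The function name `hfs_to_posix` and the `/Volumes` mount root are assumptions made for the example (the first component of an HFS path is the volume name), and it ignores edge cases such as a slash inside a classic Mac file name.

```python
from pathlib import PurePosixPath


def hfs_to_posix(hfs_path: str, volumes_root: str = "/Volumes") -> PurePosixPath:
    """Convert 'Volume:Folder:File' into '<volumes_root>/Volume/Folder/File'."""
    # Drop empty components, e.g. from a trailing colon on a folder path.
    parts = [part for part in hfs_path.split(":") if part]
    return PurePosixPath(volumes_root, *parts)


print(hfs_to_posix("ATTO RAID0 Group:TUTORIAL:Dance Shots:Intro.mov"))
```

A mapping like this only shows where the project expects its media to live; the original volume still has to exist under that name for the links to resolve.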

hexdump -C "Lesson 1 Project.fcp" | head
00000000  a2 4b 65 79 47 0a 0d 0a  01 de 00 00 00 20 08 92  |.KeyG........ ..|
00000010  66 c4 28 d7 11 8a e5 00  30 65 ec fe 98 03 00 00  |f.(.....0e......|
00000020  00 00 00 00 00 00 00 00  00 01 00 00 00 00 01 15  |................|
00000030  00 00 00 07 73 75 62 74  79 70 65 01 00 00 00 01  |....subtype.....|
00000040  03 00 00 00 00 06 4e 4f  55 4e 44 4f 00 00 00 00  |......NOUNDO....|
00000050  01 01 00 00 00 00 00 00  00 00 00 00 00 07 52 55  |..............RU|
00000060  4e 54 49 4d 45 00 00 00  00 01 01 00 00 00 00 00  |NTIME...........|
00000070  01 00 00 00 07 76 69 65  77 65 72 73 00 00 00 00  |.....viewers....|
00000080  01 01 00 00 00 00 00 00  00 00 00 00 00 00 00 08  |................|
00000090  63 68 69 6c 64 72 65 6e  00 00 00 00 01 01 01 00  |children........|

Almost identical to the first version, which is helpful for identification, but if we need to identify based on version, it might prove a little more difficult. All the samples I have, and all those I have seen referenced, begin with the same 5 bytes, A24B657947, or 0xA2 followed by “KeyG”. It’s hard to know which other bytes might have something to do with versions of the file format. More samples could tell us, but from what I have, the 20 bytes starting from offset 12 seem to be consistent among the different version samples. For now the 5 bytes at the beginning of the file should suffice for identification.
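The 5-byte check is simple enough to sketch in Python. This is only an illustration of the idea, not the signature itself; the names `FCP_MAGIC`, `looks_like_fcp`, and `file_looks_like_fcp` are made up for the example.

```python
# The five leading bytes seen in both samples: 0xA2 followed by ASCII "KeyG".
FCP_MAGIC = bytes.fromhex("A24B657947")


def looks_like_fcp(data: bytes) -> bool:
    """Return True if the buffer starts with the KeyGrip magic bytes."""
    return data.startswith(FCP_MAGIC)


def file_looks_like_fcp(path: str) -> bool:
    """Read only as many bytes as the magic needs and test them."""
    with open(path, "rb") as handle:
        return looks_like_fcp(handle.read(len(FCP_MAGIC)))
```

For example, `file_looks_like_fcp("Swing.fcp")` would be expected to return True for the version 1.2 sample above, since its dump starts with a2 4b 65 79 47.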

When Final Cut Pro went through a complete re-write in 2011, the FCP format was abandoned. Not only made obsolete, but completely unsupported. The new Final Cut Pro X software was not able to support this now obsolete format. The new format followed the pattern of many other Apple formats of using a folder identified through an extension as a single file. Called a bundle format, Final Cut Pro X used the extension, .FCPBUNDLE. This bundle could include the media assets along with project settings/thumbnails and clips. Because of this “bundle” format, identification would have to be done at the individual file level inside the bundle. This would include formats with extensions such as .flexolibrary and .fcpevent, which appear to be SQLite databases. This complex format makes preservation of this type of object difficult with current methods and practices.
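Identification inside a bundle could be sketched as a walk over the folder, testing each file for the standard 16-byte SQLite header ("SQLite format 3" followed by a null byte). This is a rough Python sketch; the function name `sqlite_files_in_bundle` is hypothetical, and it makes no assumption about how the bundle is laid out beyond it being an ordinary folder.

```python
import os

# Every SQLite 3 database file begins with this 16-byte header string.
SQLITE_MAGIC = b"SQLite format 3\x00"


def sqlite_files_in_bundle(bundle_dir: str) -> list:
    """Return relative paths of files in the bundle that start with the SQLite header."""
    hits = []
    for root, _dirs, files in os.walk(bundle_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as handle:
                if handle.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC:
                    hits.append(os.path.relpath(path, bundle_dir))
    return sorted(hits)
```

Run against a bundle, the returned list would show which of the .flexolibrary or .fcpevent files really are SQLite databases rather than relying on the extension.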

Luckily Apple didn’t leave Final Cut Pro users completely unable to migrate their content. Final Cut Pro could export the project as an XML file. This format is called Final Cut Pro XML Interchange Format and was well documented. The format was not made to bridge the gap from Final Cut Pro to Final Cut Pro X, but rather make the project file more useful outside of Final Cut Pro. Final Cut Pro X actually can’t open these files either, which is why a third party developer came in and developed 7toX (SendtoX) to allow for projects to be converted to a newer XML format.

Let’s take a look at the basic Final Cut Pro XML Interchange Format, which has a standard XML extension:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xmeml>
<xmeml version="5">
<sequence id="Sequence 1 ">...</sequence>
</xmeml>

Standard XML with a Doctype/root of xmeml. Clever. A little ways into the XML we also see:

<appspecificdata>
	<appname>Final Cut Pro</appname>
	<appmanufacturer>Apple Inc.</appmanufacturer>
	<appversion>7.0</appversion>
</appspecificdata>

Final Cut Pro X also has an XML format, which differs from XMEML and uses the extension .FCPXML:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE fcpxml>

<fcpxml version="1.8">
    <resources>
        <format id="r1" name="FFVideoFormatDV720x480i5994" frameDuration="2002/60000s" fieldOrder="lower first" width="720" height="480" paspH="10" paspV="11" colorSpace="6-1-6 (Rec. 601 (NTSC))"/>
    </resources>
    <library location="file:///Untitled.fcpbundle/">...</library>
</fcpxml>

A different Doctype/root and structure but should be easy to identify.
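Telling the two XML flavours apart comes down to reading the root element. Here is a minimal Python sketch using the standard library parser; the function name `fcp_xml_flavour` is made up for the example, and it stops after the first element rather than parsing the whole project.

```python
import io
import xml.etree.ElementTree as ET


def fcp_xml_flavour(xml_bytes: bytes):
    """Return (root tag, version attribute) for xmeml or fcpxml, else None."""
    try:
        # The first "start" event is always the root element.
        for _event, element in ET.iterparse(io.BytesIO(xml_bytes), events=("start",)):
            root = element
            break
        else:
            return None
    except ET.ParseError:
        return None
    if root.tag in ("xmeml", "fcpxml"):
        return root.tag, root.get("version")
    return None
```

With the two samples above this would report the tag and version pairs xmeml/5 and fcpxml/1.8, while any other XML file returns None.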

The preservation of project files, according to some, is not necessary since they are not the finalized product. Preserving the finalized output would be preferable, as it is easier to manage and represents the final render of a project. But identification of the Final Cut Pro project and all the assets gives the option to access a collection more accurately. I was able to create a signature for the FCP, XML, and FCPXML formats. Take a look on my GitHub for the signatures and some test files.

Apple Mail

October 27, 2023 by Thor

There really is no “Macintosh Format”, but there sure are a lot of formats you only find on macOS. From Resource Forks and iWork formats to unique sound formats, macOS has them all! The majority of cross-platform software vendors have done a much better job in recent years of making their file formats the same across platforms, but Apple loves to make things unique, just for its platform.

Take EMLX for example. It seems to be a trend to add “X” to the end of an older format to breathe new life into it. The EML format, or Electronic Mail, has existed for a few decades now, but in 2005 Apple updated their Apple Mail application to use a new format, EMLX.

As far as I know, Apple hasn’t released any documentation on the EMLX format, but many folks out there have asked the question and have been able to “reverse engineer” the format. Let’s take a look.

An EMLX file consists of three parts:

bytecount on first line;
email content in MIME format (headers, body, attachments);
Apple property list (plist) with metadata.

The byte count varies from file to file. It is the total number of bytes from the start of the MIME message, including any HTML, to the start of the XML property list. Let’s look at a simple EMLX.

The byte count is on line 1, with the MIME email (EML) taking up the next 556 bytes, then the XML plist at the end. You may ask, what is a plist? Well, it is another Apple (originally NeXTSTEP) invention which is embedded throughout the macOS operating system. A plist is usually an XML file of keys and values, but it can also be in a binary format. The plist can contain properties of the email within Apple Mail, like special color flags, whether it was tagged as junk, the date received, and when it was last viewed.
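The three-part layout can be sketched in Python with the standard `email` and `plistlib` modules. This is a minimal sketch of the reverse-engineered structure described above, not an official parser; the function name `split_emlx` is made up, and it assumes the byte count is followed by a single newline.

```python
import email
import plistlib


def split_emlx(data: bytes):
    """Split raw EMLX bytes into (byte count, parsed message, plist dictionary)."""
    # Part 1: the byte count sits alone on the first line.
    first_line, _newline, rest = data.partition(b"\n")
    byte_count = int(first_line.strip().decode("ascii"))
    # Part 2: exactly byte_count bytes of MIME message follow.
    message = email.message_from_bytes(rest[:byte_count])
    # Part 3: whatever remains is the Apple property list.
    properties = plistlib.loads(rest[byte_count:].strip())
    return byte_count, message, properties
```

Writing out just the `rest[:byte_count]` slice would give a plain EML file, which is essentially what the conversion tools do.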

If you do happen across an EMLX file or a group of them, there are a few tools you can use to convert them to a plain old EML. There are Python libraries and many other tools that can do the job.

But first we need to be sure of identification beyond the extension. Adding this file format to PRONOM would help in identification for preservation purposes. Run against the PRONOM signatures today, we get:

filename : '9.emlx'
filesize : 18582
modified : 2023-10-26T22:16:25-06:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/950'
    format  : 'MIME Email'
    version : '1.0'
    mime    : 'message/rfc822'
    class   : 'Text (Structured)'
    basis   : 'byte match at [[31 17] [599 4] [339 6] [426 6] [90 14]]'
    warning : 'extension mismatch'

Because the format has an EML plain text format within its structure, it is assumed to be an EML file. While that is technically accurate, identifying it as a unique EMLX format would be beneficial in a preservation system, so you can properly assign risk and choose the right tool to parse or migrate.

In looking at the three parts of an EMLX format, we know the EML file is not a good way to show the difference as they are the same structure. The byte count on the first line is variable, so there is no static byte sequence to use for identification. That leaves the Plist section at the end to distinguish the difference.

The PRONOM entry for a Plist looks for the typical XML strings present in most XML files, but then uses the root element <plist version="1.0"> for identification. We could combine the existing EML signature and the Plist signature to identify an EMLX, or just take the existing EML signature and add a small byte sequence for the closing </plist> tag near the EOF. Either way the new signature would need priority over EML, and both approaches would essentially accomplish the same thing.
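The second idea can be roughed out in Python to test it against samples before writing a signature. This sketch is not PRONOM syntax and it leaves out the EML signature part entirely; it only checks for a numeric first line and a closing </plist> tag near the end. The function name `looks_like_emlx` and the 64-byte tail window are assumptions for the example.

```python
import re


def looks_like_emlx(data: bytes, tail_window: int = 64) -> bool:
    """Heuristic: digits-only first line plus '</plist>' near the end of the file."""
    first_line, newline, _rest = data.partition(b"\n")
    # The first line must exist and hold nothing but the byte count.
    if not newline or not re.fullmatch(rb"\d+\s*", first_line):
        return False
    # The closing plist tag should sit within the last few bytes.
    return b"</plist>" in data[-tail_window:]
```

A plain EML fails the first test because it starts with a header line, which is the behaviour a priority rule over the EML signature is meant to achieve.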

Take a look at the latter idea on my GitHub page and tell me which makes the most sense.