Scrivener

Word Processors are everywhere and have some of the most recognizable file formats. Some are very simple in that they just contain plain text, others are more complex. There are formats which allow for images and others which can handle different languages and writing directions.

A writing platform I recently learned about is called Scrivener. It was first released in 2007 by a company called Literature & Latte Ltd, and has Macintosh and Windows versions. The software is marketed toward writers, as there are some features that help with note taking, research, and much more. It also allows for adding multimedia and even full webpages.

This is accomplished by a file format which uses a non-traditional method for storing all the data needed to render the format.

tree Scrivener3-s01.scriv
Scrivener3-s01.scriv
├── Files
│   ├── Data
│   │   ├── 921B4A08-54C0-4B69-94FD-428F56FDAB89
│   │   │   └── content.rtf
│   │   └── docs.checksum
│   ├── binder.autosave
│   ├── binder.backup
│   ├── search.indexes
│   ├── styles.xml
│   ├── version.txt
│   └── writing.history
├── Scrivener3-s01.scrivx
└── Settings
    ├── recents.txt
    ├── ui-common.xml
    └── ui.ini

Scrivener uses a folder structure to store all the data used in the format. The folder has the extension .scriv. The format includes some rich text, backups, indexes, version history, and more. One unique format within the folder is an XML file with the extension .scrivx. This makes the format proprietary, and it can only be rendered using the Scrivener software.

cat Scrivener3-s01.scrivx | head
<?xml version="1.0" encoding="UTF-8"?>
<ScrivenerProject Template="No" Version="2.0" Identifier="DF5DA7F0-27DB-4815-A050-B4D6F23CABA7" Creator="SCRWIN-3.1.5.1" Device="DESKTOP-JMM4K7M" Modified="2025-03-14 22:15:28 -0600" ModID="B4A944C3-FF79-49F6-A737-158BEB4E58BB">
<Binder>
<BinderItem UUID="17807D28-117A-409E-B12D-B34922B6CC6F" Type="DraftFolder" Created="2025-03-14 22:15:17 -0600" Modified="2025-03-14 22:15:17 -0600">
<Title>Draft</Title>
<MetaData>
<IncludeInCompile>Yes</IncludeInCompile>
</MetaData>
<Children>
<BinderItem UUID="921B4A08-54C0-4B69-94FD-428F56FDAB89" Type="Text" Created="2025-03-14 22:15:17 -0600" Modified="2025-03-14 22:15:23 -0600">

The XML has enough to distinguish these files from other XML files, so the signature would be straightforward. Earlier versions of Scrivener sometimes have the SCRIVX file but sometimes use a .scrivproj extension instead. This file on a Macintosh is in a binary plist format, which is different from earlier Windows versions. It seems they may have unified them under version 2 or 3, where versions 1 and 2 for Windows use project version 1 and version 3 uses project version 2.

hexdump -C Scrivener1-s01.scriv/binder.scrivproj | head
00000000 62 70 6c 69 73 74 30 30 d4 00 01 00 02 00 03 00 |bplist00........|
00000010 04 00 05 00 1d 01 d8 01 d9 54 24 74 6f 70 58 24 |.........T$topX$|
00000020 6f 62 6a 65 63 74 73 58 24 76 65 72 73 69 6f 6e |objectsX$version|
00000030 59 24 61 72 63 68 69 76 65 72 dc 00 06 00 07 00 |Y$archiver......|
00000040 08 00 09 00 0a 00 0b 00 0c 00 0d 00 0e 00 0f 00 |................|
00000050 10 00 11 00 12 00 13 00 14 00 15 00 16 00 17 00 |................|
00000060 18 00 19 00 1a 00 15 00 1b 00 1c 5a 4c 61 62 65 |...........ZLabe|
00000070 6c 54 69 74 6c 65 59 4c 61 62 65 6c 4c 69 73 74 |lTitleYLabelList|
00000080 5e 42 69 6e 64 65 72 43 6f 6e 74 65 6e 74 73 5f |^BinderContents_|
00000090 10 0f 44 65 66 61 75 6c 74 4c 61 62 65 6c 54 61 |..DefaultLabelTa|
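
The first eight bytes of this dump, "bplist00", are the magic of a binary property list, and the $top, $objects, $version and $archiver strings that follow are the top-level keys of a keyed archive. As a rough sketch (the function name is hypothetical, and it assumes the file parses with Python's standard plistlib module), both can be checked like this:

```python
import plistlib

BPLIST_MAGIC = b"bplist00"
KEYED_ARCHIVE_KEYS = {"$top", "$objects", "$version", "$archiver"}

def describe_plist(data: bytes) -> str:
    """Return a rough label for the bytes of a possible binary plist."""
    if not data.startswith(BPLIST_MAGIC):
        return "not a binary plist"
    try:
        root = plistlib.loads(data)  # auto-detects the binary format
    except Exception:
        return "binary plist magic, but unparseable"
    # Keyed archives carry these four keys at the top level.
    if isinstance(root, dict) and KEYED_ARCHIVE_KEYS <= set(root):
        return "binary plist (keyed archive)"
    return "binary plist"
```

Calling describe_plist() on the bytes of binder.scrivproj would be the obvious use, though I have only looked at the one sample shown here.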

Since the developers of Scrivener decided to make the SCRIV format simply a folder with different content within, something special happens on the MacOS. The Scrivener software registers all the extensions it uses with the MacOS launch services. This process then changes the way the SCRIV folder is displayed in the MacOS Finder. It now appears as a single file and is given a file type. This is called a Document Package format.

By right-clicking on the “file” you can then browse the package contents. There is nothing in the folder itself or hidden in any attributes which causes this to happen; it is all controlled by what extensions have been registered with the launch services database. We can, however, ask the MacOS to give us some extended metadata details about the package, as long as the file is on an Apple filesystem like HFS or APFS.

mdls Scrivener3-s01.scriv 
_kMDItemDisplayNameWithExtensions = "Scrivener3-s01.scriv"
kMDItemContentCreationDate = 2025-03-15 04:15:17 +0000
kMDItemContentCreationDate_Ranking = 2025-03-15 00:00:00 +0000
kMDItemContentModificationDate = 2025-03-15 04:15:18 +0000
kMDItemContentModificationDate_Ranking = 2025-03-15 00:00:00 +0000
kMDItemContentType = "com.literatureandlatte.scrivener3.scriv"
kMDItemContentTypeTree = (
"com.literatureandlatte.scrivener3.scriv",
"public.directory",
"public.item",
"com.apple.package",
"public.content",
"public.composite-content"
)
kMDItemDateAdded = 2025-03-21 04:38:48 +0000
kMDItemDateAdded_Ranking = 2025-03-21 00:00:00 +0000
kMDItemDisplayName = "Scrivener3-s01.scriv"
kMDItemDocumentIdentifier = 0
kMDItemFSContentChangeDate = 2025-03-15 04:15:18 +0000
kMDItemFSCreationDate = 2025-03-15 04:15:17 +0000
kMDItemFSCreatorCode = ""
kMDItemFSFinderFlags = 0
kMDItemFSHasCustomIcon = (null)
kMDItemFSInvisible = 0
kMDItemFSIsExtensionHidden = 0
kMDItemFSIsStationery = (null)
kMDItemFSLabel = 0
kMDItemFSName = "Scrivener3-s01.scriv"
kMDItemFSNodeCount = 3
kMDItemFSOwnerGroupID = 20
kMDItemFSOwnerUserID = 501
kMDItemFSSize = 31155
kMDItemFSTypeCode = ""
kMDItemInterestingDate_Ranking = 2025-03-15 00:00:00 +0000
kMDItemKind = "Scrivener Project"
kMDItemLogicalSize = 31155
kMDItemPhysicalSize = 69632

There are a lot of additional details available using the mdls command, including the content type of “com.apple.package”. This tool works with any file in MacOS and can be very useful for gathering the information you may need for preservation.
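
If you want to work with this output in a script, a small parser is enough. The sketch below is only an illustration: parse_mdls and is_package are names I made up, and it assumes the output follows the "key = value" layout shown above, with lists wrapped in parentheses.

```python
def parse_mdls(text: str) -> dict:
    """Parse mdls text output; parenthesised lists become Python lists."""
    result = {}
    current_key = None  # set while we are inside a "( ... )" list
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if current_key is not None:
            if line == ")":
                current_key = None
            else:
                result[current_key].append(line.rstrip(",").strip('"'))
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if value == "(":
            result[key] = []
            current_key = key
        else:
            result[key] = value.strip('"')
    return result

def is_package(attrs: dict) -> bool:
    """True when the content type tree includes com.apple.package."""
    tree = attrs.get("kMDItemContentTypeTree", [])
    return isinstance(tree, list) and "com.apple.package" in tree
```

On a Mac the text could come from running mdls through subprocess.run(["mdls", path], capture_output=True, text=True) and reading its stdout.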

Until the tools we use for format identification can recognize package formats, tools like this may be needed to gather the necessary metadata for preservation. But in the meantime, identification of the package content is the best we can hope for. Creating a signature for the XML based SCRIVX format is the first step.
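
To give an idea of what such a check could look like outside of PRONOM, here is a minimal sketch in Python. It is not the signature itself: the function names are hypothetical, and it assumes the ScrivenerProject root tag and its Version attribute appear within the first 4096 bytes, as they do in my sample.

```python
import re
from pathlib import Path

# Root tag of the .scrivx file with its project Version attribute.
SCRIVX_ROOT = re.compile(rb'<ScrivenerProject\b[^>]*\bVersion="([^"]+)"')

def scrivx_project_version(data: bytes):
    """Return the Version attribute of the ScrivenerProject root tag, or None."""
    head = data[:4096]
    # Allow a UTF-8 byte order mark or whitespace before the XML declaration.
    if not head.lstrip(b"\xef\xbb\xbf \t\r\n").startswith(b"<?xml"):
        return None
    match = SCRIVX_ROOT.search(head)
    return match.group(1).decode("ascii", "replace") if match else None

def find_scrivener_projects(root: Path):
    """Yield (package folder, project version) for each .scriv folder under root."""
    for folder in root.rglob("*.scriv"):
        if not folder.is_dir():
            continue
        for scrivx in folder.glob("*.scrivx"):
            version = scrivx_project_version(scrivx.read_bytes())
            if version is not None:
                yield folder, version
```

Run against the sample above, scrivx_project_version() should report the project version "2.0" rather than the Scrivener application version.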

Stay tuned for more on the package format, as I will be bringing it up more in the Digital Preservation community.

LUTS

If you are looking for LUTs, you’re in luck. There is a website for sharing your FreshLUTs. Even though they are fresh, they are probably not as exciting as one might think.

LUT is short for Look-Up Table, which doesn’t sound as exciting as you were probably hoping. LUTs are a pretty interesting way of dealing with color in high-end image and video processing applications. Often called 3D Look-Up Tables, they are used for color grading, an essential step in film production and restoration, to map from one color space to another. LUTs are not to be confused with ICC profiles, which aim for color accuracy, while LUTs aim more for color quality and aesthetics.

There are a lot of LUT formats out there, it seems. In looking into this format, I have found dozens of others to investigate, but today let’s look at the four available as an export from Photoshop.

Above you can see a simple screenshot of the export of different formats from Adobe Photoshop. Adobe is one of the biggest developers and supporters of the formats used in LUTs, but there are many other graphics tools which create and support LUTs. In this Photoshop export we can see four formats included. Let’s take a look at each of these.

ICC Profiles are well documented and available for identification in PRONOM.

filename : 'LUTs-Export-s01.icc'
filesize : 197024
modified : 2025-02-25T09:37:24-07:00
errors :
matches :
- ns : 'pronom'
id : 'fmt/1975'
format : 'ICC Profile'
version : '2'
mime : 'application/vnd.iccprofile'
class : 'Dataset'
basis : 'extension match icc; byte match at 8, 32'

But the other three are plain text files and still identify as such. Let us start with the CUBE format.

filename : 'LUTs-Export-s01.cube'
filesize : 884963
modified : 2025-02-25T09:37:24-07:00
errors :
matches :
- ns : 'pronom'
id : 'x-fmt/111'
format : 'Plain Text File'
version :
mime : 'text/plain'
class :
basis : 'text match ASCII'
warning : 'match on text only; extension mismatch'

cat LUTs-Export-s01.cube
#Created by: Adobe Photoshop Export Color Lookup Plugin
#Copyright: (C) Copyright 2025 ObsoleteThor
TITLE "LUT-export-s01"

#LUT size
LUT_3D_SIZE 32

#data domain
DOMAIN_MIN 0.0 0.0 0.0
DOMAIN_MAX 1.0 1.0 1.0

#LUT data points
0.000000 0.000000 0.000000

The CUBE format was first developed by IRIDAS in 2003 as an answer to the need for interoperability with other software. Adobe acquired IRIDAS in 2011 in an effort to be a leader in the color grading and enhancement market. They published the CUBE specification, version 1.0, in 2013.

A Cube file is a text file that defines a look-up table in the Cube format.
The Cube look-up tables store RGB values.
Advantages of the Cube format include:
  • The Cube format can describe look-up tables for a wide range of purposes, from simple gamma adjustments for display output to complex HDR image processing.
  • The format is well suited for professional digital cinema applications and for both normal range and High-Dynamic Range image processing.
  • As Cube files are text files, they are easily edited or reviewed using a text editor.
  • A Cube file can include three 1-dimensional tables or one 3-dimensional table.
  • The tables can be in a wide range of sizes.
  • Cube files are trivial to write and read.
  • All values are human-readable as they are in decimal form, and can be of high precision.
  • The input domain and output range are not limited to the range 0.0 to 1.0.

According to the specifications, a CUBE file can be a One-Dimensional Cube file or a Three-Dimensional Cube file. From the example above you can see the file is a Three-Dimensional file with the required line “LUT_3D_SIZE”. But in a One-Dimensional file, the required line is “LUT_1D_SIZE”.

cat Demo.cube
TITLE "Demo"
LUT_1D_SIZE 3
DOMAIN_MIN 0 0 0
DOMAIN_MAX 1 2 3
0 0 0
# Comments can go anywhere
0.5 1 1.5
1 1 1

Each CUBE file has one or the other, and it should be an easy string to look for. It is in a variable position, as there can be comments and a TITLE line before the required line. The TITLE and DOMAIN lines are common but not required.
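
A rough sketch of that search in Python, skipping comments and other keyword lines until one of the two size lines turns up. The function names are my own, and it assumes the size line comes before any table data, which is how both samples above are laid out. The row count assumes a 1D table has N rows and a 3D table has N x N x N rows.

```python
def cube_lut_kind(text: str):
    """Return ("1D", size) or ("3D", size) for a CUBE file, else None."""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments can appear anywhere
        parts = line.split()
        if (parts[0] in ("LUT_1D_SIZE", "LUT_3D_SIZE")
                and len(parts) == 2 and parts[1].isdigit()):
            return parts[0][4:6], int(parts[1])
        if parts[0][0].isdigit() or parts[0][0] in "+-.":
            break  # table data has started without a size line
    return None

def expected_rows(kind: str, size: int) -> int:
    """Number of data rows implied by the size line."""
    return size if kind == "1D" else size ** 3
```

For the Photoshop export above this gives ("3D", 32), which implies 32768 rows of data.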

Now, the CUBE format is a bit different depending on the source. They all seem to have the same header, but different elements. It seems the IRIDAS Cube format is the most interoperable. The Truelight Cube format generally has the CUB extension, and the cineSpace Cube has the CSP extension, which we will look at next. You can read more about the differences on this format comparison table. This LUTCalc web site has many different types of Cubes it can output, so there are some differences.

The other file format available in the export is CSP. The CSP is also a plain text file, often called a cineSpace LUT file. This format comes from cineSpace, a color management application for the film and television industry.

cat LUTS-s01.csp 
CSPLUTV100
3D

BEGIN METADATA
#Created by: Adobe Photoshop Export Color Lookup Plugin
TITLE "LUTS"
END METADATA

2
0.0 1.0
0.0 1.0
2
0.0 1.0
0.0 1.0
2
0.0 1.0
0.0 1.0

32 32 32
0.000000 0.000000 0.000000

The CSP file format specification outlines the header and the other two sections:

The cineSpace LUT format contains three main sections.
Header
This section contains the LUT identifier and the LUT type, 3D or 1D.
It is made up of the first two (2) valid lines in the file. See Notes below for the definition of a valid line.

Examples
• (3D LUT) header:
CSPLUTV100
3D
• (1D LUT) header:
CSPLUTV100
1D

So there is a pretty obvious header to work with in identification. “CSPLUTV100” can be used to identify both 1D and 3D CSP files.
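
As a quick sketch of that check in Python: the function name is hypothetical, and since the specification's notes on what counts as a valid line are not quoted here, the code simply assumes a valid line is any non-blank line.

```python
CSP_MAGIC = "CSPLUTV100"

def csp_lut_type(text: str):
    """Return "1D" or "3D" if the first two non-blank lines form a CSP header."""
    # Assumption: "valid line" means non-blank; the spec may define more cases.
    valid = [line.strip() for line in text.splitlines() if line.strip()]
    if len(valid) >= 2 and valid[0] == CSP_MAGIC and valid[1] in ("1D", "3D"):
        return valid[1]
    return None
```

Both header examples from the specification pass, as does the Photoshop export shown earlier.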

The other format available to export from Photoshop is 3DL. It seems to be connected to the Assimilate Inc. company and software. A specification has been posted, and it looks like there is only ASCII and not much in the way of a header.

cat LUTS-s01.3dl 
#Created by: Adobe Photoshop Export Color Lookup Plugin
#Description: LUTS
0 33 66 99 132 165 198 231 264 297 330 363 396 429 462 495 528 561 594 627 660 693 726 759 792 825 858 891 924 957 990 1023

It does not appear there are any headers or static strings to use for identification. The specification calls the format the 3DL ASCII format and states that “All lines starting with ‘#’ are treated as comments.” Because of this, I don’t think positive identification can happen at this time.
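
The closest thing to a fingerprint in my one sample is that the first non-comment line is a run of increasing integers. The Python below is only a weak heuristic built on that single observation, with a made-up function name; it is not a signature, and I have not checked it against the specification or other exporters.

```python
def looks_like_3dl(text: str) -> bool:
    """Weak heuristic: first non-comment line is a strictly increasing run of integers."""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # lines starting with '#' are comments
        fields = line.split()
        if len(fields) < 2 or not all(f.isdigit() for f in fields):
            return False
        values = [int(f) for f in fields]
        return all(a < b for a, b in zip(values, values[1:]))
    return False
```

Plenty of other plain text files could pass this test, which is exactly why a real signature is not practical here.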

For now I am just proposing two new file formats to PRONOM: the CUBE format and the CSP format. Click on my GitHub submission page to see the signatures and enjoy some samples!

Pro Tools Sessions

One of the most important software titles related to professional audio recording and mixing is Pro Tools. The Digital Audio Workstation by Digidesign, now Avid, has been around since 1991 and was born from the very popular Sound Designer software first released in 1985. When Sound Designer II was released a few years later, the audio format used became the standard file format for audio recordings. Pro Tools progressed from there to become the industry standard for professional audio production, even winning a Technical Grammy, Emmy, and Oscar.

Pro Tools helped produce amazing music for artists such as No Doubt, Maroon 5, Ricky Martin, and many others. Obviously the best part is the final mixed audio used to make the music we love, but the work that goes into creating the audio mixes is saved in a Pro Tools session. The session is where all the magic happens. A Pro Tools session is actually a project file within a folder where all the supporting files are located.

tree PT Sample/
├── Audio Files
│   ├── GTR 1_02.wav
│   ├── GTR 1_03.wav
│   └── GTR 1_04.wav
└── Test.ptx

These Session “Folders” can get pretty complex as more audio and effects are added to the session, adding folders such as Fade Files, Rendered Files, and Plug-in settings. The current version of Pro Tools uses a project session file with the extension PTX, but that wasn’t always the case. The current version of Pro Tools can be run on Macintosh and Windows, but that also was not always the case. Because the software was originally written for Macintosh hardware, the session files were only compatible on the Macintosh file system as well.

Let’s start by looking at a session from Pro Tools version 1.1 from 1991.

ls -l@ Demo Disk 1 
total 1504
-rw-r--r--@ 1 thorsted Domain Users 45056 Sep 13 1991 Backward Kick
com.apple.FinderInfo 32
com.apple.ResourceFork 1354
com.apple.provenance 11
-rw-r--r--@ 1 thorsted Domain Users 0 Sep 16 1991 Demo Session
com.apple.FinderInfo 32
com.apple.ResourceFork 13671
com.apple.provenance 11
-rw-r--r--@ 1 thorsted Domain Users 0 Sep 16 1991 Desktop
com.apple.FinderInfo 32
com.apple.ResourceFork 3081
com.apple.provenance 11
-rw-r--r--@ 1 thorsted Domain Users 339456 Sep 13 1991 Solo 1
com.apple.FinderInfo 32
com.apple.ResourceFork 2040
com.apple.provenance 11
-rw-r--r--@ 1 thorsted Domain Users 350390 Sep 13 1991 Solo 2
com.apple.FinderInfo 32
com.apple.ResourceFork 2006
com.apple.provenance 11

You might notice the “Demo Session” file is Zero Bytes, but the Resource Fork is 13671 bytes in size.

The Pro Tools Sessions from the beginning until version 5 used this method of storing the session data. ALL in the Resource Fork. Because the session data was in the resource fork and the supporting audio files were in the Sound Designer II format, which also stored important information in the resource fork, this made it impossible to use on anything but a Macintosh file system.

Version 10 of Pro Tools allows you to export the full session back to older versions of the software, as far back as version 3.2. When you choose version 5 on a Mac, it forces you to also convert the audio to SD2 files. For versions 1 & 2 of Pro Tools, there was no official extension for the session files, but starting with version 3, you will often find the extension PT3, then PT4, and PT5. With version 4, there was also a P24 extension, used when Pro Tools version 4 made the leap to 24bit. But for each of these versions identification is not possible with current preservation tools like PRONOM. You could encode the session as MacBinary to retain everything for modern systems, which is identifiable, but you could also use my proposal for a lookup in the TCDB python tool located here.

python3 TC-lookup-draft-uni.py "PT Session 02-41.pt4"
Type Code: PT4S
Creator Code: PTul
Size of Data Fork: 0 bytes
Size of Resource Fork: 14003 bytes
Rows with Type Code b'PT4S' and Creator Code b'PTul':
Row index: 32813
File Name: Pro Tools 4
Type: PT4S
Creator: PTul
Extension: pt4
Data by Ilan Szekely, Jerusalem: nan

Extension  Version              Type  Creator
(none)     Pro Tools 1.1        mtSF  TLin
(none)     Pro Tools 2          PSes  PTul
PT3        Pro Tools 3.2        PSes  PTul
PT4        Pro Tools 4 16bit    PT4S  PTul
PT24       Pro Tools 4 24bit    PT24  PTul
PT5        Pro Tools 5          PT5S  PTul
PTS        Pro Tools 5.1-6.9    PTS   PTul
PTF        Pro Tools 7-9        PTF   PTul
PTX        Pro Tools 10+        PTX   PTul
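
The type and creator codes in the table above are stored in the first eight bytes of the 32-byte com.apple.FinderInfo attribute seen in the ls -l@ listing: four bytes of type code followed by four bytes of creator code. A minimal sketch of pulling them out and looking them up, with hypothetical names and only the resource-fork-era rows from the table:

```python
def type_and_creator(finder_info: bytes):
    """Split a com.apple.FinderInfo value into (type code, creator code)."""
    if len(finder_info) < 8:
        raise ValueError("FinderInfo needs at least 8 bytes")
    return finder_info[0:4], finder_info[4:8]

# Rows taken from the table above; the labels are mine.
KNOWN_SESSIONS = {
    (b"mtSF", b"TLin"): "Pro Tools 1.1",
    (b"PSes", b"PTul"): "Pro Tools 2 or 3.2",
    (b"PT4S", b"PTul"): "Pro Tools 4 16bit",
    (b"PT24", b"PTul"): "Pro Tools 4 24bit",
    (b"PT5S", b"PTul"): "Pro Tools 5",
}

def identify_session(finder_info: bytes) -> str:
    """Look up the type/creator pair, falling back to 'unknown'."""
    return KNOWN_SESSIONS.get(type_and_creator(finder_info), "unknown")
```

This only works while the file still carries its Finder metadata, so it is no help once a session has been copied to a filesystem that drops extended attributes.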

There isn’t a lot of information about when Pro Tools was made for Windows. I found some references to a Windows NT version of the 16bit and 24bit version 4. I did also find a copy of the free Pro Tools version 5.01 for Windows 98. In the Read Me it states:

Cross–platform File Exchange is not supported in this version of Pro Tools FREE

File interchange between Mac and PC versions of Pro Tools FREE is not possible in this 5.0.1 release. We hope to include this functionality in a future release of Pro Tools FREE. You can exchange files with Pro Tools LE and TDM users who use the same platform (Mac or Win98/Me) as you, but remember, Pro Tools FREE is limited to 8 audio and 48 MIDI tracks.

Running the software confirms the session file for this version has the extension PT5 and not the later PTS for version 5.1. This version of Pro Tools also allows you to save back to the P24 and PT4 versions, which are probably the first Windows versions. But they are entirely different file formats from the Macintosh versions.

hexdump -C PT5-Win-s03.pt5 | head
00000000 00 00 01 00 00 00 45 ae 00 00 44 ae 00 00 03 98 |......E...D.....|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000100 00 00 00 5e 50 53 56 45 00 01 04 31 00 52 05 00 |...^PSVE...1.R..|
00000110 45 44 05 00 45 44 19 99 03 26 0c 50 72 6f 54 6f |ED..ED...&.ProTo|
00000120 6f 6c 73 20 35 2e 30 fc c5 00 d7 12 00 78 5e 00 |ols 5.0......x^.|
00000130 00 00 0e 32 00 78 5e 00 00 00 00 00 00 00 00 00 |...2.x^.........|
00000140 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

hexdump -C PT24-Win-s03.p24 | head
00000000 00 00 01 00 00 00 3f d3 00 00 3e d3 00 00 02 f1 |......?...>.....|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000100 00 00 01 0a 50 61 74 68 00 01 02 b4 37 0e 2e 48 |....Path....7..H|
00000110 43 3a 5c 57 49 4e 44 4f 57 53 5c 44 65 73 6b 74 |C:\WINDOWS\Deskt|
00000120 6f 70 5c 50 54 5c 50 54 35 2d 57 69 6e 2d 73 30 |op\PT\PT5-Win-s0|
00000130 33 5c 41 75 64 69 6f 20 46 69 6c 65 73 00 00 00 |3\Audio Files...|
00000140 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

hexdump -C PT4-Win-s03-16.pt4 | head
00000000 00 00 01 00 00 00 3f d9 00 00 3e d9 00 00 02 f1 |......?...>.....|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000100 00 00 01 0a 50 61 74 68 00 01 02 b4 37 0e 2e 48 |....Path....7..H|
00000110 43 3a 5c 57 49 4e 44 4f 57 53 5c 44 65 73 6b 74 |C:\WINDOWS\Deskt|
00000120 6f 70 5c 50 54 5c 50 54 35 2d 57 69 6e 2d 73 30 |op\PT\PT5-Win-s0|
00000130 33 5c 41 75 64 69 6f 20 46 69 6c 65 73 00 00 00 |3\Audio Files...|
00000140 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

Starting with Pro Tools 5.1 in 2001, things began to change. Pro Tools has always been tied very closely to hardware and software, so Apple launching Mac OS X provided an opportunity for Digidesign/Avid to revamp their hardware and software for better compatibility, and this included a cross-platform session format.

Pro Tools 5.1 used a new session format which used the extension PTS. Let’s take a look at a sample.

hexdump -C PT Session 02-51.pts | head
00000000 03 30 30 31 30 31 31 31 31 30 30 31 30 31 30 31 |.001011110010101|
00000010 31 00 01 3d 6e 1c 06 eb d8 c1 aa 16 fd 65 4e 6d |1..=n........eNm|
00000020 23 09 96 db c4 ad 95 7f 68 5d 3a 23 0c a5 ac a8 |#.......h]:#....|
00000030 90 cd ed 04 38 4e 06 47 bc e2 ca b3 9c 8f 6e 57 |....8N.G......nW|
00000040 40 2a 12 fb e4 c4 b6 9f 88 77 5a 43 2c 24 ce c9 |@*.......wZC,$..|
00000050 e3 97 9b 8a 73 5d 46 2f 4a 64 86 b6 dd d6 eb 77 |....s]F/Jd.....w|
00000060 76 49 32 1b 54 9f b9 9f fc fe 15 0f 3f 15 4d 62 |vI2.T.......?.Mb|
00000070 83 aa ab c4 fa 5d 20 26 54 44 0b f3 d9 c5 ae 97 |.....] &TD......|
00000080 cd 08 31 74 77 0d f6 df c8 b5 c0 8b 6c 7c 3f 27 |..1tw.......l|?'|
00000090 10 9e c2 cb b4 9d 86 45 58 41 2a ad e1 78 2d b4 |.......EXA*..x-.|

The session is a new proprietary binary format with an interesting header. There is one byte and then a sequence of ASCII characters in the form of a binary string, 0010111100101011. What it means is unknown to me. In decimal, the binary reads “12075”, in hex “2F2B”, or in text “/+”. Regardless of what it means, this header was used from versions 5.1 through 9. The extension changed to PTF with versions 7-9, but the header is the same. This is why PRONOM PUID fmt/1951 refers to both extensions, covering 5.1-9.
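
The conversions are easy to reproduce. This short Python snippet takes the first 19 bytes of the dump above, pulls out the 16 ASCII characters after the leading 0x03 byte, and reads them as a base-2 number:

```python
# First 19 bytes of the PTS sample, copied from the hexdump above.
header = bytes.fromhex(
    "03 30 30 31 30 31 31 31 31 30 30 31 30 31 30 31 31 00 01"
)

bit_string = header[1:17].decode("ascii")  # the 16 ASCII '0'/'1' characters
value = int(bit_string, 2)                 # read them as a binary number
as_bytes = value.to_bytes(2, "big")        # the same value as two raw bytes

print(bit_string, value, hex(value), as_bytes)
# prints: 0010111100101011 12075 0x2f2b b'/+'
```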

hexdump -C PT Session 02-7.ptf | head
00000000 03 30 30 31 30 31 31 31 31 30 30 31 30 31 30 31 |.001011110010101|
00000010 31 00 01 4c 6a cd 68 00 a0 3c d8 d2 c1 ac 48 be |1..Lj.h..<....H.|
00000020 85 1c 25 54 f0 8c 31 e1 61 fc 98 34 d0 6c 08 a4 |..%T..1.a..4.l..|
00000030 40 dc 79 14 b0 4c eb 84 21 bc 58 f4 90 2c cc 64 |@.y..L..!.X..,.d|
00000040 00 9c 0e a7 15 6f a9 44 e0 7c 18 b4 7a ec 88 24 |.....o.D.|..z..$|
00000050 c6 42 65 77 5d b8 f2 80 a1 3c d8 2e 12 ac 6b e4 |.Bew]....<....k.|
00000060 80 1c a2 71 f0 8c 2c c4 60 fc ae 47 b5 0f 09 a4 |...q..,.`..G....|
00000070 40 dc 78 14 9a 4c e8 84 26 a2 c5 17 fd 58 52 e0 |@.x..L..&....XR.|
00000080 01 9c 38 d4 70 0d a8 44 e0 26 1a b4 73 ec 88 24 |..8.p..D.&..s..$|
00000090 da 79 f8 94 34 cc 68 04 96 4f bd 17 11 ac 48 e4 |.y..4.h..O....H.|

It might be possible to look closer at the two extensions and find something which can distinguish between them, but because they are in a proprietary binary format, there isn’t much to go on. There have been a few attempts at reverse engineering the formats, but even they chose to lump the two extensions together.

The other important byte in this header is the second byte after the odd binary ASCII sequence, at offset 0x12 in the dumps above. Its value of 0x01 is important because in the next version, PTX, this changes to 0x05.

Pro Tools version 10 was a big release: it added new features and started to phase out the HD hardware. With this release we see a new session format, which is still used by the current version of Pro Tools.

hexdump -C PT Session 02-10.ptx | head
00000000 03 30 30 31 30 31 31 31 31 30 30 31 30 31 30 31 |.001011110010101|
00000010 31 00 05 13 5a 01 00 04 00 00 00 49 a4 00 00 5a |1...Z......I...Z|
00000020 03 00 64 00 00 00 03 00 00 0c 00 00 00 50 72 6f |..d..........Pro|
00000030 20 54 6f 6f 6c 73 20 48 44 03 00 00 00 0a 00 00 | Tools HD.......|
00000040 00 03 00 00 00 09 00 00 00 06 00 00 00 31 30 2e |.............10.|
00000050 33 2e 39 01 07 00 00 00 52 65 6c 65 61 73 65 00 |3.9.....Release.|
00000060 16 00 00 00 50 72 6f 20 54 6f 6f 6c 73 20 53 65 |....Pro Tools Se|
00000070 73 73 69 6f 6e 20 46 69 6c 65 06 00 05 00 00 00 |ssion File......|
00000080 4d 61 63 4f 53 00 00 00 00 05 5a 08 00 eb 00 00 |MacOS.....Z.....|
00000090 00 67 20 00 00 00 00 2a 00 00 00 be 1d 9d e3 03 |.g ....*........|

This new session format has the same binary ASCII string, but a lot more plain text in the header and throughout the file. This gives us more to explore and understand, as it even lists the linked audio files and their paths. PRONOM has this new format assigned to PUID fmt/1727. The signature for these files uses the same sequence as the previous version, followed by the 0x05 byte and a few additional bytes, 5A010004, after the main sequence. I am not sure of the bytes’ significance, but they are in all the samples I have, even from the current version.

Pro Tools has some other formats which go along with its sessions. One I’ll highlight is the Groove template format, which uses the extension GRV. You can see some samples here. They also have the odd binary ASCII header, but with 0x00 for the second byte after the main header, at offset 0x12 in the dump below.

hexdump -C DiskoKonga.grv| head
00000000 03 30 30 31 30 31 31 31 31 30 30 31 30 31 30 31 |.001011110010101|
00000010 31 01 00 5a 00 01 00 00 00 04 00 00 15 f8 5a 00 |1..Z..........Z.|
00000020 01 00 00 15 d3 10 42 04 04 00 64 00 64 00 64 00 |......B...d.d.d.|
00000030 01 00 01 00 01 00 00 00 00 01 d4 c0 00 00 00 00 |................|
00000040 00 00 00 00 81 00 00 00 00 00 00 00 81 5a 00 01 |.............Z..|
00000050 00 00 00 24 10 43 00 00 00 00 00 00 00 00 00 00 |...$.C..........|
00000060 00 00 00 01 d4 c0 00 00 00 00 00 00 00 00 00 00 |................|
00000070 00 00 00 01 d4 c0 00 49 5a 00 01 00 00 00 24 10 |.......IZ.....$.|
00000080 43 00 00 00 00 00 01 d4 c0 00 00 00 00 00 05 7e |C..............~|
00000090 40 00 00 00 00 00 04 8e e0 00 00 00 00 00 01 d4 |@...............|

Other extensions associated with Pro Tools which use the same format are: PIO, PIM, PTT, PTXT, RGRP.
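
Putting the header observations from this section together, a rough classifier in Python could look like the sketch below. It is not the PRONOM signature: the function name and labels are mine, the offsets come from the few samples shown here, and it assumes the 5A010004 bytes always sit at offset 0x14 as they do in my PTX sample.

```python
# 0x03 followed by the 16-character ASCII binary string (offsets 0x00-0x10).
PT_PREFIX = b"\x03" + b"0010111100101011"

def classify_pro_tools(data: bytes) -> str:
    """Label a file by the byte at offset 0x12 after the shared prefix."""
    if len(data) < 0x13 or not data.startswith(PT_PREFIX):
        return "no Pro Tools header"
    marker = data[0x12]
    if marker == 0x01:
        return "PTS/PTF session (5.1-9), PRONOM fmt/1951"
    if marker == 0x05 and data[0x14:0x18] == b"\x5a\x01\x00\x04":
        return "PTX session (10+), PRONOM fmt/1727"
    if marker == 0x00:
        return "groove template or related format"
    return "Pro Tools header, unknown variant"
```

Since the PIO, PIM, PTT, PTXT and RGRP files share the same header, the 0x00 case cannot tell them apart from a groove template without more research.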

Pro Tools has always been software directly tied to audio hardware and system software. In addition, it used software dongles to control licensing, and the licenses were not cheap. Because of this, trying to use older versions is very difficult. Finding samples for each version is also difficult, as each version allows for a variety of features that may not be available in another version. Luckily, there are some older “Free” versions out there with limited features, from which we can get some idea of the session format.

PRONOM has working identification for the two major formats and until PRONOM can incorporate Macintosh Resource Fork identification it will have to do. The PC version 4 and 5 formats could use more research as I only have one source. The groove and other formats all seem to have the same header so they will need more research as well. Until then, enjoy some sample files and also a disk image of some older Macintosh Pro Tools 3 sessions.

Script Writing

A few of you may remember a couple years ago reading in a Vice article about Eric Roth and his use of an old DOS only software program for writing all his Hollywood scripts. The Vice article was based on some earlier reporting in 2014 about his writing process. You can watch the full interview of Eric Roth on YouTube.

I remember seeing a link to the Vice article a couple years ago and finding the screenwriter’s use of an old DOS program, Movie Master, funny and interesting. He says in his interview that out of half superstition and half fear of change he prefers to use this very old software to write his screenplays. It’s so old and obsolete, he can’t even email the files to Hollywood. He has to print them out and have the studio scan them into modern software for use. The interview shows the screen of his old Windows computer and you can see the software he is using.

Of course, because I love researching obsolete software and formats so much, I wanted to know if the scripts generated by “Movie Master”, version 3.09, are in a format that needed to be documented. I was a little surprised that this version of Movie Master was nowhere to be found. It was on none of the old abandoned software sites. Not on Internet Archive, nowhere it seemed. I did find a later version of Movie Master, version 5, but found this software was not the same thing.

The original programmer of Movie Master was Adam Greissman, which you can clearly see in the screenshot above. The software was copyright Comprehensive Video Supply in the 1980’s, but the Movie Master version 5 was developed by Ballistic Software, Inc, which was also known as “Comprehensive Cinema Software” or “Hollywood Cinema Software” later in the 1990’s.

A very in-depth article by Daniel Plagens, Reinventing the Typewriter, mentions that Adam Greissman did not want to move the software from DOS to Windows, as he didn’t feel there was enough of a market at the time. As it turns out, the founder of Comprehensive Video Supply, Jules Leni, got a lot of pressure from users of Movie Master to develop a Windows and Macintosh version of the software after Greissman left the company in 1991. They released this new version in October of 1996.

Let’s take a look at a couple of example files from version 5.

hexdump -C Scene.scr | head
00000000 11 0d 0a 32 2e 20 20 20 20 15 0d 0a 15 0d 0a 15 |...2.    .......|
00000010 0d 0a 15 0d 0a 11 0d 0a 10 0d 0a 15 0d 0a 15 0d |................|
00000020 0a 15 0d 0a 10 0d 0a 46 41 44 45 20 49 4e 3a 15 |.......FADE IN:.|
00000030 0d 0a 54 68 65 20 66 6f 6c 6c 6f 77 69 6e 67 20 |..The following |
00000040 22 73 63 72 69 70 74 6c 65 74 22 20 64 65 6d 6f |"scriptlet" demo|
00000050 6e 73 74 72 61 74 65 73 20 68 6f 77 20 4d 6f 76 |nstrates how Mov|
00000060 69 65 20 4d 61 73 74 65 72 20 0d 0a 63 61 6e 20 |ie Master ..can |
00000070 62 65 20 75 73 65 64 20 74 6f 20 6f 75 74 6c 69 |be used to outli|
00000080 6e 65 20 73 63 65 6e 65 73 2e 20 20 4f 6e 63 65 |ne scenes.  Once|
00000090 20 79 6f 75 20 68 61 76 65 20 66 69 6e 69 73 68 | you have finish|

hexdump -C MM5-s01.scr | head
00000000 11 0d 0a 31 2e 20 20 20 20 15 0d 0a 15 0d 0a 15 |...1.    .......|
00000010 0d 0a 15 0d 0a 11 0d 0a 10 0d 0a 15 0d 0a 15 0d |................|
00000020 0a 15 0d 0a 10 0d 0a 54 45 53 54 49 4e 47 15 0d |.......TESTING..|
00000030 0a 7e 60 21 40 23 24 25 5e 26 2a 28 29 2d 2b 7c |.~`!@#$%^&*()-+||
00000040 3d 2d 54 65 43 66 4d 74 0d 0a 01 00 00 07 00 02 |=-TeCfMt........|
00000050 00 00 00 00 00 00 01 00 00 01 00 00 01 00 00 01 |................|
00000060 00 00 01 00 00 01 00 00 01 00 00 01 00 00 01 00 |................|
00000070 00 01 00 00 bf 03 00 00 0c 00 43 6f 75 72 69 65 |..........Courie|
00000080 72 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |r...............|
00000090 00 00 00 00 00 00 00 00 00 30 00 00 00 00 00 00 |.........0......|

Version 5 of Movie Master uses the extension SCR, which one could assume is short for “Script”. There does appear to be a header before any readable text starts, so that will be helpful in identification. Currently there is only one PUID in PRONOM with the extension SCR, x-fmt/100, which happens to be for an AutoCAD script and has no signature. As a result, anything with the SCR extension that you ask DROID or Siegfried to identify will default to an AutoCAD script, which is frustrating. According to the File Format Wiki, there are quite a few formats with the SCR extension. More work to be done there for sure.
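
Comparing the two version 5 dumps above, the first 39 bytes are identical apart from a single digit at offset 3. A hedged sketch of matching that run in Python follows. It is drawn from just these two samples, the name is hypothetical, and treating the varying byte as one ASCII digit is an assumption.

```python
import re

# Control bytes 0x11, 0x15 and 0x10, each followed by CR LF, as seen in both samples.
MM5_HEADER = re.compile(
    rb"\x11\r\n[0-9]\.    "   # 0x11 CRLF, one digit, a period, four spaces
    rb"(?:\x15\r\n){4}"       # four 0x15 CRLF groups
    rb"\x11\r\n\x10\r\n"      # 0x11 CRLF, then 0x10 CRLF
    rb"(?:\x15\r\n){3}"       # three more 0x15 CRLF groups
    rb"\x10\r\n"              # 0x10 CRLF, after which readable text starts
)

def looks_like_movie_master_5(data: bytes) -> bool:
    """True when the data opens with the header run seen in the two samples."""
    return MM5_HEADER.match(data) is not None
```

More samples would be needed before trusting this, especially scripts with different formatting settings.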

So I tried for a few weeks to find a copy of Movie Master version 3.09. I even put in an eBay favorite search for the name so it would alert me to a copy being sold, but no such luck. I gave up for a while, then recently someone posted a link to a large collection of early warez. Warez is the name given to software that has been illegally copied. When I followed the link and searched through the vast amount of software titles, I got excited to see a couple matches to “Movie Master”. After a little wrangling of some downloads, I spun up a copy of DOSBox and lo and behold, Movie Master 3.09!

Welcome to Movie Master V3.09 about screen

A lot of people have compared the old DOS scriptwriting tools to early word processors like Word, Perfect Writer, WordStar, etc. They did much of the same thing, but with special controls for helping with scenes, characters, indents, and everything writers needed to make some of the best Hollywood films out there. As Daniel Plagens noted:

The program proved popular for many years. Greissman estimates they sold over 10,000 units—“saturating the market,” as he put it—and recalls seeing help wanted ads in Hollywood Reporter and Variety where knowledge of Movie Master was a hiring requirement. He visited the sets of Days of Thunder and Hunt for Red October to help their writers and production teams acclimate to Movie Master.

Makes me wonder where all the old scripts from Hollywood movies are located in their electronic form? I am sure Eric Roth probably has quite the collection of different scripts he has written. I sure hope he backs them up and donates them to a library in the future.

Well, let’s take a look at a couple sample files from Movie Master version 3 and version 4. Version 4.04 was also in the collection uploaded to Internet Archive.

hexdump -C TEST3.SCR | head 
00000000 33 2e 30 39 0a 00 00 00 00 31 00 00 00 00 00 00 |3.09.....1......|
00000010 31 00 00 00 00 00 00 0a 00 4e 41 4d 45 20 3f 0a |1........NAME ?.|
00000020 ff 53 43 52 45 45 4e 0a 2a 42 01 19 3c 01 1e 37 |.SCREEN.*B..<..7|
00000030 01 1c 2f 01 14 25 01 18 24 01 39 4c 01 31 42 01 |../..%..$.9L.1B.|
00000040 35 41 01 0a 46 01 0a 46 01 3d 4b 01 02 00 01 0a |5A..F..F.=K.....|
00000050 03 00 54 65 73 74 69 6e 67 20 4d 6f 76 69 65 20 |..Testing Movie |
00000060 4d 61 73 74 65 72 20 76 65 72 73 69 6f 6e 20 33 |Master version 3|
00000070 2e 30 39 11 11 31 11 31 0a |.09..1.1.|

hexdump -C TEST.SCR
00000000 34 2e 30 34 0a 00 00 00 00 31 00 00 00 00 00 00 |4.04.....1......|
00000010 31 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0a |1...............|
00000020 ff 0a 2a 42 01 00 19 3c 01 00 1e 37 01 00 1c 2f |..*B...<...7.../|
00000030 01 00 14 25 01 00 18 24 01 00 39 4c 01 00 31 42 |...%...$..9L..1B|
00000040 01 00 35 41 01 00 0a 46 01 00 0a 46 01 00 3d 4b |..5A...F...F..=K|
00000050 01 00 0a 18 01 00 0a 46 01 00 02 00 00 54 68 69 |.......F.....Thi|
00000060 73 20 69 73 20 61 20 74 65 73 74 20 6f 66 20 4d |s is a test of M|
00000070 6f 76 69 65 20 4d 61 73 74 65 72 20 53 63 72 69 |ovie Master Scri|
00000080 70 74 20 77 72 69 74 69 6e 67 20 73 6f 66 74 77 |pt writing softw|
00000090 61 72 65 2e 0a 01 03 00 00 31 0a 01 00 00 00 00 |are......1......|
000000a0 0a 03 01 0a |....|

hexdump -C COVER.SCR | head
00000000 33 2e 30 35 0a 01 00 00 00 31 00 00 00 00 00 00 |3.05.....1......|
00000010 31 00 00 00 00 00 00 0a ff 43 4f 56 45 52 0a 2a |1........COVER.*|
00000020 42 01 19 3c 01 1e 37 01 1c 2f 01 14 25 01 18 24 |B..<..7../..%..$|
00000030 01 39 4c 01 31 42 01 35 41 01 0a 46 01 0a 46 01 |.9L.1B.5A..F..F.|
00000040 3d 4b 01 06 00 00 0a 03 01 31 0a 01 03 00 00 11 |=K.......1......|
00000050 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 |................|
00000060 11 11 11 11 11 11 11 11 20 20 20 20 20 20 20 20 |........ |
00000070 20 20 20 20 20 20 20 20 20 20 20 20 20 22 4d 65 | "Me|
00000080 65 74 20 74 68 65 20 44 72 61 63 75 6c 61 73 22 |et the Draculas"|
00000090 11 11 11 11 11 20 20 20 20 20 20 20 20 20 20 20 |.....

hexdump -C DRAC2.SCR | head
00000000 34 2e 30 30 0a 01 00 2b 00 36 00 00 00 00 00 00 |4.00...+.6......|
00000010 35 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0a |5...............|
00000020 00 42 4f 42 0a 01 54 45 44 0a 02 43 41 52 4f 4c |.BOB..TED..CAROL|
00000030 0a 03 41 4c 49 43 45 0a 04 49 47 4f 52 0a 05 44 |..ALICE..IGOR..D|
00000040 45 4e 4e 49 53 0a 06 4d 55 46 46 49 4e 0a ff 53 |ENNIS..MUFFIN..S|
00000050 43 52 45 45 4e 0a 2a 42 01 00 19 3c 01 00 1e 37 |CREEN.*B...<...7|
00000060 01 00 1c 2f 01 00 14 25 01 00 18 24 01 00 39 4c |.../...%...$..9L|
00000070 01 00 31 42 01 00 35 41 01 00 0a 46 01 00 0a 46 |..1B..5A...F...F|
00000080 01 00 3d 4b 01 00 0a 18 01 00 0a 46 01 00 02 01 |..=K.......F....|
00000090 01 35 0a 03 00 45 58 54 20 54 45 44 20 44 52 41 |.5...EXT TED DRA|

The first thing to notice is they all start with the version number of the software which wrote the file. Really nice to have, but a terrible magic header. The files also all begin (after the version number) and end with the hex value “0A”, which is the line feed control character. So super common, but could be helpful. Another pattern is that the byte at offset 9 is “31” in most of the samples and “36” in one of them. These are the ASCII digits “1” and “6”, so this could be the sequence number for the script, as each SCR file could only store what was in memory.
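The observations above can be turned into a rough heuristic. The following is a minimal Python sketch, not a real signature: the function names `movie_master_version` and `counter_byte` are hypothetical, and the pattern is based only on the four samples shown, so it will happily match other files that begin with a similar version string.

```python
import re

# Heuristic from the four samples above: an ASCII version string such as
# "3.09" or "4.04", immediately followed by a line feed (0x0A).
VERSION_HEADER = re.compile(rb"([3-4]\.\d{2})\n")

def movie_master_version(data: bytes):
    """Return the leading version string, or None if the pattern is absent."""
    match = VERSION_HEADER.match(data)
    return match.group(1).decode("ascii") if match else None

def counter_byte(data: bytes):
    """Return the byte at offset 9 as a character ("1" or "6" in the samples)."""
    return chr(data[9]) if len(data) > 9 else None
```

Because the version string is the only fixed text at the start of the file, this is best treated as a hint to combine with the SCR extension rather than as identification on its own.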

I fear the rest of the format will have the same issue most word processors had at the time, which is not having a header, but lots of formatting codes which may or may not be in every file, making programmatic identification difficult. It might take a while to identify all the formatting codes, but doing so could lead to better identification and possibly an import module for tools like LibreOffice or Final Draft.

Screenshot of Movie Master 4.04 start screen

I didn’t find much different with Movie Master 4; it seemed to have the same restriction of 16 files in a script. The files from version 4 also seem to follow the same patterns as those from version 3. But both versions are different from the Windows version of Movie Master, version 5. The Movie Master 5 help menu has a section titled “Introduction for Movie Master DOS Users”.

There was another elusive script writing software title which adds to the confusion. Scriptware was another screenwriting software tool which seems to have had a large following. They produced a Windows and Macintosh version. It also started out for DOS and also used the SCR extension. The website is still active for the software, but hasn’t been updated in 24 years. I wrote a little about it in my post on PROmotion. All the demo versions out there are not usable demos, but animation demos. In this nice batch of old software on the Internet Archive I was able to find an early copy. I wasn’t able to get it to run, but the folder did have some samples.

hexdump -C SAMPLE1.SCR | head
00000000 32 5f 01 00 00 00 00 00 00 00 00 39 01 4a 5f 00 |2_.........9.J_.|
00000010 ff ff 2c 01 00 00 00 00 00 00 95 80 01 00 11 53 |..,............S|
00000020 63 72 69 70 74 77 61 72 65 20 53 63 72 69 70 74 |criptware Script|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000000b0 00 00 00 00 00 00 00 00 11 00 02 00 02 00 14 00 |................|
000000c0 12 00 6f 02 f9 04 0b 00 7b 04 01 00 05 00 00 02 |..o.....{.......|
000000d0 00 11 00 00 00 0c 00 00 00 06 00 ed 01 05 06 00 |................|
000000e0 00 00 00 00 08 00 0b 00 00 00 04 00 00 00 04 00 |................|
000000f0 82 00 01 01 00 00 00 00 00 00 00 00 00 00 00 00 |................|

hexdump -C SAMPLE2.SCR | head
00000000 0b 53 63 72 69 70 74 77 61 72 65 1a 95 80 04 80 |.Scriptware.....|
00000010 1e 53 63 72 69 70 74 77 61 72 65 20 53 63 72 69 |.Scriptware Scri|
00000020 70 74 20 32 2e 32 33 3a 34 3b 37 30 32 32 31 00 |pt 2.23:4;70221.|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000000a0 00 00 00 00 00 00 00 00 00 00 11 00 02 00 02 00 |................|
000000b0 11 00 00 00 34 02 01 05 0b 00 89 04 01 00 05 00 |....4...........|
000000c0 00 02 00 11 00 00 00 0b 00 00 00 06 00 aa 01 05 |................|
000000d0 06 00 00 00 00 00 08 00 0c 00 00 00 05 00 00 00 |................|
000000e0 05 00 8a 00 01 01 00 00 00 00 00 00 00 00 00 00 |................|

Luckily, they make it quite easy to identify these SCR files, as both samples carry a readable “Scriptware Script” label near the start. Scriptware was very popular and continued on with Windows and Macintosh versions. Later on, the format was changed along with the extension, which changed to SW3.
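As a rough illustration, the readable label in both samples can be pulled out with a few lines of Python. This is a sketch under assumptions: the helper name `scriptware_label` is made up, and reading the byte before the label as a length prefix is only an inference from the two dumps above (0x11 is 17 and 0x1e is 30, which match the lengths of the text that follows).

```python
def scriptware_label(data: bytes, window: int = 64):
    """Return the 'Scriptware Script...' label near the start, or None."""
    idx = data[:window].find(b"Scriptware Script")
    if idx < 1:
        return None
    # Assumption: the byte just before the label holds its length.
    length = data[idx - 1]
    return data[idx:idx + length].decode("ascii", errors="replace")
```

On the two samples this returns the bare label for SAMPLE1.SCR and the label plus a version-like string for SAMPLE2.SCR, which hints that the header layout changed between releases.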

The SCR extension has been used often. On my desktop these files default to a Paintbrush document. Apparently SCR is sometimes used as an extension for the ZSoft Paintbrush (PCX) format. It is also used on older PostScript fonts on the Macintosh as a Type 1 screen font. It can also be a screensaver on Windows, but watch out, they can hide malicious code. You get the idea, SCR is a very common extension, and identifying it up front can help avoid problems later!

Moral of the story is to never give up searching for old software and even though illegal copying of software should be avoided, I am grateful to those who help save abandoned software. Without them many titles would be lost.

I don’t have a good signature for these formats yet, but you can find a few samples on my GitHub page.

CD Architect

Receiving electronic media from an outside source can be an adventure. Often times you find yourself sorting the valuable files and separating them from the chaff. There can be hidden files, cache files, application files, drivers, and everything in between. Determining what formats are important can sometimes be difficult, especially if you don’t know the file format of some of the files.

I was recently working on a collection of files which had been produced through some audio software. When working with audio, a WAVE file is what is usually kept, as it contains the actual audio data. These files came with a couple of other formats. One of those formats was a bunch of SFK peak files. These files are meant to be temporary, as they are generated from the WAVE file to make opening of audio data faster. They are important, but can easily be regenerated. One could argue they have historical value, but they don’t contain anything that can be used by itself, so alone they don’t have much value.

The other format found with the WAVE files has a CDP extension. These came up as unknown when using DROID. It is not a common extension, so finding the name of the software which created the files wasn’t too hard. Let’s take a look at one of them.

hexdump -C tutor1.cdp | head
00000000 52 49 46 46 79 03 00 00 53 46 50 4a 66 6d 74 20 |RIFFy...SFPJfmt |
00000010 18 00 00 00 00 00 01 00 02 00 00 00 10 00 00 00 |................|
00000020 44 ac 00 00 03 00 00 00 01 00 00 00 4c 49 53 54 |D...........LIST|
00000030 88 00 00 00 66 6c 73 74 66 69 6c 65 23 00 00 00 |....flstfile#...|
00000040 44 3a 5c 53 6f 75 6e 64 73 5c 4e 65 77 20 54 75 |D:\Sounds\New Tu|
00000050 74 6f 72 20 66 69 6c 65 73 5c 53 6f 6e 67 33 2e |tor files\Song3.|
00000060 77 61 76 00 66 69 6c 65 23 00 00 00 44 3a 5c 53 |wav.file#...D:\S|
00000070 6f 75 6e 64 73 5c 4e 65 77 20 54 75 74 6f 72 20 |ounds\New Tutor |
00000080 66 69 6c 65 73 5c 53 6f 6e 67 32 2e 77 61 76 00 |files\Song2.wav.|
00000090 66 69 6c 65 23 00 00 00 44 3a 5c 53 6f 75 6e 64 |file#...D:\Sound|

Huh, this is a RIFF file. RIFF is most commonly used as the container for WAVE and AVI files. You can read more about the RIFF format in a previous post. The RIFF container format can be used for all sorts of things. Looking at the internals we can see a few unique list chunks.

Lots of references to other files, specifically WAVE files. But not a lot of actual data. That is because this format turns out to be just a project format for some software called “CD Architect“. Sonic Foundry was an audio software developer for a few years before they sold their catalog to Sony in 2003. In looking at the manual for CD Architect version 5.2, it explains the CDP Project format.

CD Architect software handles the organization of your CD using a small project file (CDP) that saves information about source file locations, edits, cuts, and insertion points. This project file is not a multimedia file, but is instead used to create the CD when editing is finished.

Looking at another CDP file from the collection, I noticed something different.

hexdump -C CDArch50a-s01.cdp | head
00000000 72 69 66 66 2e 91 cf 11 a5 d6 28 db 04 c1 00 00 |riff......(.....|
00000010 20 0a 00 00 00 00 00 00 84 38 15 b3 da 08 85 44 | ........8.....D|
00000020 b2 2a 5b 70 a1 32 15 ff 5a 2d 8f b2 0f 23 d2 11 |.*[p.2..Z-...#..|
00000030 86 af 00 c0 4f 8e db 8a 00 02 00 00 00 00 00 00 |....O...........|
00000040 78 00 00 00 00 00 04 00 11 00 00 00 44 ac 00 00 |x...........D...|
00000050 00 00 00 00 00 c0 52 40 00 00 00 00 00 00 5e 40 |......R@......^@|
00000060 00 00 00 00 00 00 00 00 04 00 04 00 40 00 00 00 |............@...|
00000070 00 00 00 00 00 00 00 00 00 00 00 00 7c 00 00 00 |............|...|
00000080 50 00 00 00 a0 00 00 00 00 00 00 00 00 00 00 00 |P...............|
00000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

That’s odd, the RIFF format is always uppercase ASCII, this is lowercase. Also the important RIFF form, which was “SFPJ” in the other sample, is missing. This is not a valid RIFF format.

But further down in the file I can see the same list chunks. Did they take the RIFF format and make a proprietary version of their own? I think they may have. It seems the first example was from CD Architect version 4 and these other files are from CD Architect version 5. That complicates things. Sony stopped developing CD Architect after version 5.2d and maintained it for a few years before selling many of their titles to MAGIX Software. As far as I know there were never any new versions released. The software was very popular, as it had some really nice audio mastering features and was easy to use. Many were upset when the software was abandoned.

Creating a signature for both version 4 and version 5 CDP files will be pretty straightforward. I feel knowing what you have in a collection you are processing is the first step in making informed decisions. Whether or not you keep the project files is up for debate. Some may only want the final audio created from a CD Architect project, while others may want to see the way the audio was put together and mixed. Either way, the more you know...
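To make the difference between the two headers concrete, here is a minimal Python sketch that separates them. It is not the proposed PRONOM signature: the function name `classify_cdp` is hypothetical, and treating all 16 leading bytes of the version 5 sample as constant is an assumption drawn from the single dump shown above.

```python
RIFF_TAG = b"RIFF"
FORM_V4 = b"SFPJ"
# First 16 bytes of the version 5 sample: lowercase "riff" plus 12 more bytes.
# Assumption: these 12 bytes are a fixed identifier and do not vary per file.
HEADER_V5 = bytes.fromhex("726966662e91cf11a5d628db04c10000")

def classify_cdp(data: bytes) -> str:
    """Label a CDP header as the version 4 or version 5 layout."""
    if data[:4] == RIFF_TAG and data[8:12] == FORM_V4:
        return "CD Architect 4 (RIFF with SFPJ form)"
    if data[:16] == HEADER_V5:
        return "CD Architect 5 (lowercase riff header)"
    return "unknown"
```

A WAVE file also starts with “RIFF”, so the check on the form type at offset 8 is what keeps ordinary audio files from being labelled as version 4 projects.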

One more thing. CD Architect would default to saving a CDP project file, but could also save a “CD Image file”. This process actually would save the project to a full WAVE file with some extras baked in.

An image file is essentially a wave file with volume, crossfades, effects, mixes, and track information embedded. Burning an image file will reduce the risk of buffer underruns (especially if you have a complex project or are using a slow computer) since no audio processing is required. 

Interesting, normally when working with track information in a single WAVE file you would need a companion CUE Sheet in order to reference the track layout of the Audio CD. So I am curious how they do all of this. Let’s take a look at a “CD Image”.

mediainfo CDArch52d-s02.wav
General
Complete name : CDArch52d-s02.wav
Format : Wave
Format settings : PcmWaveformat
File size : 5.05 MiB
Duration : 30 s 0 ms
Overall bit rate mode : Constant
Overall bit rate : 1 411 kb/s
Conformance errors : 2
RIFF : Yes
General compliance : File size 5292434 is less than expected size 5292823 (offset 0x8)
WAVE : Yes
General compliance : Element size 5292811 is more than maximal permitted size 5292422 (offset 0xC)

Audio
Format : PCM
Format settings : Little / Signed
Codec ID : 1
Duration : 30 s 0 ms
Bit rate mode : Constant
Bit rate : 1 411.2 kb/s
Channel(s) : 2 channels
Sampling rate : 44.1 kHz
Bit depth : 16 bits
Stream size : 5.05 MiB (100%)

Already seeing some issues with the format, but all the important bits are there. JHOVE doesn’t like them much either.

JhoveView (Rel. 1.32.0, 2024-09-12)
Date: 2024-12-11 16:01:08 MST
RepresentationInformation: CDArch52d-s02.wav
ReportingModule: WAVE-hul, Rel. 1.8.3 (2024-03-05)
LastModified: 2024-12-11 15:58:02 MST
Size: 5292434
Format: WAVE
Status: Not well-formed
SignatureMatches:
WAVE-hul
InfoMessage: Ignored unrecognized list type: "pqls"
ID: WAVE-HUL-15
Offset: 5292044
ErrorMessage: Unexpected end of file: Bytes missing = 389
ID: WAVE-HUL-3
Offset: 5292434
MIMEtype: audio/vnd.wave; codec=1
Profile: PCMWAVEFORMAT

JHOVE is giving me two issues. The major error is that the file appears truncated, according to both MediaInfo and JHOVE. The InfoMessage is less of an issue and more of a heads up that the WAVE file has an extra LIST type, “pqls”, which was also in the CDP RIFF file we looked at earlier. So it seems that making a “CD Image” of a project embeds the project chunk data into the WAVE container. Identification is not an issue, as these WAVE files follow the standard pattern and therefore identify correctly, but one might want to be aware, through further characterization, that these WAVE files carry some not so obvious extra data.
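Both findings can be reproduced with a short chunk walk. The sketch below is a minimal Python illustration of how a RIFF file is laid out (4-byte ID, 4-byte little-endian size, payload padded to an even length); the function name `riff_report` and the shape of its result are my own choices, not something MediaInfo or JHOVE expose. The missing-byte count is simply the declared size plus the 8-byte header minus the real file size, which for the reports above is 5292823 - 5292434 = 389.

```python
import struct

def riff_report(data: bytes) -> dict:
    """Summarise a RIFF file: missing bytes and the top-level chunks."""
    if len(data) < 12 or data[:4] != b"RIFF":
        raise ValueError("not a RIFF file")
    (declared,) = struct.unpack_from("<I", data, 4)
    expected_size = declared + 8          # size field excludes the 8-byte header
    chunks = []
    pos = 12                              # skip "RIFF", size, and form type
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        name = chunk_id.decode("ascii", errors="replace")
        if chunk_id == b"LIST":
            # A LIST chunk starts with a 4-byte list type such as "pqls".
            name += ":" + data[pos + 8:pos + 12].decode("ascii", errors="replace")
        chunks.append((name, pos, size))
        pos += 8 + size + (size & 1)      # chunks are padded to an even length
    return {
        "form": data[8:12].decode("ascii", errors="replace"),
        "bytes_missing": max(0, expected_size - len(data)),
        "chunks": chunks,
    }
```

Running something like this over a collection would flag every WAVE file that carries a “LIST:pqls” entry, which is a cheap way to spot CD Architect image files after they have already identified as plain WAVE.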

My attempts to find any samples from version 3 of CD Architect have failed. Until then, my proposal is to add versions 4 and 5 to PRONOM with the signature on my GitHub page. There you will find a few samples as well.

NCH Software

Recently I came across a piece of software which used dozens of extensions for a single file format.

These T-Shirt Factory Deluxe files are a bit of an extreme, probably a prank against all of us doing file format identification. If you know who made this decision, I would like to have a chat.

This is not the first time I have come across a format which seems to have been used for more than one software title. A while back I tried to find more information on a file format used with many tools created by MetaCreations. It was called “Composite File Management System”, and was used with Kai’s Power Tools, Bryce3D, Ray Dream, Poser, and others. I did a previous post about the format.

I came across another recently with a similar issue. There are also many different software titles with the same native format.

NCH Software is an Australian software company who produce a massive number of software titles covering many different needs. From Audio Editing to Business charts and from Accounting tools to a 3D model converter, they have it all. Their audio editing software WavePad is quite popular. My initial entry into their software world was for the specialized Dictation/Scribe software which produced a slightly proprietary audio format with the extension DCT. This format does not use the format many of the other titles use.

With the number of different titles, it probably makes sense they use the same file structure to make processing/programming more efficient. They appear to be mostly proprietary binary files.

hexdump -C Wavepad/Untitled2.wpp | head
00000000 6c 73 64 66 01 00 1a 00 00 00 07 00 00 00 00 00 |lsdf............|
00000010 ca 84 20 00 00 00 00 00 e9 03 00 00 a5 84 20 00 |.. ........... .|
00000020 00 00 00 00 d0 07 00 00 99 84 20 00 00 00 00 00 |.......... .....|
00000030 d1 07 06 00 24 00 00 00 00 00 00 00 2f 55 73 65 |....$......./Use|
00000040 72 73 2f 74 79 6c 65 72 2f 44 65 73 6b 74 6f 70 |rs/tyler/Desktop|
00000050 2f 55 6e 74 69 74 6c 65 64 5f 30 2e 77 61 76 00 |/Untitled_0.wav.|
00000060 dc 07 02 00 04 00 00 00 00 00 00 00 00 00 00 00 |................|
00000070 d2 07 03 00 08 00 00 00 00 00 00 00 00 00 00 00 |................|
00000080 00 00 00 00 d3 07 03 00 08 00 00 00 00 00 00 00 |................|
00000090 00 00 00 00 00 00 00 00 d4 07 03 00 08 00 00 00 |................|

hexdump -C Crescendo/examples/Grooving.cdo | head
00000000 6c 73 64 66 01 00 05 00 00 00 03 00 00 00 00 00 |lsdf............|
00000010 8a b5 00 00 00 00 00 00 00 10 00 00 65 05 00 00 |............e...|
00000020 00 00 00 00 01 11 04 00 04 00 00 00 00 00 00 00 |................|
00000030 00 00 00 41 02 11 02 00 04 00 00 00 00 00 00 00 |...A............|
00000040 05 00 00 00 03 11 04 00 04 00 00 00 00 00 00 00 |................|
00000050 00 00 52 43 04 11 04 00 04 00 00 00 00 00 00 00 |..RC............|
00000060 00 80 94 43 05 11 04 00 04 00 00 00 00 00 00 00 |...C............|
00000070 00 00 a0 41 06 11 02 00 04 00 00 00 00 00 00 00 |...A............|
00000080 01 00 00 00 07 11 04 00 04 00 00 00 00 00 00 00 |................|
00000090 00 00 00 00 08 11 04 00 04 00 00 00 00 00 00 00 |................|

hexdump -C Spin3D/bunny.3dp | head
00000000 6c 73 64 66 01 00 20 00 00 00 01 00 00 00 00 00 |lsdf.. .........|
00000010 ec bc 65 00 00 00 00 00 00 10 00 00 e0 bc 65 00 |..e...........e.|
00000020 00 00 00 00 00 12 00 00 38 bc 65 00 00 00 00 00 |........8.e.....|
00000030 01 12 07 00 8c 26 26 00 00 00 00 00 cc d1 27 3f |.....&&.......'?|
00000040 1c b5 80 3f 3c f4 bd 3d d9 79 27 3f de af 80 3f |...?<..=.y'?...?|
00000050 bf 81 a9 3d ad fa 28 3f 10 e7 7d 3f 05 a8 a9 3d |...=..(?..}?...=|
00000060 ec a4 1a 3f 56 29 49 3f ab d0 c0 3d 3e 3c 1f 3f |...?V)I?...=><.?|
00000070 5f ed 4c 3f 5a 48 c0 3d 04 59 1b 3f 48 53 49 3f |_.L?ZH.=.Y.?HSI?|
00000080 42 e9 ab 3d 74 5d 1c 3f 05 6c 3b 3f f7 03 5e 3d |B..=t].?.l;?..^=|
00000090 46 d2 1a 3f f6 d4 3e 3f ef ac 5d 3d 94 db 1a 3f |F..?..>?..]=...?|

hexdump -C Voxal/Geek.voxal | head
00000000 6c 73 64 66 01 00 0c 00 00 00 01 00 00 00 00 00 |lsdf............|
00000010 ea 01 00 00 00 00 00 00 ec 03 01 00 01 00 00 00 |................|
00000020 00 00 00 00 01 e8 03 00 00 a9 01 00 00 00 00 00 |................|
00000030 00 00 20 02 00 04 00 00 00 00 00 00 00 13 00 00 |.. .............|
00000040 00 00 10 00 00 39 00 00 00 00 00 00 00 00 10 00 |.....9..........|
00000050 00 0d 00 00 00 00 00 00 00 00 20 01 00 01 00 00 |.......... .....|
00000060 00 00 00 00 00 00 01 20 04 00 04 00 00 00 00 00 |....... ........|
00000070 00 00 c3 f5 40 41 02 20 02 00 04 00 00 00 00 00 |....@A. ........|
00000080 00 00 22 00 00 00 00 20 02 00 04 00 00 00 00 00 |..".... ........|
00000090 00 00 0e 00 00 00 00 10 00 00 29 00 00 00 00 00 |..........).....|

hexdump -C PhotoPad/test.ppp | head
00000000 6c 73 64 66 01 00 02 00 00 00 00 00 00 00 00 00 |lsdf............|
00000010 ee 3c 00 00 00 00 00 00 c9 00 01 00 01 00 00 00 |.<..............|
00000020 00 00 00 00 00 04 00 00 00 d5 3c 00 00 00 00 00 |..........<.....|
00000030 00 02 00 00 00 c9 3c 00 00 00 00 00 00 03 00 06 |......<.........|
00000040 00 0f 00 00 00 00 00 00 00 6f 72 69 67 69 6e 61 |.........origina|
00000050 6c 5f 69 6d 61 67 65 00 01 00 00 00 85 3c 00 00 |l_image......<..|
00000060 00 00 00 00 07 00 07 00 79 3c 00 00 00 00 00 00 |........y<......|
00000070 89 50 4e 47 0d 0a 1a 0a 00 00 00 0d 49 48 44 52 |.PNG........IHDR|
00000080 00 00 04 00 00 00 03 00 08 06 00 00 00 ba ba 15 |................|
00000090 0d 00 00 00 01 73 52 47 42 00 ae ce 1c e9 00 00 |.....sRGB.......|

Above are just a few of the titles which use the same structure. The LSDF string is the first 4 bytes and always the last 4 bytes. The next two bytes, 0100, seem consistent for all samples, but the two bytes after that seem to be unique to the software. So far I have found the following titles use the format.

Software Title | Name | Extension | Pattern
WavePad | WavePad Audio Editor Project File | WPP | 6C736466 01001A00
Crescendo | Crescendo Score File | CDO | 6C736466 01000500
Spin3D | NCH Software model format | 3DP | 6C736466 01002000
Voxal | Voxal Voices File | VOXAL | 6C736466 01000C00
PhotoPad | PhotoPad Project File | PPP | 6C736466 01000200
MixPad | MixPad Project | MPDP | 6C736466 01000400
Disketch | Disketch Project | DEPROJ | 6C736466 01000700
ClickCharts | ClickCharts Diagram | CCD | 6C736466 01000A00
DreamPlan | DreamPlan File | DDP | 6C736466 01001300
DrawPad | DrawPad File | DRP | 6C736466 01001500

Without downloading and installing their vast library of software it’s hard to know all the different titles which use the format. The rest of the file for each sample seems to be proprietary in a binary format, except a few with a PNG image mixed in.

The simplest sample I could find was a preset file for the Zulu DJ Software which uses the ECF extension. The ECF extension is common with a few of the titles, like effect chains for WavePad and MixPad.

hexdump -C Zulu/Untitled.ecf
00000000 6c 73 64 66 01 00 0c 00 00 00 01 00 00 00 00 00 |lsdf............|
00000010 6b 00 00 00 00 00 00 00 00 10 00 00 1a 00 00 00 |k...............|
00000020 00 00 00 00 00 00 01 00 01 00 00 00 00 00 00 00 |................|
00000030 00 01 00 01 00 01 00 00 00 00 00 00 00 00 00 30 |...............0|
00000040 02 00 04 00 00 00 00 00 00 00 01 00 00 00 00 20 |............... |
00000050 00 00 29 00 00 00 00 00 00 00 00 10 00 00 0d 00 |..).............|
00000060 00 00 00 00 00 00 00 20 01 00 01 00 00 00 00 00 |....... ........|
00000070 00 00 00 00 20 02 00 04 00 00 00 00 00 00 00 00 |.... ...........|
00000080 00 00 01 6c 73 64 66 |...lsdf|

This header is identical to the header for the VOXAL format, so I am not sure if the second set of 4 bytes is directly connected to the software title, or if their purpose is something else.

The question that needs to be answered is how we might represent these formats in PRONOM if needed. We could create a unique signature for each title based on the magic header and footer and the second set of 4 bytes which may indicate the software. Or create a single generic signature to identify the basic format using the magic header and footer and adding all the extensions to the list, which would be lengthy. This would be the easiest and catch all formats related to NCH Software using this file format, but then additional characterization would need to happen to identify the specific software title needed to render the file.
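The two options can be sketched together in Python: a generic check on the “lsdf” header and footer, followed by a lookup of the bytes that may indicate the title. This is an illustration only. The name `identify_lsdf` is hypothetical, reading bytes 4 to 7 as two little-endian 16-bit values is an assumption, and the lookup table holds just the codes seen in the samples above, one of which (0x000C) is shared by Voxal and the Zulu ECF preset.

```python
import struct

MAGIC = b"lsdf"

# Codes observed in the samples above. A code is not guaranteed to be unique
# to one title: Voxal and the Zulu ECF preset both show 0x000C.
OBSERVED_CODES = {
    0x001A: "WavePad (WPP)",
    0x0005: "Crescendo (CDO)",
    0x0020: "Spin3D (3DP)",
    0x000C: "Voxal (VOXAL) or ECF preset",
    0x0002: "PhotoPad (PPP)",
    0x0004: "MixPad (MPDP)",
    0x0007: "Disketch (DEPROJ)",
    0x000A: "ClickCharts (CCD)",
    0x0013: "DreamPlan (DDP)",
    0x0015: "DrawPad (DRP)",
}

def identify_lsdf(data: bytes):
    """Return a label for an lsdf file, or None if the generic check fails."""
    if len(data) < 12 or not (data.startswith(MAGIC) and data.endswith(MAGIC)):
        return None
    # Assumption: bytes 4-5 and 6-7 are little-endian 16-bit fields.
    fixed, code = struct.unpack_from("<HH", data, 4)
    if fixed != 1:
        return None
    return OBSERVED_CODES.get(code, f"unknown lsdf code 0x{code:04X}")
```

The generic check alone corresponds to the single catch-all signature, while the lookup is the extra characterization step that would be needed afterwards to pick the software title.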

The NCH Software company seems to churn out new software and versions quite frequently and a search for reviews of their software turns up some questionable results. Many might enjoy their software as they are easy to use and are free for home use. I had lots of trouble with a few of them as they wanted to mount network locations and disk images I had used recently, which seems sketchy. I would love to know if anyone uses their software and has any need to preserve these formats. I currently don’t, but found the common use of a file format intriguing. I also found no reference to the magic bytes they use, except for a few TrID entries. Marco always is a step ahead!

KODAK TIFF

Years ago I bought my first digital camera. It was an Epson PhotoPC 3100z and I bought it because it could capture a digital image directly to a TIFF file. I don’t think most people would care about such a feature, but I thought it was awesome. Granted, it filled up the small 32MB CompactFlash card pretty quickly, so I had to upgrade to a 512MB card, which set me back.

TIFF images are pretty universal, they have a well known structure and have been around for a very long time. I have written about TIFFs before, so I won’t go into too much about the format. The format is well respected in the preservation community, although one of the best websites documenting the various TIFF tags, Aware Systems, has gone dark this year. Here is an archived version.

Many digital cameras, from the beginning to now, use the TIFF format to store RAW sensor data. Most use their own extension and follow well established methods for storing the sensor data in an IFD with lots of common and custom tags. The DNG format is an open RAW format which uses the TIFF format to store sensor data, although many use SubIFDs and can be incompatible with some software.

The first Digital Camera was invented by a Kodak employee, Steve Sasson in 1975, well, he was the first to use a CCD sensor in a self contained unit. This led Kodak to push the technology forward and in 1991 released the Kodak DCS digital system which used Nikon cameras equipped with a digital sensor. These early digital cameras were quite expensive, they used early CF cards and SCSI connections. Kodak released a few models of the DCS series, first on Nikon bodies, then on some Canon bodies. These early cameras used the TIFF format to store the RAW sensor data. For some reason, they decided to use a proprietary method and compression while still using the TIF extension.

Kodak was responsible for many new image file formats. I am not sure why they decided to use a common format like TIFF and keep the TIF extension, but make it proprietary. The RAW files created by the DCS series of cameras had to be opened with special plugins or software. If you tried to open the TIFFs with anything else, you would only see the small thumbnail image located at IFD0 instead of the full size image hidden in a SubIFD1.
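To see why a generic viewer stops at the thumbnail, it helps to look at how a reader finds a SubIFD at all. The Python sketch below reads the TIFF header, walks the entries of IFD0, and returns the offsets stored under tag 330 (0x014A), the standard SubIFDs tag. The function name `subifd_offsets` is hypothetical, and that the DCS files use this particular tag to point at the full size image is an assumption here, not something verified against the samples.

```python
import struct

SUBIFDS_TAG = 330  # TIFF tag 0x014A, "SubIFDs"

def subifd_offsets(data: bytes) -> list:
    """Return the SubIFD offsets listed in IFD0, or an empty list."""
    endian = {b"II": "<", b"MM": ">"}.get(data[:2])
    if endian is None:
        raise ValueError("not a TIFF byte-order mark")
    magic, ifd0 = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF header")
    (entries,) = struct.unpack_from(endian + "H", data, ifd0)
    for i in range(entries):
        entry = ifd0 + 2 + 12 * i         # each IFD entry is 12 bytes
        tag, _type, count = struct.unpack_from(endian + "HHI", data, entry)
        if tag != SUBIFDS_TAG:
            continue
        if count == 1:
            # A single offset fits in the entry's 4-byte value field.
            return list(struct.unpack_from(endian + "I", data, entry + 8))
        # Otherwise the value field points at an array of offsets.
        (array_at,) = struct.unpack_from(endian + "I", data, entry + 8)
        return list(struct.unpack_from(endian + f"{count}I", data, array_at))
    return []
```

Software that only renders IFD0 never follows these offsets, which matches the behaviour described above: the thumbnail shows up and the sensor data stays hidden.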

Finding samples of this format is particularly hard as they have the common TIF extension. The cameras are also pretty rare and finding one is difficult, especially in working condition. I was only aware of a couple samples on the rawsamples.ch site, but that wasn’t enough to understand the format as the two files had a different structure.

hexdump -C RAW_KODAK_DCS460D_FILEVERSION_3.TIF | head
00000000 49 49 2a 00 00 03 00 00 7c 01 00 00 00 00 00 00 |II*.....|.......|
00000010 4b 4f 44 41 4b 20 20 20 20 20 20 20 20 20 20 20 |KODAK |
00000020 44 43 53 34 36 30 44 20 20 20 20 20 20 20 20 20 |DCS460D |
00000030 46 49 4c 45 20 56 45 52 53 49 4f 4e 20 33 20 20 |FILE VERSION 3 |
00000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000050 30 35 31 39 39 38 20 20 20 20 20 20 20 20 20 20 |051998 |
00000060 34 36 30 2d 32 39 35 30 00 00 00 00 00 00 00 00 |460-2950........|
00000070 31 39 39 30 3a 30 31 3a 30 31 20 31 32 3a 30 32 |1990:01:01 12:02|
00000080 3a 30 37 00 5b 20 32 5d 0d 49 53 4f 3a 20 20 20 |:07.[ 2].ISO: |
00000090 20 20 20 20 20 38 30 20 20 0d 41 70 65 72 74 75 | 80 .Apertu|

hexdump -C RAW_KODAK_DCS560C.TIF | head
00000000 4d 4d 00 2a 00 00 11 76 00 04 f7 50 00 00 00 00 |MM.*...v...P....|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000040 54 68 69 73 20 69 6d 61 67 65 20 66 69 6c 65 20 |This image file |
00000050 77 61 73 20 63 72 65 61 74 65 64 20 62 79 20 61 |was created by a|
00000060 20 4b 6f 64 61 6b 20 44 43 53 35 36 30 43 20 64 | Kodak DCS560C d|
00000070 69 67 69 74 61 6c 20 63 61 6d 65 72 61 2e 20 28 |igital camera. (|
00000080 6e 75 6c 6c 29 20 20 00 00 00 00 00 00 00 00 00 |null) .........|
00000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

There is/was a website called https://raw.pixls.us/, but it has been offline since last June. The regular site still works, but the raw sub-domain is unreachable. Luckily the Wayback Machine had archived a few samples.

I also found a reference on an older website referring to a sample set maintained by Kodak for developers using the SDK, but also no longer available. You can find the old website also on the wayback machine.

With a few more samples to refer to, it is easier to understand the headers and put together a signature. There was an SDK, which seems to be difficult to locate today, but its manual does give us a little more info on the different models and their format.

So from the SDK statement, the samples I have in TIF, and others I have in the more recent DCR format, I can conclude the custom TIF format was used with the DCS 3xx, 4xx, 5xx, 6xx models and from 7xx on the DCR format was used as the camera RAW. Looking closer at the samples in TIF, we can see all the 4xx models used the “FILE VERSION 3” version of the format, while the others have the full statement in the header. Not 100% clear on which format came first, but the 4xx models are some of the earliest models.

At the time, there was only Kodak software that could properly “develop” the RAW file taken by these camera models. Today that has changed and the format has been added to many open source libraries such as libraw and rawspeed. Many other commercial products also claim to support the DCS models including Adobe Camera Raw, which seems to be able to open these TIFs.

Distinguishing these RAW TIFs is important to properly manage them over the long term. These images currently identify in the PRONOM repository as regular TIFs, fmt/353, so we would need to create a signature which identifies the standard TIFF header, but also uses bytes unique to this format. In the few samples I have, the “VERSION 3” images all start with the little-endian header, “49492A00”, while the other samples start with the big-endian header, “4D4D002A”. That makes it a little easier for each signature.

For the “VERSION 3” format we could use a pattern such as 49492A00{12}4B4F44414B{11}(444353|454F53444353). This looks for the TIFF header, skips 12 bytes, looks for the word “KODAK”, skips 11 more bytes to then look for either “DCS” or “EOSDCS” right before the camera model number.

For the other format we also look for the TIFF header, but then find the whole string used in all the samples. 4D4D002A{60}5468697320696D6167652066696C652077617320637265617465642062792061204B6F64616B20444353{5}6469676974616C2063616D6572612E

This looks for the big-endian header, then the string, “This image file was created by a Kodak DCS”, skipping the model number, then the end of the string, “digital camera.” This should catch all the different models of this format.
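As a rough way to try both patterns outside of DROID, here is a minimal Python sketch that translates them into byte regular expressions. The function name classify_kodak_tiff and the two labels it returns are my own for illustration, and the translation assumes {n} means "skip exactly n bytes" and that both patterns are anchored at the start of the file.

```python
import re


def _hex(h: str) -> bytes:
    """Turn a hex string from the signature into an escaped regex literal."""
    return re.escape(bytes.fromhex(h))


def _skip(n: int) -> bytes:
    """{n} in the signature is read here as 'skip exactly n bytes of any value'."""
    return b".{%d}" % n


# Little-endian TIFF header, "KODAK", then "DCS" or "EOSDCS".
VERSION3 = re.compile(
    _hex("49492A00") + _skip(12) + _hex("4B4F44414B") + _skip(11)
    + b"(?:" + _hex("444353") + b"|" + _hex("454F53444353") + b")",
    re.DOTALL,  # let "." match every byte value, including newlines
)

# Big-endian TIFF header, then the full statement with the model number skipped.
FULL_STATEMENT = re.compile(
    _hex("4D4D002A") + _skip(60)
    + _hex("5468697320696D6167652066696C652077617320637265617465642062792061204B6F64616B20444353")
    + _skip(5)
    + _hex("6469676974616C2063616D6572612E"),
    re.DOTALL,
)


def classify_kodak_tiff(data: bytes):
    """Return a label for the matching pattern, or None (labels are hypothetical)."""
    if VERSION3.match(data):        # match() anchors at offset 0
        return "FILE VERSION 3"
    if FULL_STATEMENT.match(data):
        return "full statement"
    return None
```

Passing the first few hundred bytes of a file to classify_kodak_tiff is enough, since both patterns end well within the first 128 bytes.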

You can find my proposed signature on my GitHub page. Since none of the samples belong to me, you can find them above in some of the links.

RealVideo

For #WDPD24 and PRONOM Hackathon week this year, I wanted to find some older formats listed in PRONOM which did not have a signature. There is a list to choose from, but I wanted to find something I hadn’t worked on before. I came across two entries for Real Video:

PUID | Name | Extension
fmt/204 | RealVideo Clip | rv
x-fmt/277 | Real Video | rv

I was familiar with Real Media and Real Audio, but had yet to come across any RealVideo with the RV extension. I thought it would be easy to find some references and samples, but that was not the case. I assume PRONOM originally added these based on MIME types available.

Real, or RealNetworks, is/was an Internet media company that jumped on the rapidly growing World Wide Web in 1995 to become a leader in Internet media delivery. Their initial offerings mainly focused on audio streaming, and they accomplished all of this by providing free players and web browser extensions to make it easy to serve up a website with streaming media everyone could enjoy. They later added video streaming optimized for the slower dial-up connections of the day. They used codecs based on common technology like H.263 and H.264, but used them to make their own proprietary codecs identified through FourCC codes, RV10-RV60.

Although I thought it would be easy to find a reference to the RV extension, I quickly discovered it wasn’t. Looking at the Wikipedia page on RealVideo, I found no reference to the RV extension. RV is an abbreviation for RealVideo, right? Well, I ended up finding a reference in the RealAudio page under file extensions. Ok, first clue to the existence of the RV extension. The page references RV as being used for video-only files and says it was used by the flagship encoder (RealProducer).

RealProducer was the tool for creating the streaming audio and video formats that could then be used for your website or streaming platform. The RealProducer software came in a Basic version, which was free, and a Plus or Pro version, which was not free and provided more options. The first version of RealProducer to make video files was version 4. I was able to find a copy of the encoder and installed it under a Windows 95 emulator. To my surprise it only saved to the RealMedia RM file format. This format is well known: it is identified in PRONOM as x-fmt/190 and is also documented at the LoC.

This was the same with RealProducer 5, 7, 8, 9, and 10, which I was able to try. None made any mention of the RV extension. I was starting to feel this format didn’t exist or that some users had decided to use the RV extension on their own. Searches on Google yielded a couple of results, mostly from users who had found a few files on their older discs and wanted to migrate them to something newer. I was able to find one example that a user had shared, but it had the same header as the RealMedia format. The clue was in the file.

hexdump -C ambush_abb.rv
00000000  2e 52 4d 46 00 00 00 12  00 01 00 00 00 00 00 00  |.RMF............|
00000010  00 07 50 52 4f 50 00 00  00 32 00 00 00 03 6e e8  |..PROP...2....n.|
00000020  00 03 6e e8 00 00 03 e0  00 00 01 b3 00 00 6a 6f  |..n...........jo|
00000030  00 06 80 fa 00 00 08 b5  00 ba 41 73 00 00 03 55  |..........As...U|
00000040  00 03 00 09 43 4f 4e 54  00 00 00 40 00 00 00 00  |....CONT...@....|
00000050  00 00 00 08 28 43 29 20  32 30 30 35 00 26 00 00  |....(C) 2005.&..|
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000270  00 09 61 75 64 69 6f 4d  6f 64 65 00 00 00 02 00  |..audioMode.....|
00000280  06 76 6f 69 63 65 00 00  00 00 2d 00 00 0d 43 72  |.voice....-...Cr|
00000290  65 61 74 69 6f 6e 20 44  61 74 65 00 00 00 02 00  |eation Date.....|
000002a0  13 39 2f 32 30 2f 32 30  30 36 20 31 34 3a 30 37  |.9/20/2006 14:07|
000002b0  3a 30 38 00 00 00 00 53  00 00 0c 47 65 6e 65 72  |:08....S...Gener|
000002c0  61 74 65 64 20 42 79 00  00 00 02 00 3a 52 65 61  |ated By.....:Rea|
000002d0  6c 50 72 6f 64 75 63 65  72 28 52 29 20 42 61 73  |lProducer(R) Bas|
000002e0  69 63 20 31 31 2e 30 20  66 6f 72 20 57 69 6e 64  |ic 11.0 for Wind|
000002f0  6f 77 73 2c 20 42 75 69  6c 64 20 31 31 2e 30 2e  |ows, Build 11.0.|
00000300  30 2e 32 30 30 39 00 00  00 00 31 00 00 11 4d 6f  |0.2009....1...Mo|
00000310  64 69 66 69 63 61 74 69  6f 6e 20 44 61 74 65 00  |dification Date.|
00000320  00 00 02 00 13 39 2f 32  30 2f 32 30 30 36 20 31  |.....9/20/2006 1|
00000330  34 3a 30 37 3a 30 38 00  00 00 00 1d 00 00 09 76  |4:07:08........v|
00000340  69 64 65 6f 4d 6f 64 65  00 00 00 02 00 07 6e 6f  |ideoMode......no|
00000350  72 6d 61 6c 00 44 41 54  41 00 ba 3e 1e 00 00 00  |rmal.DATA..>....|

RealProducer Basic 11 for Windows. The Wikipedia article did hint at this by saying “the latest version of RealProducer reverted to using .ra for audio only files and began using .rv for video files with or without audio.” Why would they use the RM extension for so long, then revert to a different extension with a later version? I found more in the User Manual for version 11.

• .rv – RealVideo
RealProducer uses the .rv file extension if the input is video-only or video-with-audio. You can also select the .rm file extension for video content.
Tip: Using the .rv file extension helps search engines identify the file as a RealVideo clip.

• .rm – RealAudio or RealVideo
RealProducer chooses the .rm file extension if it cannot determine the content of the input clip. You can use .rm file extension for any RealAudio or RealVideo clip, except for variable bit-rate clips.

Ok, so a few things to learn from this. One is the RV extension was used as the default for version 11 as they wanted search engines to identify them as a RealVideo clip. Second thing we learned is there is no difference between the two placeholders in PRONOM, one being a RealVideo file and the other being a RealVideo Clip. We don’t need both.

Now, is there any difference between an RV and RM file?

hexdump -C Producer11-01.rv | head
00000000 2e 52 4d 46 00 00 00 12 00 01 00 00 00 00 00 00 |.RMF............|
00000010 00 07 50 52 4f 50 00 00 00 32 00 00 00 03 6e e8 |..PROP...2....n.|
00000020 00 03 6e e8 00 00 03 e0 00 00 01 c7 00 00 01 66 |..n............f|
00000030 00 00 1b 57 00 00 07 41 00 02 91 0a 00 00 03 5e |...W...A.......^|
00000040 00 03 00 09 43 4f 4e 54 00 00 00 40 00 00 00 00 |....CONT...@....|
00000050 00 00 00 08 28 43 29 20 32 30 30 35 00 26 00 00 |....(C) 2005.&..|
00000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000080 00 00 00 00 4d 44 50 52 00 00 00 70 00 00 00 00 |....MDPR...p....|
00000090 00 02 c2 a4 00 02 c2 a4 00 00 03 e0 00 00 01 9f |................|

hexdump -C Producer11-01.rm | head
00000000 2e 52 4d 46 00 00 00 12 00 01 00 00 00 00 00 00 |.RMF............|
00000010 00 07 50 52 4f 50 00 00 00 32 00 00 00 03 6e e8 |..PROP...2....n.|
00000020 00 03 6e e8 00 00 03 e0 00 00 01 a4 00 00 01 64 |..n............d|
00000030 00 00 1b 57 00 00 05 a4 00 02 5c 35 00 00 03 5e |...W......\5...^|
00000040 00 03 00 09 43 4f 4e 54 00 00 00 40 00 00 00 00 |....CONT...@....|
00000050 00 00 00 08 28 43 29 20 32 30 30 35 00 26 00 00 |....(C) 2005.&..|
00000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000080 00 00 00 00 4d 44 50 52 00 00 00 70 00 00 00 00 |....MDPR...p....|
00000090 00 02 c2 a4 00 02 c2 a4 00 00 03 e0 00 00 01 a4 |................|

They both look very similar to me. Aside from a few bytes, they are practically identical. Let’s see what MediaInfo has to say.

mediainfo Producer11-01.rv
General
Complete name : Producer11-01.rv
Format : RealMedia
File size : 164 KiB
Duration : 6 s 999 ms
Overall bit rate : 225 kb/s
Frame rate : 24.000 FPS
Copyright : (C) 2005
FileExtension_Invalid : rm rmvb ra

Video
ID : 0
Format : RealVideo 4
Codec ID : RV40
Codec ID/Info : Based on AVC (H.264), Real Player 9
Duration : 6 s 999 ms
Bit rate : 181 kb/s
Width : 640 pixels
Height : 424 pixels
Display aspect ratio : 3:2
Frame rate : 24.000 FPS
Bits/(Pixel*Frame) : 0.028
Stream size : 155 KiB (94%)

Audio
ID : 1
Format : Cooker
Codec ID : cook
Codec ID/Info : Based on G.722.1, Real Player 6
Duration : 7 s 429 ms
Bit rate : 44.1 kb/s
Channel(s) : 2 channels
Sampling rate : 44.1 kHz
Bit depth : 16 bits
Stream size : 40.0 KiB (24%)

mediainfo Producer11-01.rm
General
Complete name : Producer11-01.rm
Format : RealMedia
File size : 151 KiB
Duration : 6 s 999 ms
Overall bit rate : 225 kb/s
Frame rate : 24.000 FPS
Copyright : (C) 2005

Video
ID : 0
Format : RealVideo 4
Codec ID : RV40
Codec ID/Info : Based on AVC (H.264), Real Player 9
Duration : 6 s 999 ms
Bit rate : 181 kb/s
Width : 640 pixels
Height : 424 pixels
Display aspect ratio : 3:2
Frame rate : 24.000 FPS
Bits/(Pixel*Frame) : 0.028
Stream size : 155 KiB

Audio
ID : 1
Format : Cooker
Codec ID : cook
Codec ID/Info : Based on G.722.1, Real Player 6
Bit rate : 44.1 kb/s
Channel(s) : 2 channels
Sampling rate : 44.1 kHz
Bit depth : 16 bits

Other than the RV file having an invalid file extension, they both identify as a RealMedia file and have identical properties. So it seems the RV file is really no different than the RM file. I think the best course of action for PRONOM is to deprecate these two RV PUIDs and just add RV as an acceptable extension for the RealMedia format.
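The hexdumps earlier also suggest a simple chunk layout: a four character tag followed by a big-endian 32-bit size that includes the tag and size fields themselves. That reading is my assumption from the dumps, not something taken from a spec quoted here, but a small Python sketch that walks the chunks under that assumption lands on each tag exactly. The name walk_rm_chunks is hypothetical.

```python
import struct


def walk_rm_chunks(data: bytes):
    """Yield (offset, chunk_id, size) for each top-level chunk header found.

    Assumes each chunk starts with a 4-byte tag and a big-endian uint32 size
    that counts the whole chunk, including the 8 header bytes.
    """
    offset = 0
    while offset + 8 <= len(data):
        chunk_id, size = struct.unpack_from(">4sI", data, offset)
        yield offset, chunk_id.decode("latin-1"), size
        if size < 8:  # a size smaller than the header would never advance
            break
        offset += size


# First 0x50 bytes of Producer11-01.rv, copied from the hexdump above.
sample = bytes.fromhex(
    "2e 52 4d 46 00 00 00 12 00 01 00 00 00 00 00 00 "
    "00 07 50 52 4f 50 00 00 00 32 00 00 00 03 6e e8 "
    "00 03 6e e8 00 00 03 e0 00 00 01 c7 00 00 01 66 "
    "00 00 1b 57 00 00 07 41 00 02 91 0a 00 00 03 5e "
    "00 03 00 09 43 4f 4e 54 00 00 00 40 00 00 00 00"
)

for offset, chunk_id, size in walk_rm_chunks(sample):
    print(f"0x{offset:04x} {chunk_id} {size} bytes")
```

For these bytes the walker reports .RMF at offset 0, PROP at 0x12 and CONT at 0x44, and the CONT size of 0x40 points to 0x84, which is where MDPR appears in the dump. Running the same walk over the RM and RV samples is a quick way to confirm they share the same structure.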

To add to the evidence, here is the output from ffprobe:

Input #0, rm, from 'Producer11-01.rm':
Metadata:
copyright : (C) 2005
comment :
ASMRuleBook : #($Bandwidth >= 0),Stream1Bandwidth = 44100, Stream0Bandwidth = 180900;
Audiences : 256k DSL or Cable;
audioMode : music
Creation Date : 11/12/2024 20:28:55
Generated By : RealProducer(R) Plus 11.1 for Windows, Build 11.1.0.2676
Modification Date: 11/12/2024 20:28:55
videoMode : normal
Duration: 00:00:07.00, start: 0.000000, bitrate: 176 kb/s
Stream #0:0: Video: rv40 (RV40 / 0x30345652), yuv420p, 640x424, 180 kb/s, 24 fps, 24 tbr, 1k tbn
Stream #0:1: Audio: cook (cook / 0x6B6F6F63), 44100 Hz, stereo, fltp, 44 kb/s

Input #0, rm, from 'Producer11-01.rv':
Metadata:
copyright : (C) 2005
comment :
ASMRuleBook : #($Bandwidth >= 0),Stream1Bandwidth = 44100, Stream0Bandwidth = 180900;
Audiences : 256k DSL or Cable;
audioMode : music
Creation Date : 11/12/2024 20:28:16
Generated By : RealProducer(R) Plus 11.1 for Windows, Build 11.1.0.2676
Modification Date: 11/12/2024 20:28:16
videoMode : normal
Duration: 00:00:07.43, start: 0.000000, bitrate: 181 kb/s
Stream #0:0: Video: rv40 (RV40 / 0x30345652), yuv420p, 640x424, 180 kb/s, 24 fps, 24 tbr, 1k tbn
Stream #0:1: Audio: cook (cook / 0x6B6F6F63), 44100 Hz, stereo, fltp, 44 kb/s

But wait, there are a couple formats we could add which are related to RealProducer. RealProducer used a few other formats to manage projects and other metadata for streaming. They include:

  • .RP RealPix Image
  • .RT RealText
  • .RPAD RealProducer Audience File
  • .RPJF RealProducer Job File
  • .RPSD RealProducer Server Destination
  • .RMHD RealMediaHD file
  • .RAM Playlist
  • .RPM Embedded RAM

File Type | Extension | MIME Type
Ram | .ram | audio/x-pn-realaudio
Embedded Ram | .rpm | audio/x-pn-realaudio-plugin
SMIL | .smil and .smi | application/smil
RealAudio | .ra | audio/x-pn-realaudio
RealVideo | .rm | application/x-pn-realmedia
Flash | .swf | application/x-shockwave-flash
RealPix | .rp | image/vnd.rn-realpix
RealText | .rt | text/vnd.rn-realtext
https://web.archive.org/web/20120513203726/http://service.real.com/help/library/guides/production8/htmfiles/server.htm

Don’t get excited, the RealPix Image format really isn’t an image, it is simply an XML file with all the details of an image or group of images. Pretty boring. It was however a big thing in the day, even got a full guide written up for the process. “All information in the file occurs between an opening <imfl> tag and a closing </imfl> tag. This is the only tag that uses an end tag.” This format was the topic of discussion as malicious code could be in the RP file and executed just by having someone load your webpage. IMFL is obviously an acronym, but none of the documents I could find tells me what it stands for, so I did what everyone does now, I asked ChatGPT.

The RealPix format by RealNetworks, which was used for interactive multimedia content, indeed utilized IMFL as its tagged format. IMFL stands for “Interleaved Media File Language.” This markup was particularly designed to handle multimedia presentations, allowing the synchronization of images, audio, and video in a slideshow-style format. It used XML-like syntax where elements like <imfl>, <head>, and <fadein/> defined media objects, transitions, and their timing. Key components included attributes for positioning, color, and animation effects, making RealPix a flexible format for creating multimedia sequences compatible with RealPlayer.

For technical details, the RealPix format closely resembles SMIL (Synchronized Multimedia Integration Language) and supports strict tag closure and case sensitivity. This means all tags and attribute names must be lowercase, and attributes must be in double quotes, as seen in SMIL and RealSystem G2 markup, RealNetworks’ broader multimedia framework.

When I asked for a source, it could not give me one. So not sure if it is the correct answer, but it seems to fit. Here are some samples of RP, RT and SMIL files.

For RealText with the RT extension, we find a similar tagged text. This format is used to provide text presentations to go along with Images, Audio, or Video. The tagged text then describes when and how the text is displayed. This is all done in a player window, therefore the root tag of these RT documents starts and ends with <window>. I guess these could be considered a subtitle format for streaming media.

The SMIL files are interesting. SMIL is a known standard, but in many cases these files do not have an XML declaration and are therefore not identified by the current PRONOM signature. They are used to link everything together. I might suggest a variant of the SMIL signature that does not require the XML declaration so these files are identified correctly.

<smil>
<body>
<par>
<textstream src="rtsp://realserver.company.com/mary.rt"/>
<video src="rtsp://realserver.company.com/mary.rm"/>
</par>
</body>
</smil>
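Since RealPix, RealText and SMIL files are each described above by their root tag (<imfl>, <window> and <smil>), a first pass at telling them apart could simply look for the first element name while tolerating a missing XML declaration. The following Python sketch does that. The function name guess_real_markup, the labels, and the 4096-byte window are all my own choices, not anything from PRONOM.

```python
import re

# Root tags mentioned in the text, mapped to hypothetical labels.
ROOT_TAGS = {
    b"imfl": "RealPix (.rp)",
    b"window": "RealText (.rt)",
    b"smil": "SMIL (.smil/.smi)",
}

# Skip an optional BOM, XML declaration, comments and DOCTYPE,
# then capture the name of the first element.
_FIRST_TAG = re.compile(
    rb"\A(?:\xef\xbb\xbf)?\s*"
    rb"(?:<\?xml.*?\?>\s*)?"
    rb"(?:(?:<!--.*?-->|<!DOCTYPE[^>]*>)\s*)*"
    rb"<([A-Za-z_][\w.\-]*)",
    re.DOTALL,
)


def guess_real_markup(data: bytes):
    """Return a label based on the first element name, or None if unknown."""
    match = _FIRST_TAG.match(data[:4096])
    if match is None:
        return None
    # Lowercasing is a leniency of this sketch; the samples use lowercase tags.
    return ROOT_TAGS.get(match.group(1).lower())
```

This is only a rough filter. A tag like <window> is generic enough that a real signature would probably want a second anchor, such as a closing tag or a known attribute.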

The .RPAD RealProducer Audience File, .RPJF RealProducer Job File, .RPSD RealProducer Server Destination are all XML files for managing some of the configuration found in the RealProducer software.

cat 56k\ Dial-up.rpad
<?xml version="1.0" encoding="UTF-8"?>
<audience xmlns="http://ns.real.com/tools/audience.2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ns.real.com/tools/audience.2.0 http://ns.real.com/tools/audience.2.0.xsd">
<avgBitrate type="uint">34000</avgBitrate>
<maxBitrate type="uint">68000</maxBitrate>
<streams>

cat RealProducer11-01.rpjf
<?xml version="1.0" encoding="UTF-8"?>
<job xmlns="http://ns.real.com/tools/job.2.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://ns.real.com/tools/job.2.0 http://ns.real.com/tools/job.2.0.xsd">
<enableTwoPass type="bool">true</enableTwoPass>
<clipInfo>

cat Multicast\ Push\ Server.rpsd
<?xml version="1.0" encoding="UTF-8"?>
<destination xsi:type="pushServer" xmlns="http://ns.real.com/tools/server.2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ns.real.com/tools/server.2.0 http://ns.real.com/tools/server.2.0.xsd">
<pluginName type="string">rn-server-rbs</pluginName>

Those three formats should be easy enough, especially if we look for Namespace urls.
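As a minimal sketch of that idea, the following Python looks for the default namespace declaration near the start of the file. The three namespace URLs come straight from the samples above, while the function name guess_realproducer_xml, the labels and the 1024-byte window are assumptions of mine.

```python
# Default namespaces seen in the .rpad, .rpjf and .rpsd samples.
NAMESPACES = {
    b"http://ns.real.com/tools/audience.2.0": "RealProducer Audience File (.rpad)",
    b"http://ns.real.com/tools/job.2.0": "RealProducer Job File (.rpjf)",
    b"http://ns.real.com/tools/server.2.0": "RealProducer Server Destination (.rpsd)",
}


def guess_realproducer_xml(data: bytes, window: int = 1024):
    """Return a label if a known xmlns declaration appears in the first bytes."""
    head = data[:window]
    for namespace, label in NAMESPACES.items():
        # Match the xmlns attribute itself, not the xsi:schemaLocation value.
        if b'xmlns="' + namespace + b'"' in head:
            return label
    return None
```

Matching on the xmlns attribute rather than the root tag name avoids false hits on generic tags such as <job> or <destination> in unrelated XML.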

The RAM and RPM formats are simply text files with a URL. You can find some samples here and here.

RM and RV files are the same format as the RMVB file, with RMVB simply being the variable bitrate variant. Later on a new format was used to improve the quality of video. This format has the extension RMHD, referring to RealMedia HD. Let’s take a look.

hexdump -C DSC_0009.rmhd | head
00000000 2e 52 4d 50 00 00 00 12 00 01 00 00 00 00 00 00 |.RMP............|
00000010 00 07 50 52 4f 50 00 00 00 36 00 02 00 04 f7 33 |..PROP...6.....3|
00000020 00 04 f7 33 00 00 11 bd 00 00 02 5d 00 00 01 d2 |...3.......]....|
00000030 00 00 1b 2e 00 00 00 00 00 00 00 00 00 04 65 68 |..............eh|
00000040 00 00 01 6f 00 02 00 03 43 4f 4e 54 00 00 00 12 |...o....CONT....|
00000050 00 00 00 00 00 00 00 00 00 00 4d 44 50 52 00 00 |..........MDPR..|
00000060 00 76 00 00 00 00 00 03 24 64 00 03 24 64 00 00 |.v......$d..$d..|
00000070 11 bd 00 00 04 2a 00 00 00 00 00 00 00 00 00 00 |.....*..........|
00000080 1b 2e 0c 56 69 64 65 6f 20 53 74 72 65 61 6d 14 |...Video Stream.|
00000090 76 69 64 65 6f 2f 78 2d 70 6e 2d 72 65 61 6c 76 |video/x-pn-realv|

The format looks very similar, but has the magic header of .RMP instead of .RMF. MediaInfo and FFProbe are unaware of the format. The software mentions an RV11 codec, which is confusing as the codecs went from RV10 to RV60.

Phew, that was a lot considering the two formats I tried to research came up the same as an existing format. There are probably others I have missed. I did see a reference to an RMX format which seems to be an encrypted RM file. The header is the same so it will identify as a RealMedia file, but with the wrong extension. Let me know if you come across any. I have some samples of the formats mentioned here, plus a proposal of new signatures on my Github repository.

PAR

Some file formats have a unique extension. Some formats use three character extensions which are well known, so it’s not common for them to be used with other software. Take the extension PDF for example, pretty sure no one else will use it as it is so well known. Other extensions often get reused by a few different software titles. There are plenty of titles which use the DOC extension.

Part of defining a file format I come across is also defining other formats which use the same extension or the same basic patterns within the format. I want the format I am researching to be identified correctly, but I also don’t want other formats to falsely identify as them either.

When using the DROID tool, if a file can’t be identified using a signature, the tool will then look to see if the extension matches any formats within the PRONOM registry. If it finds one, it will identify the file as that format with the identification method listed as “Extension”. This can be confusing and dangerous.

The topic of a format came up recently in reference to the extension PAR. Let’s take a look at what we know about files with the extension PAR. Using the handy tool at digipres.org, we can see there are many formats using the PAR extension.

Apparently many people like to use the extension with their software. One might think their files with the PAR extension have to be in this list, and they would be wrong in that assumption. The PRONOM registry has no records of any format using the PAR extension. Hopefully we can add a few to help with proper identification instead of using the extension only.

A PArchive or Parity Volume Set is a group of file formats used in error correction and data integrity. Only the first version used the PAR extension. It is now obsolete, with version 2 being the last stable version.

hexdump -C archive.par | head
00000000 50 41 52 00 00 00 00 00 00 00 01 00 00 09 00 02 |PAR.............|
00000010 8f d0 ce 2e 21 db 3b e5 41 d5 18 be d3 0e 52 f0 |....!.;.A.....R.|
00000020 de b6 b3 9f 53 09 ff ba 16 6b ca d2 48 a6 ca 45 |....S....k..H..E|
00000030 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 |................|
00000040 60 00 00 00 00 00 00 00 4e 00 00 00 00 00 00 00 |`.......N.......|
00000050 ae 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000060 4e 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 |N...............|
00000070 45 16 01 00 00 00 00 00 76 da 44 2b 43 5f b5 bd |E.......v.D+C_..|
00000080 08 7b d2 b0 2e 16 7d 86 46 75 7b 79 f0 36 75 3b |.{....}.Fu{y.6u;|
00000090 a1 14 22 f3 0c 77 85 3c 70 00 61 00 72 00 2d 00 |.."..w.<p.a.r.-.|

hexdump -C Testing.docx.par2 | head
00000000 50 41 52 32 00 50 4b 54 84 00 00 00 00 00 00 00 |PAR2.PKT........|
00000010 76 1f e0 a4 5a 32 e0 84 d9 e9 32 32 06 9f 03 ff |v...Z2....22....|
00000020 71 48 73 d5 59 c6 ae 7c c7 21 3d ba 8d e5 ea 04 |qHs.Y..|.!=.....|
00000030 50 41 52 20 32 2e 30 00 46 69 6c 65 44 65 73 63 |PAR 2.0.FileDesc|
00000040 5d 74 b5 3d 64 ae 1f d8 ae 41 f1 8c 2f 7a cc c1 |]t.=d....A../z..|
00000050 27 9b bc 61 46 21 4d 37 a3 c7 f2 07 b4 b8 df 81 |'..aF!M7........|

Pretty straightforward. The only thing that would have made it easier is if the first version used “PAR1”, but be glad they didn’t as that signature is used by another!

hexdump -C null_list.parquet | head
00000000 50 41 52 31 15 00 15 18 15 18 2c 15 02 15 00 15 |PAR1......,.....|
00000010 06 15 06 00 00 02 00 00 00 02 00 02 00 00 00 02 |................|
00000020 01 26 42 1c 15 02 19 25 00 06 19 38 09 65 6d 70 |.&B....%...8.emp|
00000030 74 79 6c 69 73 74 04 6c 69 73 74 04 69 74 65 6d |tylist.list.item|
00000040 15 00 16 02 16 3a 16 3a 26 08 3c 36 02 00 00 00 |.....:.:&.<6....|
00000050 15 02 19 4c 48 0c 61 72 72 6f 77 5f 73 63 68 65 |...LH.arrow_sche|
00000060 6d 61 15 02 00 35 02 18 09 65 6d 70 74 79 6c 69 |ma...5...emptyli|
00000070 73 74 15 02 15 06 4c 3c 00 00 00 35 04 18 04 6c |st....L<...5...l|
00000080 69 73 74 15 02 00 15 02 25 02 18 04 69 74 65 6d |ist.....%...item|
00000090 6c bc 00 00 00 16 02 19 1c 19 1c 26 42 1c 15 02 |l..........&B...|

Apache Parquet is a more modern format used to store column-oriented data. At least they used a unique file extension!

Another common bit of software which uses the PAR extension is Solid Edge by Siemens. They use the PAR extension to encode their 3D parts format. For some reason this format still uses the OLE compound object container.

7z l tinyscrew.par 

Path = tinyscrew.par
Type = Compound
Physical Size = 86528
Extension = compound
Cluster Size = 512
Sector Size = 64

Date Time Attr Size Compressed Name
------------------- ----- ------------ ------------ ------------------------
..... 31964 32256 PSMcluster0
..... 12 64 Versions
2001-12-19 15:44:14 D.... Display
2001-12-19 15:44:14 D.... ACIS
..... 8462 8704 ACIS/Solid1.sab
..... 238 256 PSMroots
2001-12-19 15:44:14 D.... Display/Cache0
2001-12-19 15:44:14 D.... Display/Styles
..... 1725 1728 Display/Styles/Library0
..... 12 64 Display/Styles/DefaultStyles
..... 88 128 Display/Cache0/Info
..... 4248 4608 Display/Cache0/L1-T1
..... 8 64 JSitesList
2001-12-19 15:44:14 D.... PARASOLID
..... 3389 3392 PARASOLID/STREAM434.D_B
..... 10402 10752 PARASOLID/STREAM434.P_B
..... 4 64 DocVersion2
..... 199 256 PSMclustertable
..... 8 64 PSMuserroots
..... 512 512 JVisibleData
2001-12-19 15:44:14 D.... PSMspacemap
..... 66 128 PSMspacemap/0x00002000
..... 6090 6144 PSMspacemap/0x00000000
..... 174 192 PSMspacemap/0x00004000
..... 4716 5120 PSMtypetable
..... 8 64 FamilyMembers
..... 8 64 BuildVersions
..... 150 192 PartsLiteData
..... 596 640 [5]C3teagxwOttdbfkuIaamtae3Ie
..... 476 512 [5]SummaryInformation
..... 12 64 PSMsegmenttable
..... 96 128 MSConvertedPropertyset
..... 148 192 [5]K4teagxwOttdbfkuIaamtae3Ie
..... 280 320 [5]DocumentSummaryInformation
..... 116 128 [5]SszbwomgY1udb2whAaq5u2jwCg
..... 264 320 [5]Rfunnyd1AvtdbfkuIaamtae3Ie
..... 140 192 Dynamic Attributes Metadata
..... 458 512 Unclustered Dynamic Attributes
------------------- ----- ------------ ------------ ------------------------
2001-12-19 15:44:14 75069 77824 32 files, 6 folders

We will have to use a container signature to correctly identify this format. The ASM and DFT formats are also Solid Edge formats and use the same OLE container. Hopefully there are some unique features we can use to identify them.

One other file format which uses the PAR extension is not listed in any of the registries. Not in PRONOM, TrID, Wikidata, or others. I came across it while researching another format, DVD Studio Pro. On a Macintosh computer running the now discontinued DVD Studio Pro, one could save their DVD mastering project as a “file” which used the DSPPROJ extension. I use the term file loosely here as it wasn’t actually a file, it was a folder with an extension which macOS would interpret as a single file. These are the package formats Apple used and still uses quite frequently. Moving this folder to another system results in a folder of content.

tree sample.dspproj 
/sample.dspproj
└── Contents
    ├── PkgInfo
    └── Resources
        ├── Audio
        ├── MPEG
        ├── Menu
        ├── ModuleDataB
        ├── ObjectDataB
        ├── Openers.plist
        ├── Overlay
        ├── Picture
        ├── Render Data
        │   ├── C4272B0100797459.M2V
        │   └── PAR
        │       └── C4272B0100797459.M2V.par
        ├── Styles
        ├── Temp
        ├── Templates
        └── Thumbnails

14 directories, 6 files

This PAR extension is explained in the DVD Studio Pro manual:

About the Parse Files
To use an asset in a project, DVD Studio Pro needs to know some general information about it, such as its length, type, and integrity. Video assets encoded within DVD Studio Pro can include this information in the encoded files, or can create separate files for it. Assets encoded by Compressor outside of DVD Studio Pro can include this information if you select the “Add DVD Studio Pro meta-data” option in the Extras pane of the Encoder settings.
Assets encoded with other encoders, or with the “Add DVD Studio Pro meta-data” option disabled when using Compressor, must be parsed before DVD Studio Pro can use them. Parsing creates a small file, with the same name as the video asset and a “.par” extension that contains the required information. The parse file can take from several seconds to several minutes to create, depending on the size of the asset file.

hexdump -C E4712E541A60E300.M2V.par | head
00000000 56 50 41 52 00 00 00 20 00 00 00 00 00 01 e2 40 |VPAR... .......@|
00000010 00 00 00 00 00 c6 19 7c 2f 55 73 65 72 73 2f 74 |.......|/Users/t|
00000020 79 6c 65 72 2f 44 6f 63 75 6d 65 6e 74 73 2f 46 |yler/Documents/F|
00000030 69 6e 61 6c 20 52 65 6e 64 65 72 20 66 6f 72 20 |inal Render for |
00000040 44 56 44 20 56 51 42 2f 56 61 72 73 69 74 79 51 |DVD VQB/VarsityQ|
00000050 42 20 44 56 44 2f 56 61 72 73 69 74 79 51 42 2d |B DVD/VarsityQB-|
00000060 44 69 73 63 32 2e 64 73 70 70 72 6f 6a 2f 43 6f |Disc2.dspproj/Co|
00000070 6e 74 65 6e 74 73 2f 52 65 73 6f 75 72 63 65 73 |ntents/Resources|
00000080 2f 52 65 6e 64 65 72 20 44 61 74 61 2f 45 34 37 |/Render Data/E47|
00000090 31 32 45 35 34 31 41 36 30 45 33 30 30 2e 4d 32 |12E541A60E300.M2|

Parity, Parts, and Parse files, oh my.
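To see how these would sort out in practice, here is a small Python sketch that checks the leading bytes against the magic values visible in the hexdumps above. The pattern lengths are my own choice based on those dumps, the labels and the name sniff_par are hypothetical, and the OLE compound file header is not shown in the 7z listing, so that entry relies on the well-known compound file magic rather than on a sample from this post.

```python
# (magic bytes at offset 0, label). Labels are illustrative only.
SIGNATURES = [
    (b"PAR2\x00PKT", "Parity Volume Set 2 packet (.par2)"),
    (b"PAR\x00\x00\x00\x00\x00", "Parity Volume Set 1 (.par)"),
    (b"PAR1", "Apache Parquet (.parquet)"),
    (b"VPAR", "DVD Studio Pro parse file (.par)"),
    # Generic OLE compound file header; Solid Edge would still need a
    # container signature to separate PAR from ASM, DFT and other OLE files.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE compound file (Solid Edge candidate)"),
]


def sniff_par(data: bytes):
    """Return the label of the first matching magic value, or None."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None
```

The fourth byte is what keeps the three "PAR" formats apart: 0x00 for the first Parity version, "2" for PAR2 and "1" for Parquet.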

If you thought we were done, you would be wrong! Let’s look at yet another PAR format.

hexdump -C MESSROH.PAR | head
00000000 08 69 64 73 32 30 30 30 30 d0 4e 01 51 46 42 00 |.ids20000.N.QFB.|
00000010 98 d0 4e 01 80 01 58 01 b6 b9 f7 bf 82 30 00 00 |..N...X......0..|
00000020 dc 08 00 00 60 51 f2 bf 82 30 01 59 ff ff ff ff |....`Q...0.Y....|
00000030 a4 d0 4e 01 28 3e f2 bf 78 63 a4 01 dc 08 00 0b |..N.(>..xc......|
00000040 5a 45 52 4f 2d 4f 46 46 53 45 54 01 18 0e ac 01 |ZERO-OFFSET.....|
00000050 d4 d0 4e 01 00 ac 43 00 18 0e ac 01 d4 d0 4e 01 |..N...C.......N.|
00000060 51 46 42 00 ec d0 4e 01 d4 00 4e 01 b6 b9 f7 bf |QFB...N...N.....|
00000070 5c 4c 75 81 5c 81 00 00 45 07 41 00 c0 0a 00 01 |\Lu.\...E.A.....|
00000080 cd d0 41 00 d5 d0 41 00 5c 81 00 00 dc 0a a4 01 |..A...A.\.......|
00000090 5b 5d 42 00 cc d0 4e 01 72 5d 42 00 7a 5d 42 00 |[]B...N.r]B.z]B.|

hexdump -C DUMMYDAT.PAR | head
00000000 08 73 65 69 73 6d 69 63 31 00 00 00 00 00 00 00 |.seismic1.......|
00000010 00 00 00 00 00 01 58 00 00 00 00 00 00 00 00 00 |......X.........|
00000020 00 00 00 00 00 00 00 00 00 00 01 59 00 00 00 00 |...........Y....|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0a |................|
00000040 41 4b 55 53 54 49 4b 4c 4f 47 00 00 00 00 00 00 |AKUSTIKLOG......|
00000050 00 00 00 00 02 2f 2f 00 08 41 47 43 2d 47 41 49 |.....//..AGC-GAI|
00000060 4e 00 00 00 00 00 00 00 00 00 00 00 00 32 00 00 |N............2..|
00000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

This PAR format is called “Reflexw data-format”. It is a raw header file that is always paired with a DAT file, and together they store geophysical wave data from devices such as GPR. Reflexw is software made by Sandmeier geophysical research.

The PAR file samples I have don’t seem to have a consistent header, as each has a unique set of bytes, but all of them have some similar bytes later in the file at around the 0x1D8 (472) offset:

000001d0  00 00 a0 3d 00 00 a0 41  00 00 00 00 00 00 00 00  |...=...A........|
000001e0 0a d7 23 3c 00 00 80 3f 00 00 00 00 00 00 00 00 |..#<...?........|
000001f0 00 00 00 00 cc cc dc 40 00 00 00 00 00 00 00 00 |.......@........|
00000200 00 00 80 3f 00 00 00 00 00 00 00 00 00 00 00 00 |...?............|
00000210 00 00 00 00 00 00 00 00 17 b7 d1 38 00 00 00 00 |...........8....|

It seems this sequence of bytes is the only consistent run among all my samples. I have no idea what they mean or reference. The specification does indicate some bytes which should lead to proper identification, but the integer used for the “HeaderMarker” is a 4-byte “00 00 00 01”, which won’t be enough to cleanly identify the format. I would love to hear what others can see from the spec. You can find some sample files here.
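One possible reading of the bytes around 0x1D0, and this is only a guess that would need checking against the spec, is that they are little-endian 32-bit IEEE 754 floats. A short Python sketch to decode the rows from the dump under that assumption (the name decode_row is hypothetical):

```python
import struct

# Rows copied from the hexdump above, keyed by their file offset.
ROWS = {
    0x1D0: "0000a03d 0000a041 00000000 00000000",
    0x1E0: "0ad7233c 0000803f 00000000 00000000",
    0x1F0: "00000000 ccccdc40 00000000 00000000",
    0x200: "0000803f 00000000 00000000 00000000",
    0x210: "00000000 00000000 17b7d138 00000000",
}


def decode_row(offset: int, hex_row: str):
    """Decode 16 bytes as four little-endian float32 values, dropping zeros."""
    values = struct.unpack("<4f", bytes.fromhex(hex_row))
    return [(offset + 4 * i, v) for i, v in enumerate(values) if v != 0.0]


for offset, hex_row in ROWS.items():
    for position, value in decode_row(offset, hex_row):
        print(f"0x{position:03x}: {value:g}")
```

Decoded this way the non-zero values come out as round-looking numbers such as 20, 1, and roughly 0.01 and 6.9, which at least hints that they are parameters rather than a marker. If so, they would vary between surveys and make a poor signature on their own.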

So we have some Parity files, Parts files, Parse files, Parquet files, and a Header file. I am sure others will be found and added to this lot. Hopefully the PAR files you run across will match one of these patterns! I am still working on a signature proposal. Stay Tuned!

Daisy

A single file can often be self contained, having all that is needed to render itself with the correct software, but more and more often files need other files to function properly. Sometimes these groups of dependent files are within a container, such as a DOCX or ePub, but can also be found all sitting nicely in a folder. I say nicely, partly because the structure works, that is until they are treated as individual files and renamed or moved around breaking that interdependence on each other.

In the case of many Apple bundle files, they appear to be a single file when used on macOS, but as a folder on Windows or Linux. This can be very confusing. In other cases such as the DAISY Digital Talking Book format, it is simply a folder or disc with a few or many files within.

Current tools used to identify file formats, such as DROID, look at individual files, not groups of files to determine format. Each file within a folder may have a unique format, but when grouped with other specific formats they become something more. We will have to work on enhancing current tools if we want to avoid breaking these format types and losing their ability to render properly.

DAISY, or Digital Accessible Information System, is a type of digital book. The format was originally conceived in 1988 as a method to create a talking book, designed to give those who are visually impaired the ability to listen to books. It wasn’t until 1996 that the DAISY Consortium was created in order to take the technology to those who needed it. The original version of the DAISY format in 1994 was proprietary, but once the consortium was formed, they decided to adopt open standards for the format, and in 1998 the DAISY 2.0 standard was released. You can read more on the Library of Congress Format Description page.

Let’s take a look at a folder containing a DAISY 2.0 book.

ls -la "DAISY 2.02 export"
total 536
drwx------ 1 tyler staff 16384 Sep 25 22:06 .
drwx------ 1 tyler staff 16384 Sep 25 22:06 ..
-rwx------@ 1 tyler staff 1090 Sep 25 22:05 0002.smil
-rwx------ 1 tyler staff 228413 Sep 25 22:05 aud0001.mp3
-rwx------@ 1 tyler staff 672 Sep 25 22:05 master.smil
-rwx------ 1 tyler staff 1703 Sep 25 22:05 ncc.html

We can see three different formats in this folder: the well-known MP3, an HTML file, and two files with the extension SMIL.

“Synchronized Multimedia Integration Language” or SMIL is a W3C XML standard used to describe multimedia presentations, and it is in its third version. It is used in the DAISY DTB as well as other applications, but we will focus on DAISY. A SMIL file has this structure:

<?xml version="1.0"?>
<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 1.0//EN" "http://www.w3.org/TR/REC-smil/SMIL10.dtd">
<smil>
<head>
<meta name="dc:title" content="Obi Project" />
<meta name="dc:identifier" content="589c550e-303b-4c0d-9921-ae76d782fd53" />
<meta name="ncc:generator" content="Obi v5.0.0.0 with toolkit: UrakawaSDK.core v2.0.0.0 (http://urakawa.sf.net/obi)" />
<meta name="dc:format" content="Daisy 2.02" />
<meta name="ncc:timeInThisSmil" content="00:00:28" />
<layout>
<region id="textView" />
</layout>
</head>
<body>
<ref title="Testing" src="0002.smil" id="ms_0002" />
</body>
</smil>

This is a standard XML file with a link to a SMIL DTD and a root tag of <smil>. This format is recognized by PRONOM as fmt/205, although it is often identified as a standard XML file. It seems the signature was created with a small offset which works with some SMIL files, but the gap it allows between the end of the XML declaration and the start of the <smil> tag is only 20-86 bytes, not enough to allow for different character sets and full DTD URLs. We will have to increase this gap in order to get all the SMIL files identified correctly.
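To see why a file like the sample above can fall outside that window, the gap can be measured directly. This is only a rough sketch: the function name smil_gap is made up for illustration, and it assumes the gap is counted from the byte after the closing ?> of the XML declaration to the < of the root <smil tag, which may not be exactly how the PRONOM signature counts it. It also assumes an ASCII-compatible encoding.

```python
from typing import Optional


def smil_gap(data: bytes) -> Optional[int]:
    """Return the number of bytes between the XML declaration and <smil.

    Hypothetical helper: returns None when either marker is missing.
    """
    decl_end = data.find(b"?>")
    if decl_end == -1:
        return None
    decl_end += 2  # step past the "?>" itself
    # "<smil" does not match "<!DOCTYPE smil", so this finds the root tag.
    root = data.find(b"<smil", decl_end)
    if root == -1:
        return None
    return root - decl_end


if __name__ == "__main__":
    with open("master.smil", "rb") as handle:  # path is an example
        print(smil_gap(handle.read(1024)))
```

For the DAISY 2.02 sample shown above, the DOCTYPE line on its own is already longer than 86 bytes, so the measured gap lands outside the 20-86 byte range.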

With this update all the files in a DAISY 2.0 book should be identified individually, but as a set of files they make up the DAISY 2.0 format. This format requires the ncc.html file be present at the root of the folder or CD, so this file will aid in the manual identification of this format.

DAISY 3 was released in 2002 and standardized under the name ANSI/NISO Z39.86-2002. It has been revised a couple of times, with the current revision being 2012. This update adds more functionality to the format, with many new optional and required formats/files included in the folder. Here is a simple example:

ls -la "DAISY3 Export"
total 784
drwx------ 1 tyler staff 16384 Sep 25 22:06 .
drwx------ 1 tyler staff 16384 Sep 25 22:06 ..
-rwx------@ 1 tyler staff 979 Sep 25 22:05 0001.smil
-rwx------ 1 tyler staff 228413 Sep 25 22:05 aud0001.mp3
-rwx------ 1 tyler staff 1014 Sep 25 22:05 navigation.ncx
-rwx------ 1 tyler staff 1881 Sep 25 22:05 package.opf
-rwx------ 1 tyler staff 7838 Nov 2 2020 tpbnarrator.res
-rwx------ 1 tyler staff 117656 Nov 2 2020 tpbnarrator_res.mp3

The SMIL format is still included, along with MP3s, but we have some additional formats. The NCX or “Navigation Control File”, the OPF or “Package file”, and the RES or “Resource file” are a few of them. The NCX file is the first file accessed as it lays out the navigation for the whole DTB. It is also XML:

cat "DAISY3 Export/navigation.ncx"
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE ncx PUBLIC "-//NISO//DTD ncx 2005-1//EN" "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">
<ncx
version="2005-1"
xml:lang="en-US" xmlns="http://www.daisy.org/z3986/2005/ncx/">

This file is only recognized by DROID as a standard XML file. It probably should have its own identification like SMIL, and with a root tag of <ncx> that should be fairly easy to add.

The Package file, with the extension OPF, is actually a format used by the openebook group, not to be confused with a format used by the Open Preservation Foundation 🤣. DAISY 3 uses the Open Packaging Format, and a DTB conforming to this standard must include exactly one Package File, which must be a valid XML 1.0 document conforming to the OEBF Publication Structure 1.2 package.

cat "DAISY3 Export/package.opf"
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE package PUBLIC "+//ISBN 0-9673008-1-9//DTD OEB 1.2 Package//EN" "http://openebook.org/dtds/oeb-1.2/oebpkg12.dtd">
<package
unique-identifier="uid" xmlns="http://openebook.org/namespaces/oeb-package/1.0/">
<metadata>
<dc-metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oebpackage="http://openebook.org/namespaces/oeb-package/1.0/">
<dc:Identifier
id="uid">589c550e-303b-4c0d-9921-ae76d782fd53</dc:Identifier>
<dc:Format>ANSI/NISO Z39.86-2005</dc:Format>
<dc:Title>Obi Project</dc:Title>
<dc:Publisher>N/A</dc:Publisher>
<dc:Language>en-US</dc:Language>
<dc:Creator>Creator name</dc:Creator>
<dc:Date>2024-09-25</dc:Date>
</dc-metadata>

The OPF format is also unknown to PRONOM, so these files identify as standard XML as well. The root tag of “<package>” could be used elsewhere, so the signature may need to reference the OEB package information.

The RES Resource file is also standard XML and can be identified through its root tag of “<resources>” and its resources DOCTYPE.

cat "DAISY3 Export/tpbnarrator.res"
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE resources PUBLIC "-//NISO//DTD resource 2005-1//EN" "http://www.daisy.org/z3986/2005/resource-2005-1.dtd" []>
<resources xmlns="http://www.daisy.org/z3986/2005/resource/" version="2005-1">

<!-- SKIPPABLE NCX -->

<scope nsuri="http://www.daisy.org/z3986/2005/ncx/">
<nodeSet id="ns001" select="//smilCustomTest[@bookStruct='LINE_NUMBER']">
<resource xml:lang="en" id="r001">
<text>Row</text>
<audio src="tpbnarrator_res.mp3" clipBegin="0:00:02.379" clipEnd="0:00:03.416" />
</resource>
</nodeSet>
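
The NCX, OPF and RES samples above all carry a DOCTYPE with a distinctive public identifier, so one way to tell them apart is to read the root name and public identifier from the first bytes of the file. The sketch below is not a PRONOM signature, just an illustration of the idea; the function name, the labels it returns and the substrings it matches on are my own choices based on the three samples shown here.

```python
import re

# Captures the root element name and the public identifier of a DOCTYPE.
DOCTYPE_RE = re.compile(
    rb'<!DOCTYPE\s+([A-Za-z_][\w.-]*)\s+PUBLIC\s+"([^"]*)"'
)


def classify_daisy_xml(head: bytes) -> str:
    """Label the first bytes of an XML file (labels are illustrative)."""
    match = DOCTYPE_RE.search(head)
    if match is None:
        return "XML, no DOCTYPE found"
    root, public_id = match.group(1), match.group(2)
    if root == b"smil":
        return "SMIL"
    if root == b"ncx" and b"DTD ncx" in public_id:
        return "DAISY 3 NCX"
    # <package> is a common root name, so also require the OEB identifier.
    if root == b"package" and b"DTD OEB" in public_id:
        return "OEB Package (OPF)"
    if root == b"resources" and b"DTD resource" in public_id:
        return "DAISY 3 Resource"
    return "other XML"
```

Checking the public identifier as well as the root name is what keeps a generic <package> or <resources> document from being mislabelled.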

Now, adding these DAISY 3.0 formats will greatly improve the identification of this complex format. But we run into a problem with some of the software that generates these DAISY files: some packages include files which are not required by the format but are used by the authoring software. These can include CSS files for formatting, additional XML, XSL files, DTDs, and, for DAISY files created by the PlexTalk software, additional project files.

ls -la MasterCD/AfterBuild 
total 7520
drwx------@ 1 tyler staff 16384 Sep 24 19:34 .
drwx------@ 1 tyler staff 16384 Sep 25 22:11 ..
-rwx------@ 1 tyler staff 6688 Sep 25 01:32 ImdPhrInfo.imph
-rwx------@ 1 tyler staff 3773 Sep 25 01:32 ImdTxtTabl.imtt
-rwx------@ 1 tyler staff 1276 Sep 25 01:32 Ncc.imdn
-rwx------@ 1 tyler staff 3716618 Sep 25 01:32 a000001.mp3
-rwx------@ 1 tyler staff 4352 Sep 25 01:32 ncc.html
-rwx------@ 1 tyler staff 1015 Sep 25 01:32 ptk000001.smil
-rwx------@ 1 tyler staff 938 Sep 25 01:32 ptk000002.smil

The ncc.html file is here, indicating a DAISY 2.0 format, along with an MP3 and SMIL files, but there are some additional formats as well.
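
The reasoning used on these folder listings can be written down as a small folder-level check. This is a sketch of the manual process, not how DROID or any other tool works: the function name is hypothetical, the rules are only the ones mentioned in this post (ncc.html plus SMIL for DAISY 2.x; an OPF package file, an NCX file and SMIL for DAISY 3), and matching names case-insensitively is my own assumption.

```python
import os
from typing import Iterable


def guess_daisy_version(file_names: Iterable[str]) -> str:
    """Guess the DAISY flavour from the names at the root of a folder."""
    names = {name.lower() for name in file_names}
    has_smil = any(name.endswith(".smil") for name in names)
    if "ncc.html" in names and has_smil:
        return "DAISY 2.x"
    has_opf = any(name.endswith(".opf") for name in names)
    has_ncx = any(name.endswith(".ncx") for name in names)
    if has_opf and has_ncx and has_smil:
        return "DAISY 3"
    return "unknown"


if __name__ == "__main__":
    # Example folder name taken from the listing above.
    print(guess_daisy_version(os.listdir("MasterCD/AfterBuild")))
```

Extra files such as the PlexTalk project files are simply ignored, which matches the idea that they are not required by the format itself.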

In addition, when creating a project, four files with the extensions Ncc.imdn, ImdPhrInfo.imph, ImdTxtTabl.imtt, and METADATA.ini are automatically created. These files are called “Plextalk project files.” They store table of contents information, etc. (Plextalk project files generated by older versions of this product do not have METADATA.ini.)

http://www.plextalk.com/jp/dw_data/PRSStd/PLEX_RS_UM.html

These four files may not be crucial to the playing of the DAISY format, but they are important to the PlexTalk software.

hexdump -C ImdPhrInfo.imph | head
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000020 ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 f0 a3 0d 00 00 00 00 00 |................|
00000040 a3 06 00 00 a4 06 00 00 00 00 00 00 53 00 00 00 |............S...|
00000050 ff ff ff ff 01 00 00 00 03 00 00 00 00 00 00 00 |................|
00000060 00 00 00 00 00 00 00 00 c5 11 00 00 20 1a 00 00 |............ ...|
00000070 e5 2b 00 00 00 00 00 00 63 00 00 00 ff ff ff ff |.+......c.......|
00000080 02 00 00 00 04 00 00 00 00 00 00 00 00 00 00 00 |................|
00000090 00 00 00 00 e5 2b 00 00 d6 0b 00 00 bb 37 00 00 |.....+.......7..|

hexdump -C ImdTxtTabl.imtt | head
00000000 17 00 00 00 32 30 30 34 2f 30 35 2f 33 31 2f 31 |....2004/05/31/1|
00000010 36 3a 36 3a 34 37 2e 30 30 30 00 03 00 00 00 65 |6:6:47.000.....e|
00000020 6e 00 0b 00 00 00 69 73 6f 2d 38 38 35 39 2d 31 |n.....iso-8859-1|
00000030 00 0d 00 00 00 5a 3a 2f 42 6f 6f 6b 44 69 72 34 |.....Z:/BookDir4|
00000040 2f 00 0d 00 00 00 5a 3a 2f 42 6f 6f 6b 44 69 72 |/.....Z:/BookDir|
00000050 34 2f 00 0c 00 00 00 61 30 30 30 30 30 31 2e 6d |4/.....a000001.m|
00000060 70 33 00 0c 00 00 00 61 30 30 30 30 30 31 2e 6d |p3.....a000001.m|
*
00000980 70 33 00 08 00 00 00 48 65 61 64 69 6e 67 00 01 |p3.....Heading..|
00000990 00 00 00 00 08 00 00 00 48 65 61 64 69 6e 67 00 |........Heading.|

hexdump -C Ncc.imdn | head
00000000 01 ff 00 ff c4 00 00 00 3c 00 00 00 2c 00 00 00 |........<...,...|
00000010 14 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 00 00 00 49 6d 64 54 78 74 54 61 62 6c 2e 69 |....ImdTxtTabl.i|
00000030 6d 74 74 00 00 00 00 00 00 00 00 00 00 00 00 00 |mtt.............|
00000040 00 00 00 00 49 6d 64 50 68 72 49 6e 66 6f 2e 69 |....ImdPhrInfo.i|
00000050 6d 70 68 00 00 00 00 00 00 00 00 00 00 00 00 00 |mph.............|
00000060 00 00 00 00 04 00 00 00 00 fa 00 00 44 ac 00 00 |............D...|
00000070 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000080 00 00 00 00 01 00 00 00 08 00 00 00 12 00 00 00 |................|
00000090 03 00 00 00 00 00 00 00 01 00 00 00 ff ff ff ff |................|

I don’t have a METADATA.ini file to research, but I will be honest, these PlexTalk files will be hard to identify from their contents.

Looking at the IMPH file, there aren’t a lot of bytes which might serve as format magic bytes. But I do see some patterns. The first 40 bytes all seem to be the same.

00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 FFFFFFFF FFFFFFFF

But making a signature from only 00 and FF might clash with other formats. It does appear that the 4 bytes FFFFFFFF recur with 40 bytes between them. This might be precise enough if we repeat it a couple of times.
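
As a rough illustration of that idea, the check below looks for the 40-byte header and then for FFFFFFFF after each following 40-byte stretch. The offsets come only from the single hexdump shown above (FFFFFFFF at 0x50 and 0x7c), so treat them as an assumption rather than a known property of the format; the function name is hypothetical.

```python
HEADER = b"\x00" * 32 + b"\xff" * 8  # the 40 bytes shown above
MARKER = b"\xff\xff\xff\xff"
GAP = 40  # bytes between the end of one marker and the start of the next


def looks_like_imph(data: bytes, repeats: int = 2) -> bool:
    """Return True if data follows the pattern seen in the IMPH sample."""
    if not data.startswith(HEADER):
        return False
    pos = len(HEADER)
    for _ in range(repeats):
        pos += GAP  # skip the 40 bytes before the next marker
        if data[pos:pos + len(MARKER)] != MARKER:
            return False
        pos += len(MARKER)
    return True
```

Raising repeats makes the check stricter but also requires a longer file, so very small IMPH files might need a lower value.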

The IMTT file is different. It appears to have information on the name, character set and all the files in the DAISY package. The first 4 bytes in my 14 samples are either 17000000 or 18000000. Not knowing what the 17 or 18 refers to, I am hesitant to use it for identification. In between some of the data there are some consistent bytes, but at different offsets.


hexdump -C ImdTxtTabl.imtt | head
00000000 18 00 00 00 54 69 74 6c 65 00 35 39 2d 31 00 31 |....Title.59-1.1|
00000010 35 3a 35 34 3a 35 39 2e 32 36 30 00 03 00 00 00 |5:54:59.260.....|
00000020 65 6e 00 0b 00 00 00 69 73 6f 2d 38 38 35 39 2d |en.....iso-8859-|
00000030 31 00 01 00 00 00 00 01 00 00 00 00 01 00 00 00 |1...............|
00000040 00 01 00 00 00 00 01 00 00 00 00 01 00 00 00 00 |................|
00000050 01 00 00 00 00 01 00 00 00 00 0c 00 00 00 4d 61 |..............Ma|
00000060 72 69 6f 6e 20 53 79 6d 65 00 28 00 00 00 4d 69 |rion Syme.(...Mi|
00000070 6e 75 74 65 73 20 6f 66 20 74 68 65 20 43 6f 6d |nutes of the Com|
00000080 6d 69 74 74 65 65 20 4d 65 65 74 69 6e 67 20 32 |mittee Meeting 2|
00000090 34 30 35 30 34 00 08 00 00 00 48 65 61 64 69 6e |40504.....Headin|

hexdump -C ImdTxtTabl.imtt | head
00000000 17 00 00 00 32 30 30 34 2f 30 35 2f 33 31 2f 31 |....2004/05/31/1|
00000010 36 3a 36 3a 34 37 2e 30 30 30 00 03 00 00 00 65 |6:6:47.000.....e|
00000020 6e 00 0b 00 00 00 69 73 6f 2d 38 38 35 39 2d 31 |n.....iso-8859-1|
00000030 00 0d 00 00 00 5a 3a 2f 42 6f 6f 6b 44 69 72 34 |.....Z:/BookDir4|
00000040 2f 00 0d 00 00 00 5a 3a 2f 42 6f 6f 6b 44 69 72 |/.....Z:/BookDir|
00000050 34 2f 00 0c 00 00 00 61 30 30 30 30 30 31 2e 6d |4/.....a000001.m|
00000060 70 33 00 0c 00 00 00 61 30 30 30 30 30 31 2e 6d |p3.....a000001.m|

Not sure what any of it means, but might be good enough for a signature.
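
One possible reading of the two dumps above, based only on these bytes and not on any PlexTalk documentation: each text value seems to be preceded by a 4-byte little-endian length that counts the string plus its terminating 00. In the first dump 0x17 is 23, which is the 22 characters of "2004/05/31/16:6:47.000" plus the null, and 03 00 00 00 and 0b 00 00 00 sit in front of "en" and "iso-8859-1" in the same way. The sketch below walks a file under that assumption; the function name and the limit parameter are hypothetical.

```python
import struct
from typing import List


def read_length_prefixed(data: bytes, limit: int = 8) -> List[bytes]:
    """Read up to `limit` length-prefixed, null-terminated strings.

    Assumes a 4-byte little-endian length that includes the final 00.
    Stops at the first record that does not fit that shape.
    """
    fields: List[bytes] = []
    pos = 0
    while pos + 4 <= len(data) and len(fields) < limit:
        (length,) = struct.unpack_from("<I", data, pos)
        end = pos + 4 + length
        if length == 0 or end > len(data) or data[end - 1] != 0:
            break
        fields.append(data[pos + 4:end - 1])  # drop the trailing null
        pos = end
    return fields
```

If this reading holds across more samples, a signature could look for the 03 00 00 00 language field or the character set string rather than the variable first value.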

Now the IMDN files might be a little easier:

hexdump -C Ncc.imdn | head
00000000 01 ff 00 ff d4 00 00 00 3c 00 00 00 2c 00 00 00 |........<...,...|
00000010 14 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 00 00 00 49 6d 64 54 78 74 54 61 62 6c 2e 69 |....ImdTxtTabl.i|
00000030 6d 74 74 00 00 00 00 00 00 00 00 00 00 00 00 00 |mtt.............|
00000040 00 00 00 00 49 6d 64 50 68 72 49 6e 66 6f 2e 69 |....ImdPhrInfo.i|
00000050 6d 70 68 00 00 00 00 00 00 00 00 00 00 00 00 00 |mph.............|
00000060 00 00 00 00 04 00 00 00 00 7d 00 00 22 56 00 00 |.........}.."V..|
00000070 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000080 00 00 00 00 01 00 00 00 28 00 00 00 28 00 00 00 |........(...(...|
00000090 00 00 00 00 00 00 00 00 28 00 00 00 ff ff ff ff |........(.......|

This format directly names the two other files. It should be easy to look for the two file names in the header. The NCC HTML file in DAISY 2.0 and the NCX XML file in DAISY 3.0 are directory files, so it makes sense this file would do the same.
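
A minimal sketch of that check, assuming the two names always appear in this order near the start of the file. In both dumps above they begin at offsets 0x24 and 0x44, but the code only requires them within the first 0x60 bytes, since I cannot confirm the offsets are fixed. The function name and the window size are my own choices.

```python
TXT_NAME = b"ImdTxtTabl.imtt\x00"
PHR_NAME = b"ImdPhrInfo.imph\x00"


def looks_like_imdn(data: bytes, window: int = 0x60) -> bool:
    """Return True if both project file names appear, in order, in the header."""
    head = data[:window]
    first = head.find(TXT_NAME)
    if first == -1:
        return False
    # The second name must come after the first one.
    return head.find(PHR_NAME, first + len(TXT_NAME)) != -1
```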

Not sure if these signatures will hold up over time, but they are a start. It would be nice if all the files we are given to preserve would have convenient static magic bytes, but alas, many do not and we have to guess.

These DAISY formats illustrate a problem in preservation that doesn’t quite have a good solution. Each of these files is individually unique and can be identified, but as a whole they represent another unique format. Tying formats together to record their interdependence will be no small task, but it will be necessary, not only to understand the format, but to avoid the separating, renaming, or rearranging of files that breaks that interdependence.

I have added the update to SMIL and new signatures for the other formats to my GitHub repository. Feel free to test and change if you find additional samples or information.