Daisy

A single file can often be self-contained, having all that is needed to render itself with the correct software, but more and more often files need other files to function properly. Sometimes these groups of dependent files live within a container, such as a DOCX or ePub, but they can also be found sitting nicely together in a folder. I say nicely, partly because the structure works, that is, until the files are treated as individual files and renamed or moved around, breaking their interdependence.

In the case of many Apple bundle files, they appear as a single file on macOS, but as a folder on Windows or Linux. This can be very confusing. In other cases, such as the DAISY Digital Talking Book format, it is simply a folder or disc with a few or many files within.

Current tools used to identify file formats, such as DROID, look at individual files, not groups of files to determine format. Each file within a folder may have a unique format, but when grouped with other specific formats they become something more. We will have to work on enhancing current tools if we want to avoid breaking these format types and losing their ability to render properly.

DAISY, or Digital Accessible Information System, is a type of digital book. The format was originally conceived in 1988 as a method of creating a talking book, designed to give those who are visually impaired the ability to listen to books. It wasn’t until 1996 that the DAISY Consortium was created to take the technology to those who needed it. The original version of the DAISY format, from 1994, was proprietary, but once the consortium was formed they decided to adopt open standards for the format, and in 1998 the DAISY 2.0 standard was released. You can read more on the Library of Congress Format Description page.

Let’s take a look at a folder containing a DAISY 2.0 book.

ls -la "DAISY 2.02 export"
total 536
drwx------ 1 tyler staff 16384 Sep 25 22:06 .
drwx------ 1 tyler staff 16384 Sep 25 22:06 ..
-rwx------@ 1 tyler staff 1090 Sep 25 22:05 0002.smil
-rwx------ 1 tyler staff 228413 Sep 25 22:05 aud0001.mp3
-rwx------@ 1 tyler staff 672 Sep 25 22:05 master.smil
-rwx------ 1 tyler staff 1703 Sep 25 22:05 ncc.html

We can see three different formats in this folder: the obvious, well-known MP3 and an HTML file, plus two files with the extension SMIL.

“Synchronized Multimedia Integration Language”, or SMIL, is a W3C XML standard used to describe multimedia presentations, currently in its third version. It is used in the DAISY DTB as well as other applications, but we will focus on DAISY. A SMIL file has this structure:

<?xml version="1.0"?>
<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 1.0//EN" "http://www.w3.org/TR/REC-smil/SMIL10.dtd">
<smil>
<head>
<meta name="dc:title" content="Obi Project" />
<meta name="dc:identifier" content="589c550e-303b-4c0d-9921-ae76d782fd53" />
<meta name="ncc:generator" content="Obi v5.0.0.0 with toolkit: UrakawaSDK.core v2.0.0.0 (http://urakawa.sf.net/obi)" />
<meta name="dc:format" content="Daisy 2.02" />
<meta name="ncc:timeInThisSmil" content="00:00:28" />
<layout>
<region id="textView" />
</layout>
</head>
<body>
<ref title="Testing" src="0002.smil" id="ms_0002" />
</body>
</smil>

A standard XML file with a link to a SMIL DTD and a root tag of <smil>. This format is recognized by PRONOM as fmt/205, although it is often identified as a standard XML file instead. It seems the signature was created with a small offset that works with some SMIL files: it only allows a gap of 20-86 bytes between the end of the XML declaration and the start of the <smil> tag, which is not enough to allow for different character sets and full DTD URLs. We will have to increase this gap in order to get all the SMIL files identified correctly.
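
To illustrate the kind of check involved, here is a minimal Python sketch (not the actual PRONOM signature syntax) that looks for the XML declaration and the <smil> root tag within a more generous window; the file name is just the sample from the listing above.

import re
from pathlib import Path

SMIL_ROOT = re.compile(rb"<smil[\s>]")

def looks_like_smil(path, window=1024):
    # Read a generous window so character set declarations and long DTD
    # URLs between the XML declaration and the root tag do not break the match.
    data = Path(path).read_bytes()[:window]
    return data.lstrip().startswith(b"<?xml") and SMIL_ROOT.search(data) is not None

print(looks_like_smil("DAISY 2.02 export/master.smil"))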

With this update all the files in a DAISY 2.0 book should be identified individually, but as a set of files they make up the DAISY 2.0 format. The format requires the ncc.html file to be present at the root of the folder or CD, so this file can aid in the manual identification of the format.
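
Since no current tool looks at the folder as a whole, a rough folder-level check is easy to sketch in Python. This is only an assumption based on the DAISY 2.02 requirement for ncc.html at the root plus the SMIL files seen above, not a formal validation.

from pathlib import Path

def looks_like_daisy2(folder):
    # The spec requires ncc.html at the root; a talking book will normally
    # also carry at least one SMIL file alongside it.
    root = Path(folder)
    return (root / "ncc.html").exists() and any(root.glob("*.smil"))

print(looks_like_daisy2("DAISY 2.02 export"))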

DAISY 3 was released in 2002 and standardized as ANSI/NISO Z39.86-2002. It has been revised a couple of times, with the current revision being 2012. This update adds more functionality to the format, with many new optional and required files included in the folder. Here is a simple example:

ls -la "DAISY3 Export"
total 784
drwx------ 1 tyler staff 16384 Sep 25 22:06 .
drwx------ 1 tyler staff 16384 Sep 25 22:06 ..
-rwx------@ 1 tyler staff 979 Sep 25 22:05 0001.smil
-rwx------ 1 tyler staff 228413 Sep 25 22:05 aud0001.mp3
-rwx------ 1 tyler staff 1014 Sep 25 22:05 navigation.ncx
-rwx------ 1 tyler staff 1881 Sep 25 22:05 package.opf
-rwx------ 1 tyler staff 7838 Nov 2 2020 tpbnarrator.res
-rwx------ 1 tyler staff 117656 Nov 2 2020 tpbnarrator_res.mp3

The SMIL format is still included, along with MP3s, but we have some additional formats: the NCX or “Navigation Control File”, the OPF or “Package file”, and the RES or “Resource file”, to name a few. The NCX file is the first file accessed, as it lays out the navigation for the whole DTB. It is also XML:

cat "DAISY3 Export/navigation.ncx"
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE ncx PUBLIC "-//NISO//DTD ncx 2005-1//EN" "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">
<ncx
version="2005-1"
xml:lang="en-US" xmlns="http://www.daisy.org/z3986/2005/ncx/">

This file is only recognized by DROID as a standard XML file. It probably should have unique identification like SMIL, and with a root tag of <ncx> that should be fairly easy to add.

The Package file, with the extension OPF, is actually a format used by the openebook group, not to be confused with a format used by the Open Preservation Foundation 🤣. The Open Packaging Format is used here, and a DTB conforming to this standard must include exactly one Package File, which must be a valid XML 1.0 document conforming to the OEBF Publication Structure 1.2 package.

cat "DAISY3 Export/package.opf"
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE package PUBLIC "+//ISBN 0-9673008-1-9//DTD OEB 1.2 Package//EN" "http://openebook.org/dtds/oeb-1.2/oebpkg12.dtd">
<package
unique-identifier="uid" xmlns="http://openebook.org/namespaces/oeb-package/1.0/">
<metadata>
<dc-metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oebpackage="http://openebook.org/namespaces/oeb-package/1.0/">
<dc:Identifier
id="uid">589c550e-303b-4c0d-9921-ae76d782fd53</dc:Identifier>
<dc:Format>ANSI/NISO Z39.86-2005</dc:Format>
<dc:Title>Obi Project</dc:Title>
<dc:Publisher>N/A</dc:Publisher>
<dc:Language>en-US</dc:Language>
<dc:Creator>Creator name</dc:Creator>
<dc:Date>2024-09-25</dc:Date>
</dc-metadata>

The OPF format is also unknown to PRONOM, and these files identify as standard XML as well. The root tag of “<package>” could be used elsewhere, so the signature may need to reference the OEB package information.

The RES Resource file is also standard XML and can be identified through its root tag of “<resources>” and its resource DOCTYPE.

cat "DAISY3 Export/tpbnarrator.res"
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE resources PUBLIC "-//NISO//DTD resource 2005-1//EN" "http://www.daisy.org/z3986/2005/resource-2005-1.dtd" []>
<resources xmlns="http://www.daisy.org/z3986/2005/resource/" version="2005-1">

<!-- SKIPPABLE NCX -->

<scope nsuri="http://www.daisy.org/z3986/2005/ncx/">
<nodeSet id="ns001" select="//smilCustomTest[@bookStruct='LINE_NUMBER']">
<resource xml:lang="en" id="r001">
<text>Row</text>
<audio src="tpbnarrator_res.mp3" clipBegin="0:00:02.379" clipEnd="0:00:03.416" />
</resource>
</nodeSet>
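
Since the NCX, OPF, and RES files are all plain XML distinguished mainly by their root elements, a signature proposal can be prototyped with a few lines of Python. The mapping of root tags to format names below is my own shorthand, not PRONOM wording.

import xml.etree.ElementTree as ET

ROOTS = {"ncx": "DAISY 3 Navigation Control file (NCX)",
         "package": "Open eBook Package file (OPF)",
         "resources": "DAISY 3 Resource file (RES)"}

def daisy3_part(path):
    # Parse just enough to get the root element, stripping any namespace.
    root_tag = ET.parse(path).getroot().tag.split("}")[-1]
    return ROOTS.get(root_tag, "unrecognized XML")

for name in ("navigation.ncx", "package.opf", "tpbnarrator.res"):
    print(name, "->", daisy3_part("DAISY3 Export/" + name))

As noted above, a root tag of <package> on its own may be too generic, so a real signature would also want to check the OEB DOCTYPE or namespace.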

Now, adding these DAISY 3.0 formats will greatly increase the identification of this complex format. But we run into a problem with some of the software out there which generates these DAISY files: some include files not required by the format, but added for use by the particular software. This can include CSS files for formatting, additional XML, XSL files, DTDs, and, for DAISY files created by the PlexTalk software, additional project files.

ls -la MasterCD/AfterBuild 
total 7520
drwx------@ 1 tyler staff 16384 Sep 24 19:34 .
drwx------@ 1 tyler staff 16384 Sep 25 22:11 ..
-rwx------@ 1 tyler staff 6688 Sep 25 01:32 ImdPhrInfo.imph
-rwx------@ 1 tyler staff 3773 Sep 25 01:32 ImdTxtTabl.imtt
-rwx------@ 1 tyler staff 1276 Sep 25 01:32 Ncc.imdn
-rwx------@ 1 tyler staff 3716618 Sep 25 01:32 a000001.mp3
-rwx------@ 1 tyler staff 4352 Sep 25 01:32 ncc.html
-rwx------@ 1 tyler staff 1015 Sep 25 01:32 ptk000001.smil
-rwx------@ 1 tyler staff 938 Sep 25 01:32 ptk000002.smil

The ncc.html file is here, indicating the DAISY 2.0 format, along with an MP3 and SMIL files, but there are also some additional formats.

In addition, when creating a project, four files with the extensions Ncc.imdn, ImdPhrInfo.imph, ImdTxtTabl.imtt, and METADATA.ini are automatically created. These files are called “Plextalk project files.” They store table of contents information, etc. (Plextalk project files generated by older versions of this product do not have METADATA.ini.)

http://www.plextalk.com/jp/dw_data/PRSStd/PLEX_RS_UM.html

These four files may not be crucial to the playing of the Daisy format, but they are important to the PlexTalk software.

hexdump -C ImdPhrInfo.imph | head
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000020 ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 f0 a3 0d 00 00 00 00 00 |................|
00000040 a3 06 00 00 a4 06 00 00 00 00 00 00 53 00 00 00 |............S...|
00000050 ff ff ff ff 01 00 00 00 03 00 00 00 00 00 00 00 |................|
00000060 00 00 00 00 00 00 00 00 c5 11 00 00 20 1a 00 00 |............ ...|
00000070 e5 2b 00 00 00 00 00 00 63 00 00 00 ff ff ff ff |.+......c.......|
00000080 02 00 00 00 04 00 00 00 00 00 00 00 00 00 00 00 |................|
00000090 00 00 00 00 e5 2b 00 00 d6 0b 00 00 bb 37 00 00 |.....+.......7..|

hexdump -C ImdTxtTabl.imtt | head
00000000 17 00 00 00 32 30 30 34 2f 30 35 2f 33 31 2f 31 |....2004/05/31/1|
00000010 36 3a 36 3a 34 37 2e 30 30 30 00 03 00 00 00 65 |6:6:47.000.....e|
00000020 6e 00 0b 00 00 00 69 73 6f 2d 38 38 35 39 2d 31 |n.....iso-8859-1|
00000030 00 0d 00 00 00 5a 3a 2f 42 6f 6f 6b 44 69 72 34 |.....Z:/BookDir4|
00000040 2f 00 0d 00 00 00 5a 3a 2f 42 6f 6f 6b 44 69 72 |/.....Z:/BookDir|
00000050 34 2f 00 0c 00 00 00 61 30 30 30 30 30 31 2e 6d |4/.....a000001.m|
00000060 70 33 00 0c 00 00 00 61 30 30 30 30 30 31 2e 6d |p3.....a000001.m|
*
00000980 70 33 00 08 00 00 00 48 65 61 64 69 6e 67 00 01 |p3.....Heading..|
00000990 00 00 00 00 08 00 00 00 48 65 61 64 69 6e 67 00 |........Heading.|

hexdump -C Ncc.imdn | head
00000000 01 ff 00 ff c4 00 00 00 3c 00 00 00 2c 00 00 00 |........<...,...|
00000010 14 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 00 00 00 49 6d 64 54 78 74 54 61 62 6c 2e 69 |....ImdTxtTabl.i|
00000030 6d 74 74 00 00 00 00 00 00 00 00 00 00 00 00 00 |mtt.............|
00000040 00 00 00 00 49 6d 64 50 68 72 49 6e 66 6f 2e 69 |....ImdPhrInfo.i|
00000050 6d 70 68 00 00 00 00 00 00 00 00 00 00 00 00 00 |mph.............|
00000060 00 00 00 00 04 00 00 00 00 fa 00 00 44 ac 00 00 |............D...|
00000070 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000080 00 00 00 00 01 00 00 00 08 00 00 00 12 00 00 00 |................|
00000090 03 00 00 00 00 00 00 00 01 00 00 00 ff ff ff ff |................|

I don’t have a METADATA.ini file to research, but I will be honest, these PlexTalk files will be hard to identify from their contents.

Looking at the IMPH file, there aren’t a lot of bytes which might serve as format magic bytes. But I do see some patterns. The first 40 bytes all seem to be the same.

00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 FFFFFFFF FFFFFFFF

But making a signature from only 00 and FF might clash with other formats. It does appear that the 4 bytes FFFFFFFF occur every 40 bytes. That might be precise enough if we repeat it a couple of times.
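
As a sketch of what such a signature might look for, here is a hypothetical content check in Python based only on the handful of samples above (32 zero bytes, 8 bytes of 0xFF, then recurring FF FF FF FF runs); it is an assumption, not a documented magic number.

def looks_like_imph(path):
    with open(path, "rb") as f:
        head = f.read(0x100)
    # First 0x28 bytes observed in the samples: 32 x 00 followed by 8 x FF.
    if head[:0x20] != b"\x00" * 0x20 or head[0x20:0x28] != b"\xff" * 8:
        return False
    # Require at least one more FF FF FF FF run further in to reduce clashes.
    return b"\xff\xff\xff\xff" in head[0x28:]

print(looks_like_imph("ImdPhrInfo.imph"))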

The IMTT file is different. It appears to have information on the name, character set, and all the files in the DAISY package. The first 4 bytes in my 14 samples start with either 17000000 or 18000000. Not knowing what the 17 or 18 refers to, I am hesitant to use it for identification. In between some of the data there are some consistent bytes, but at different offsets.


hexdump -C ImdTxtTabl.imtt | head
00000000 18 00 00 00 54 69 74 6c 65 00 35 39 2d 31 00 31 |....Title.59-1.1|
00000010 35 3a 35 34 3a 35 39 2e 32 36 30 00 03 00 00 00 |5:54:59.260.....|
00000020 65 6e 00 0b 00 00 00 69 73 6f 2d 38 38 35 39 2d |en.....iso-8859-|
00000030 31 00 01 00 00 00 00 01 00 00 00 00 01 00 00 00 |1...............|
00000040 00 01 00 00 00 00 01 00 00 00 00 01 00 00 00 00 |................|
00000050 01 00 00 00 00 01 00 00 00 00 0c 00 00 00 4d 61 |..............Ma|
00000060 72 69 6f 6e 20 53 79 6d 65 00 28 00 00 00 4d 69 |rion Syme.(...Mi|
00000070 6e 75 74 65 73 20 6f 66 20 74 68 65 20 43 6f 6d |nutes of the Com|
00000080 6d 69 74 74 65 65 20 4d 65 65 74 69 6e 67 20 32 |mittee Meeting 2|
00000090 34 30 35 30 34 00 08 00 00 00 48 65 61 64 69 6e |40504.....Headin|

hexdump -C ImdTxtTabl.imtt | head
00000000 17 00 00 00 32 30 30 34 2f 30 35 2f 33 31 2f 31 |....2004/05/31/1|
00000010 36 3a 36 3a 34 37 2e 30 30 30 00 03 00 00 00 65 |6:6:47.000.....e|
00000020 6e 00 0b 00 00 00 69 73 6f 2d 38 38 35 39 2d 31 |n.....iso-8859-1|
00000030 00 0d 00 00 00 5a 3a 2f 42 6f 6f 6b 44 69 72 34 |.....Z:/BookDir4|
00000040 2f 00 0d 00 00 00 5a 3a 2f 42 6f 6f 6b 44 69 72 |/.....Z:/BookDir|
00000050 34 2f 00 0c 00 00 00 61 30 30 30 30 30 31 2e 6d |4/.....a000001.m|
00000060 70 33 00 0c 00 00 00 61 30 30 30 30 30 31 2e 6d |p3.....a000001.m|

Not sure what any of it means, but might be good enough for a signature.

Now the IMDN files might be a little easier:

hexdump -C Ncc.imdn | head
00000000 01 ff 00 ff d4 00 00 00 3c 00 00 00 2c 00 00 00 |........<...,...|
00000010 14 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 00 00 00 49 6d 64 54 78 74 54 61 62 6c 2e 69 |....ImdTxtTabl.i|
00000030 6d 74 74 00 00 00 00 00 00 00 00 00 00 00 00 00 |mtt.............|
00000040 00 00 00 00 49 6d 64 50 68 72 49 6e 66 6f 2e 69 |....ImdPhrInfo.i|
00000050 6d 70 68 00 00 00 00 00 00 00 00 00 00 00 00 00 |mph.............|
00000060 00 00 00 00 04 00 00 00 00 7d 00 00 22 56 00 00 |.........}.."V..|
00000070 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000080 00 00 00 00 01 00 00 00 28 00 00 00 28 00 00 00 |........(...(...|
00000090 00 00 00 00 00 00 00 00 28 00 00 00 ff ff ff ff |........(.......|

This format directly names the two other formats, so it should be easy to look for the two file names in the header. The NCC HTML file in DAISY 2.0 and the NCX XML file in DAISY 3.0 are directory files, so it makes sense this file would serve the same purpose.
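
A quick sketch of that idea in Python, using only the two filenames observed in the header of my sample (an assumption that may not hold for every PlexTalk project):

def looks_like_imdn(path):
    with open(path, "rb") as f:
        head = f.read(256)
    # Both companion project files are named near the start of the header.
    return b"ImdTxtTabl.imtt" in head and b"ImdPhrInfo.imph" in head

print(looks_like_imdn("Ncc.imdn"))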

Not sure if these signatures will hold up over time, but they are a start. It would be nice if all the files we are given to preserve would have convenient static magic bytes, but alas, many do not and we have to guess.

These DAISY formats illustrate a problem in preservation that doesn’t quite have a good solution. Each of these files is individually unique and can be identified, but as a whole they represent another unique format. Tying formats together to document their interdependence will be no small task, but it will be necessary, not only to understand the format, but to avoid separating, renaming, or rearranging the files and breaking that interdependence.

I have added the update to SMIL and new signatures for the other formats to my GitHub repository. Feel free to test and change if you find additional samples or information.

HFE

Last week I had the pleasure of attending the 20th annual iPres conference on digital preservation in Ghent, Belgium. I enjoyed hearing from many of my respected colleagues on many aspects of preservation, including one of my favorite topics, floppy disks. There were tutorials, lightning talks, and even a workshop, presented by Leontien Talboom, Elizabeth Kata, Chris Knowles, and myself. We titled the workshop “A Guide to Imaging Obscure Floppy Disk Formats”. The workshop grew out of a mutual interest in imaging Wang 5.25in word processor disks, but expanded to include imaging of Amstrad 3in disks, 240K Brother Typewriter Disks, and Macintosh 400/800k disks.

I brought my hand soldered FluxEngine board and others brought their Greaseweazle board to show off how imaging obscure and uncommon disks can be done on a budget.

Image taken during the workshop on a Mavica FD200 Floppy Disk Camera.

During the conference we talked a bit about the different types of hardware that can be used and the difference between a disk image and a flux image. There seems to be quite the exhaustive list of different file formats, some specific to a platform and others more generic. I recently did a blog post on the formats used by the Applesauce software, which have some unique features.

There are many disk image types which should be researched and added to PRONOM and other format description sites, but today let’s take a look at a generic format used by many tools.

The HxC Floppy Emulator file format, which uses the extension HFE, is a popular format used with floppy drive emulators. There is a lot of complexity in what is included in many of these image formats: some are simply a raw sector representation of the binary data on a disk, others contain the complete flux readings from a floppy disk. The HFE format contains a little more than a raw image, including a header, a track lookup table, and the bitstreams for each track, all with the purpose of emulating the physical media. The HFE format contains only a single pass over the data, where other formats may contain multiple readings of each track to get more complete data, which can be helpful for damaged or purposely copy-protected disks. You can read more on Ashley’s blog or in the Library of Congress format description.

HFE version list

When using the HxC Floppy Emulator software, you can open and save to many different formats, the main one being their native HFE format. It comes in five versions.

hexdump -C test01.hfe | head
00000000 48 58 43 50 49 43 46 45 00 53 02 00 e8 01 00 00 |HXCPICFE.S......|
00000010 07 01 01 00 ff ff ff ff ff ff ff ff ff ff ff ff |................|
00000020 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|

Above is a hexdump of the main SDCard HxC Floppy Emulator file format. The format specification shows the 8-byte header “HXCPICFE”. This is a unique pattern and should be all we need to make a robust signature for the format, but we do need to take into account the other HFE “versions” and see if they might clash or need to be identified separately.

hexdump -C test02-a2.hfe | head 
00000000 48 58 43 50 49 43 46 45 00 53 02 00 d0 03 00 00 |HXCPICFE.S......|
00000010 07 01 01 00 ff ff ff ff ff ff ff ff ff ff ff ff |................|
00000020 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|

The “A2” version of the format has the same header but some different bytes further into the file.

hexdump -C test03-rev2.hfe | head
00000000 48 58 43 50 49 43 46 45 01 53 02 00 00 00 00 00 |HXCPICFE.S......|
00000010 07 01 01 00 ff ff ff ff ff ff ff ff ff ff ff ff |................|
00000020 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|

The “Rev 2” version also has the same header, but if you look at the 9th byte you can see the value changed from 00 to 01, which, according to the specification, is the revision byte.

hexdump -C test04-rev3.hfe | head 
00000000 48 58 43 48 46 45 56 33 00 53 02 00 e8 01 00 00 |HXCHFEV3.S......|
00000010 07 01 01 00 ff ff ff ff ff ff ff ff ff ff ff ff |................|
00000020 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|

With “Rev 3” we see a change in the header with “HXCHFEV3” which appears to be referred to as HFEv3.

hexdump -C test05-stream.hfe | head 
00000000 48 78 43 5f 53 74 72 65 61 6d 5f 49 6d 61 67 65 |HxC_Stream_Image|
00000010 00 00 00 00 00 00 00 00 00 18 00 00 00 02 00 00 |................|
00000020 00 1a 00 00 53 00 00 00 02 00 00 00 40 9c 00 00 |....S.......@...|
00000030 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

This last format seems to be a special HxC stream image.

It seems the best option is to make three signatures to identify the three main headers. Additional software can be used to further parse the disk image. If you would like to see some sample images, you can download a bunch here. You can also take a look at my GitHub repository to see additional samples and a proposed set of signatures.
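
For anyone wanting to experiment before the signatures land in PRONOM, a small Python sketch can tell the three headers apart; the revision-byte reading follows the published HFE header layout, and the filenames are just the test files above.

def hfe_flavor(path):
    with open(path, "rb") as f:
        head = f.read(16)
    if head.startswith(b"HxC_Stream_Image"):
        return "HxC stream image"
    if head.startswith(b"HXCHFEV3"):
        return "HFE v3"
    if head.startswith(b"HXCPICFE"):
        # Byte 8 is the format revision: 0x00 for revision 1, 0x01 for revision 2.
        return "HFE v1, revision byte 0x%02X" % head[8]
    return "not an HFE image"

for sample in ("test01.hfe", "test03-rev2.hfe", "test04-rev3.hfe", "test05-stream.hfe"):
    print(sample, "->", hfe_flavor(sample))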

A2R / MOOF / WOZ

There seems to be a never-ending, growing list of disk image formats. Many have features which are specific to the media and format. If you have ever imaged an older Macintosh floppy you know they are special. Add in the copy protection which many early Apple II floppies have, and you need special drives, hardware, and a special format to store the floppy data.

When imaging special media, especially unique media, it is best practice to image the floppies at the magnetic flux level.

Floppy disks contain magnetic fluctuations which are measured and recorded using specialized equipment. A popular method is using a Kryoflux board, floppy drive, and software. The software communicates with a custom controller board connected to a floppy drive through USB. If you are interested in the different controller boards, a good list has been compiled here.

A Kryoflux, FluxEngine, or Greaseweazle can all image specialized disks like a Macintosh 800k floppy, but the best controller board for them is an Applesauce setup, as it is specifically designed for the task. With that task comes a few specialty formats.

A file format which can store flux data is a bit different than a regular disk image format. The flux data contains all the low-level recordings, which can then be interpreted into disk images much like the original floppy. In the case of an Applesauce flux image, it can contain all the small nuances of the original floppy, including any copy protection or other creative methods used by software vendors throughout the years. The format used for storing this flux data is the A2R format.

A2R is in its third iteration. Let’s take a look at the basics of the format.

hexdump -C Samplev3.a2r | head
00000000 41 32 52 33 ff 0a 0d 0a 49 4e 46 4f 25 00 00 00 |A2R3....INFO%...|
00000010 01 41 70 70 6c 65 73 61 75 63 65 20 76 31 2e 38 |.Applesauce v1.8|
00000020 38 2e 35 20 20 20 20 20 20 20 20 20 20 20 20 20 |8.5 |
00000030 20 02 01 01 00 52 57 43 50 e9 49 6e 01 01 24 f4 | ....RWCP.In..$.|
00000040 00 00 00 00 00 00 00 00 00 00 00 00 00 43 01 00 |.............C..|
00000050 00 01 27 3a 25 00 91 d9 00 00 21 20 21 21 21 21 |..':%.....! !!!!|
00000060 1f 21 21 21 21 1f 24 5e 24 1f 21 21 20 21 24 5c |.!!!!.$^$.!! !$\|
00000070 24 20 21 21 21 1f 24 5c 25 21 21 1f 21 21 23 5b |$ !!!.$\%!!.!!#[|
00000080 25 20 21 21 21 1f 21 22 23 3f 41 3f 26 3e 43 3f |% !!!.!"#?A?&>C?|
00000090 43 5f 41 27 3d 61 41 27 3d 61 3f 28 3e 61 3f 26 |C_A'=aA'=a?(>a?&|

hexdump -C Samplev2.a2r | head
00000000 41 32 52 32 ff 0a 0d 0a 49 4e 46 4f 24 00 00 00 |A2R2....INFO$...|
00000010 01 41 70 70 6c 65 73 61 75 63 65 20 76 31 2e 31 |.Applesauce v1.1|
00000020 2e 36 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |.6 |
00000030 20 02 01 01 53 54 52 4d 75 17 5d 01 00 01 e6 da | ...STRMu.].....|
00000040 00 00 83 a9 12 00 12 1e 11 13 1e 13 1e 13 11 1f |................|
00000050 21 1f 11 13 1c 14 1e 30 14 20 1e 14 1e 14 1c 14 |!......0. ......|
00000060 1c 13 11 20 21 1f 11 11 0f 13 1e 14 1c 14 2e 21 |... !..........!|
00000070 13 1e 13 1e 14 1e 11 11 20 21 1f 11 11 13 1e 1f |........ !......|
00000080 13 20 30 21 11 11 0f 13 1e 13 11 30 1f 21 20 13 |. 0!.......0.! .|
00000090 11 30 1f 14 1e 30 14 1e 11 11 11 1e 13 11 1e 14 |.0...0..........|

The A2R format uses a chunk system to store the various pieces of the format. Earlier versions used a STRM chunk to store all the raw flux data; version 3 changed to a RWCP chunk. Applesauce uses a 2-pass imaging process, doing a rapid imaging to determine where on the media surface track data exists and then a second pass that captures longer durations for processing and error correction.

Once the full raw flux data has been captured, that data can be interpreted as a disk image. The Applesauce software is able to make a regular disk image, a Disk Copy 4.2 file, which is well known and identified in PRONOM as fmt/625, but it can also create a couple of special disk image formats which allow for the special nuances of an original disk.

The WOZ Disk Image format is an offshoot of the Applesauce project. Capturing highly accurate bit data is of no use if you don’t have a container to hold the data. The WOZ format was designed to be able to contain every possible Apple ][ disk structure and layout. It can be so accurate that even copy protected software can’t tell that it isn’t an original disk.

The WOZ format has become very popular in the Apple II community and is ideal for emulating all the old games and software titles popular in the early 1980’s. You may have guessed where the name comes from. The Internet Archive has a large collection of WOZ disks in their WOZ-a-Day collection. The file format of a WOZ disk image is also a chunk-based format, similar to the A2R format, and it has two versions. Let’s take a look.

hexdump -C WOZ 1.0/Blazing Paddles (Baudville).woz | head
00000000 57 4f 5a 31 ff 0a 0d 0a f6 f5 92 d6 49 4e 46 4f |WOZ1........INFO|
00000010 3c 00 00 00 01 01 00 01 01 41 70 70 6c 65 73 61 |<........Applesa|
00000020 75 63 65 20 76 30 2e 32 36 20 20 20 20 20 20 20 |uce v0.26 |
00000030 20 20 20 20 20 20 20 20 20 00 00 00 00 00 00 00 | .......|
00000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000050 54 4d 41 50 a0 00 00 00 00 00 ff 01 01 01 ff 02 |TMAP............|
00000060 02 02 ff 03 03 03 ff 04 04 04 ff 05 05 05 ff 06 |................|
00000070 06 06 ff 07 07 07 ff 08 08 08 ff 09 09 09 ff 0a |................|
00000080 0a 0a ff 0b 0b 0b ff 0c 0c 0c ff 0d 0d 0d ff 0e |................|
00000090 0e 0e ff 0f 0f 0f ff 10 10 10 ff 11 11 11 ff 12 |................|

hexdump -C WOZ 2.0/Blazing Paddles (Baudville).woz | head
00000000 57 4f 5a 32 ff 0a 0d 0a 21 da c2 c8 49 4e 46 4f |WOZ2....!...INFO|
00000010 3c 00 00 00 02 01 00 01 01 41 70 70 6c 65 73 61 |<........Applesa|
00000020 75 63 65 20 76 31 2e 31 20 20 20 20 20 20 20 20 |uce v1.1 |
00000030 20 20 20 20 20 20 20 20 20 01 01 20 00 00 00 00 | .. ....|
00000040 0d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000050 54 4d 41 50 a0 00 00 00 00 00 ff 01 01 01 ff 02 |TMAP............|
00000060 02 02 ff 03 03 03 ff 04 04 04 ff 05 05 05 ff 06 |................|
00000070 06 06 ff 07 07 07 ff 08 08 08 ff 09 09 09 ff 0a |................|
00000080 0a 0a ff 0b 0b 0b ff 0c 0c 0c ff 0d 0d 0d ff 0e |................|
00000090 0e 0e ff 0f 0f 0f ff 10 10 10 ff 11 11 11 ff 12 |................|

Unlike a common disk image, a WOZ image contains more than the bits on the disk: it contains a mapping of all the tracks and the associated data, which is how it can even contain copy protection usually only possible with a physical disk. The ‘TMAP’ chunk contains a track map and the ‘TRKS’ chunk contains all the data.
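
Walking those chunks takes only a few lines; this sketch assumes the layout visible in the dumps above (a 12-byte header of magic, FF 0A 0D 0A and a CRC32, then 4-byte chunk IDs with little-endian 32-bit sizes) and simply lists each chunk.

import struct

def woz_chunks(path):
    with open(path, "rb") as f:
        header = f.read(12)                    # magic, FF 0A 0D 0A, CRC32
        if header[:4] not in (b"WOZ1", b"WOZ2"):
            raise ValueError("not a WOZ image")
        while True:
            chunk_header = f.read(8)
            if len(chunk_header) < 8:
                break
            chunk_id, size = chunk_header[:4], struct.unpack("<I", chunk_header[4:])[0]
            yield chunk_id.decode("ascii"), size
            f.seek(size, 1)                    # skip the chunk body

for chunk_id, size in woz_chunks("WOZ 2.0/Blazing Paddles (Baudville).woz"):
    print(chunk_id, size)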

What WOZ is for the Apple II, MOOF is for the Macintosh. You may wonder what is with the funny name, but there is a long history around “Clarus the Dogcow”. I’m sure this factoid will help you impress your friends or win at trivia night. Again, the purpose of the special format for Macintosh disks is to allow for emulating disks, even with copy protection. You can also find quite the collection of old Macintosh software in the MOOF format on the Internet Archive, and even emulate your favorite game, such as Dark Castle, which I played for hours as a kid. It is also a chunk-based format; let’s take a look at the header.

hexdump -C Dark Castle v1.0 - Disk 1.moof | head
00000000 4d 4f 4f 46 ff 0a 0d 0a b5 75 f9 4e 49 4e 46 4f |MOOF.....u.NINFO|
00000010 3c 00 00 00 01 01 00 01 10 41 70 70 6c 65 73 61 |<........Applesa|
00000020 75 63 65 20 76 31 2e 37 33 20 20 20 20 20 20 20 |uce v1.73 |
00000030 20 20 20 20 20 20 20 20 20 00 13 00 00 00 00 00 | .......|
00000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000050 54 4d 41 50 a0 00 00 00 00 ff 01 ff 02 ff 03 ff |TMAP............|
00000060 04 ff 05 ff 06 ff 07 ff 08 ff 09 ff 0a ff 0b ff |................|
00000070 0c ff 0d ff 0e ff 0f ff 10 ff 11 ff 12 ff 13 ff |................|
00000080 14 ff 15 ff 16 ff 17 ff 18 ff 19 ff 1a ff 1b ff |................|
00000090 1c ff 1d ff 1e ff 1f ff 20 ff 21 ff 22 ff 23 ff |........ .!.".#.|

All three formats created for imaging and emulating Apple and Macintosh software are well documented and open. They are also well suited for preservation, as they can contain extensive metadata in the INFO chunk which gives provenance information on the source of the files. The Applesauce software even has a camera to photograph the disk itself for archiving. All of this makes these formats great for preservation and emulation. Take a look at my proposal for a signature on my GitHub.

ePic

Image compression has been around for a while. It seems everyone took a crack at making better algorithms to improve quality and size. Some chose to invent new ways and others chose to use existing methods, but with their own flare. Kodak tried this with their PhotoCD, but there were a couple of other photo processing options that popped up in the 90’s. One was Seattle FilmWorks and another was Konica PC PictureShow. Both used “proprietary” formats to deliver developed film on disk.

Seattle FilmWorks, later called PhotoWorks, used an image format with the extension SFW that was based on BMP and JPG, but with their own twist. The same goes for the format used by Konica’s PC PictureShow.

Konica PC PictureShow Disk

If you took your film in to be developed at one of Konica’s photo labs, you could have those images put on a diskette or, later, a CD-R. The disks came with software to view your photos called PC PictureShow. The images stored on disk were in another proprietary format with the extension KQP. The KQP format was actually licensed from another company called Pegasus Imaging Corporation, later known as Accusoft. They developed their own way to compress a JPEG file, which they called ePic. An SDK called PICTools was offered for many years, but seems not to be available anymore.

ePIC (Proprietary)
  • Supports PIC format compression, replacing the JPEG Huffman encoder with the proprietary ELS entropy encoder for 15% more compression.
  • Can be losslessly converted back to JPEG format using Op_RORE.

A search on the internet for Konica KQP shows quite a few people over the years wondering what to do with their old disks and how to convert the old format to JPG, only to find a lack of information and available tools to do so. One such person used Python to edit the file, making it renderable as a JPG. While the method worked well for their KQP files, it might not work for all of them. Let’s look closer and understand why.

hexdump -C Sample.PIC | head
00000000 42 4d 00 00 00 00 00 00 00 00 42 04 00 00 44 00 |BM........B...D.|
00000010 00 00 34 08 00 00 24 fa ff ff 01 00 18 00 4a 50 |..4...$.......JP|
00000020 45 47 00 00 00 00 00 00 00 00 00 00 00 00 fc 00 |EG..............|
00000030 00 00 ec 00 00 00 2c 00 00 00 18 00 00 00 00 00 |......,.........|
00000040 00 00 02 00 00 00 08 00 00 00 01 00 00 00 01 00 |................|
00000050 00 00 60 00 00 00 00 00 60 00 00 60 00 00 00 00 |..`.....`..`....|
00000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

At first glance the file appears to be a Bitmap (BMP), and it does have a Bitmap header claiming JPEG compression, but ImageMagick is not convinced, and if we look a little further into the file we can see why.

identify -verbose Sample.PIC   
identify: length and filesize do not match `Sample.PIC' @ error/bmp.c/ReadBMPImage/950.
identify: unrecognized compression `Sample.PIC' @ error/bmp.c/ReadBMPImage/1019.

hexdump -C Sample.PIC
00000000 42 4d 00 00 00 00 00 00 00 00 42 04 00 00 44 00 |BM........B...D.|
00000010 00 00 34 08 00 00 24 fa ff ff 01 00 18 00 4a 50 |..4...$.......JP|
00000020 45 47 00 00 00 00 00 00 00 00 00 00 00 00 fc 00 |EG..............|
00000030 00 00 ec 00 00 00 2c 00 00 00 18 00 00 00 00 00 |......,.........|
00000040 00 00 02 00 00 00 08 00 00 00 01 00 00 00 01 00 |................|
00000050 00 00 60 00 00 00 00 00 60 00 00 60 00 00 00 00 |..`.....`..`....|
00000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000400 00 00 60 00 00 00 00 00 60 00 00 60 00 00 00 00 |..`.....`..`....|
00000410 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000440 00 00 ff d8 ff e0 00 10 4a 46 49 46 00 01 02 02 |........JFIF....|
00000450 00 00 00 00 00 00 ff e1 00 0a 50 49 43 00 01 19 |..........PIC...|
00000460 1e 01 ff c0 00 11 08 05 dc 08 34 03 01 11 00 02 |..........4.....|

We find a JPG marker; in fact, almost the whole JPG file is included, except the quantization tables for luminance and chrominance, which are needed to properly display the image. This is the area the Pegasus company thought they could encode better to further compress the image. Their method was to use a new algorithm called ELS (Entropy Logarithmic-Scale). This new method was used by the PICTools software to make a Pegasus PIC file, while Konica used it for their KQP format; the two are identical. By choosing the luminance and chrominance values during compression, you could make a highly compressed image, but it required specific software to render.

Pegasus also made use of a special custom APP marker (PIC) within the JPEG structure of the PIC/KQP, and also within any JPG compressed using their software. This marker, which takes up around 8 bytes, holds the luminance and chrominance values. Take the above sample for instance: it compresses the image with a luminance of 25 and a chrominance of 30; these are integer values and in hex they would be “19” and “1E” respectively.

hexdump -C Sample.PIC      
00000440 00 00 ff d8 ff e0 00 10 4a 46 49 46 00 01 02 02 |........JFIF....|
00000450 00 00 00 00 00 00 ff e1 00 0a 50 49 43 00 01 19 |..........PIC...|
00000460 1e 01 ff c0 00 11 08 05 dc 08 34 03 01 11 00 02 |..........4.....|
00000470 11 01 03 11 01 ff c4 00 51 00 01 00 03 01 00 00 |........Q.......|

So in theory one could strip out any part of the file before the JPG beginning-of-file magic bytes (FF D8 FF E0), locate the APP marker, use the values to generate the two quantization tables, insert them in the appropriate spot, and save out a JPG file.
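
Here is a rough Python sketch of the first part of that idea, locating the embedded JPEG and reading the two marker values; the exact layout of the PIC marker is an assumption drawn from this one sample, and generating the quantization tables themselves would still require Pegasus’ algorithm.

def read_pic_marker(path):
    data = open(path, "rb").read()
    soi = data.find(b"\xff\xd8\xff\xe0")       # start of the embedded JPEG
    if soi < 0:
        return None
    app = data.find(b"\xff\xe1", soi)          # the custom APP marker
    if app < 0 or data[app + 4:app + 7] != b"PIC":
        return None
    luminance, chrominance = data[app + 9], data[app + 10]
    return soi, luminance, chrominance

print(read_pic_marker("Sample.PIC"))           # the sample above yields luminance 25, chrominance 30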

This may be the case for the first few versions of the ePic format, but later versions got more complicated. It seems a “PIC2” version replaced the earlier versions and this format is a little more complicated.

hexdump -C Sample.KQP | head
00000000 50 49 43 32 01 08 00 00 00 64 00 01 00 b9 3e 00 |PIC2.....d....>.|
00000010 00 05 08 00 00 00 4a 50 47 45 03 00 00 00 16 24 |......JPGE.....$|
00000020 00 00 00 43 6f 6d 70 72 65 73 73 69 6f 6e 20 62 |...Compression b|
00000030 79 20 50 65 67 61 73 75 73 20 49 6d 61 67 69 6e |y Pegasus Imagin|
00000040 67 20 43 6f 72 70 2e 06 68 3e 00 00 ff d8 ff e0 |g Corp..h>......|
00000050 00 10 4a 46 49 46 00 01 01 00 00 01 00 01 00 00 |..JFIF..........|
00000060 ff e1 00 16 50 49 43 00 03 00 00 01 00 00 00 00 |....PIC.........|
00000070 00 00 00 00 00 00 00 00 ff db 00 84 00 0f 0a 0a |................|
00000080 0a 0a 06 0f 0a 0a 0a 0f 0f 0f 0f 14 1e 14 14 14 |................|
00000090 14 14 28 1e 1e 19 1e 2d 28 32 32 2d 28 2d 2d 32 |..(....-(22-(--2|

Instead of the Bitmap (BMP) header, a proprietary PIC2 header is used, still containing a JPG in the JFIF format along with the PIC APP marker, but encoded in a way that the simple method of adding a quantization table may not work. With the original format the JPG and the PIC/KQP were approximately the same size; this new version significantly reduces the size of the PIC/KQP in comparison with the JPG.

The ELS compression technology used in the ePic format seems to be patented by Pegasus and Accusoft, but it is not entirely hidden, as the libavcodec library includes an ELS decoder. It might be a fun project to use that code to decode the PIC/KQP formats fully.

In the meantime, a signature identifying the two versions should be added to PRONOM. Check out my proposal on my GitHub. If you need to convert your KQP or PIC files back to JPG here are a few links:

Konica PC PictureShow Version 4 (PIC2)

Accusoft PICTools Apollo Demo (Windows 7 Compatible)

Konica PC PictureShow for Macintosh

FASTA & FASTQ

There seems to be a never-ending source of file formats out there. Documenting past obsolete formats, one would assume there is a point at which there are no more to find, but in reality more are re-discovered every day by the digital preservation community. When it comes to more modern formats, it seems more are invented every day, too many to keep up with for identification. Document one, and 10 more pop up; it seems never-ending. Such is the case for scientific formats, including sequencing formats.

I was speaking with a colleague from another institution the other day and a file format was mentioned I hadn’t heard of before. It seems much of their scientific data was stored in a format called FASTA, “Fast A” (pronounced “fast-aye”). This format specifically stores DNA sequence data and is used quite a bit, it seems. I was even more surprised the next day when I went to process some new submissions for our repository, only to find one submission contained three FASTA files. I love researching file formats, but sometimes in order to understand the format structure you have to know something about the content as well. Let’s explore the FASTA and FASTQ file formats. If you would like to take a peek at the Human Genome in FASTA, go here.

Both the FASTA and FASTQ formats are text based and have a simple structure. Identification of each of these should be pretty simple, but to avoid conflicts with other formats, the signature might have to be more complex.

The FASTA format is well documented as many in the scientific community use it. Basically the format starts with the greater than “>” character followed by a description, a new line character, then the sequence. For example:

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

Pretty straightforward, but so much of the format can be variable that a simple signature would clash with too many other formats. There are some rules about which characters can be used in the sequence, so it might be possible to limit the signature to only allow certain characters. At first I thought it might only contain the standard characters representing adenine (A), cytosine (C), guanine (G), and thymine (T), but as it turns out the FASTA format can contain Nucleic Acid Codes and Amino Acid Codes. These codes allow more than the four I was expecting, but do limit what can be represented.

Take the NCBI Sequence Viewer for a spin and download some data as FASTA.

The FASTQ format adds more structure and is more limiting, but also presents some challenges. Here is a sample of its structure:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Instead of a greater-than symbol, the FASTQ format uses an “@” symbol followed by an identifier. The identifier can be basically anything and as long as needed. There is a newline character followed by the DNA sequence, which uses only the four characters I had heard of before, plus one: it can contain an A, C, G, T, or N. The “N” can represent an unidentified nucleotide or indicate that the software was unable to make a basecall. A newline character again and the “+” symbol, which is placed before the fourth line, a quality score string with the same number of characters as the sequence.
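
A structural check following that description is straightforward to sketch in Python; real-world FASTQ files hold many records and wrapped lines, so treat this as a rough illustration rather than a validator.

NUCLEOTIDES = set("ACGTN")

def looks_like_fastq(text):
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 4 or not lines[0].startswith("@"):
        return False
    seq, plus, qual = lines[1], lines[2], lines[3]
    # Sequence limited to A/C/G/T/N, '+' separator, quality same length as sequence.
    return set(seq.upper()) <= NUCLEOTIDES and plus.startswith("+") and len(qual) == len(seq)

sample = ("@SEQ_ID\n"
          "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
          "+\n"
          "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n")
print(looks_like_fastq(sample))                # True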

See what I mean when you have to learn about the context of a format in order to make a proper signature!

One of the problems I am left with is how to determine how many of the sequence characters to use in the signature to avoid any conflicts. Too few and it might conflict with another format or a simple text file; too many and the signature gets complicated and may exclude a short sequence file. As far as I can tell there is no set minimum or maximum for the sequence. I am not sure what the genome for Pinus taeda would look like in FASTA, with 22.18 billion base pairs. The other problem is that oftentimes these formats are compressed into a GZIP file, so they need to be extracted before identification.

These two formats are just a couple of the many sequencing formats being used in the bioinformatics community. I am sure others will pop up in the future. Until then, I have, with the help of others, put together a signature which seems to work well for the samples and data sets we have access to. Take a look at my GitHub for the signature proposal. If you find any issues, let me know!

SDIF

I have used and researched a lot of audio editing software. Some are very simple and straightforward; others are feature-rich and take some time to learn. While looking into a format, I came across some audio software which is nothing like anything I have used before. At first I was confused; I figured it would be simple to open a certain file format and play the audio. Not so fast.

Max is software which proudly says it is an “infinitely flexible space to create your own interactive software”. Created by Cycling ’74, Max has been around for a while, having been developed in the mid 1980’s. It allows the user to make “patches”, stringing together components and effects to accomplish an infinite number of options and outcomes.

The software produces simple project files and patch files, but they are just JSON data, at least in the latest version. But when working with audio files the software can save to a number of formats.

One of the options is a format called “SDIF”, which stands for “Sound Description Interchange Format“. SDIF was jointly developed by IRCAM and CNMAT, with proposals starting back in the mid-1990’s. Originally written as a Spectral Description, it was later changed to refer to a Sound Description.

The Specification states the general idea was to “store information related to signal processing and specifically of sound, in files, according to a common format to all data types. Thus, it is possible to store results or parameters of analyses, syntheses…” So this is not exactly the same as a simple WAVE file you can open and edit; this format was meant to store signal data for analysis.

Each SDIF file consists of a header and then a succession of frames, not unlike chunks in the IFF/AIFF/RIFF formats, ordered in time. Each frame matrix declares a “Type”, which can be a combination of many options. Let’s take a look at an SDIF file:

hexdump -C test.sdif | head
00000000 53 44 49 46 00 00 00 08 00 00 00 03 00 00 00 01 |SDIF............|
00000010 31 54 52 43 00 00 00 20 00 00 00 00 00 00 00 00 |1TRC... ........|
00000020 00 00 00 01 00 00 00 01 31 54 52 43 00 00 00 04 |........1TRC....|
00000030 00 00 00 00 00 00 00 04 31 54 52 43 00 00 00 c0 |........1TRC....|
00000040 3f 74 7a e1 40 00 00 00 00 00 00 01 00 00 00 01 |?tz.@...........|
00000050 31 54 52 43 00 00 00 04 00 00 00 0a 00 00 00 04 |1TRC............|
00000060 3f 80 00 00 45 95 35 c3 00 00 00 00 00 00 00 00 |?...E.5.........|
00000070 40 00 00 00 46 06 e2 14 00 00 00 00 00 00 00 00 |@...F...........|
00000080 40 40 00 00 45 3b 42 3d 00 00 00 00 00 00 00 00 |@@..E;B=........|
00000090 40 80 00 00 43 5d 94 7b 00 00 00 00 00 00 00 00 |@...C].{........|

This test file has the opening frame “SDIF”, to identify it as an SDIF, then a reference to the type “1TRC”. I would try to explain a Matrix 1TRC Sinusoidal Track, but I have no idea what it means. Something, something sine wave, etc. Someone much smarter than me can make use of this format. Here are a couple of examples of SDIF with other frame types.

hexdump -C angry_cat.part.sdif| head
00000000 53 44 49 46 00 00 00 08 00 00 00 03 00 00 00 01 |SDIF............|
00000010 31 4e 56 54 00 00 00 88 ff ef ff ff ff ff ff ff |1NVT............|
00000020 ff ff ff fd 00 00 00 01 31 4e 56 54 00 00 03 01 |........1NVT....|
00000030 00 00 00 61 00 00 00 01 53 74 72 65 61 6d 49 44 |...a....StreamID|
00000040 09 30 0a 44 61 74 65 09 54 68 75 5f 41 75 67 5f |.0.Date.Thu_Aug_|
00000050 5f 33 5f 32 31 2e 33 32 2e 34 35 5f 32 30 30 30 |_3_21.32.45_2000|
00000060 5f 0a 54 61 62 6c 65 4e 61 6d 65 09 53 69 6e 75 |_.TableName.Sinu|
00000070 73 6f 69 64 61 6c 54 72 61 63 6b 73 0a 57 72 69 |soidalTracks.Wri|
00000080 74 74 65 6e 42 79 09 50 6d 5f 56 65 72 73 69 6f |ttenBy.Pm_Versio|
00000090 6e 5f 31 2e 32 2e 32 0a 00 00 00 00 00 00 00 00 |n_1.2.2.........|

hexdump -C cymbalum-c4.res.sdif| head
00000000 53 44 49 46 00 00 00 08 00 00 00 03 00 00 00 01 |SDIF............|
00000010 31 52 45 53 00 00 0d 20 00 00 00 00 00 00 00 00 |1RES... ........|
00000020 00 00 00 04 00 00 00 01 31 52 45 53 00 00 00 04 |........1RES....|
00000030 00 00 00 d0 00 00 00 04 42 49 27 7a 39 59 fc ab |........BI'z9Y..|
00000040 3d 35 06 c9 00 00 00 00 42 6e 68 68 39 63 99 b1 |=5......Bnhh9c..|
00000050 3e 25 f7 c0 00 00 00 00 42 c6 02 bb 39 8c 31 79 |>%......B...9.1y|
00000060 3f bb 7e 6e 00 00 00 00 43 01 82 96 3a 1d 36 44 |?.~n....C...:.6D|
00000070 3e d9 21 12 00 00 00 00 43 07 35 f0 3a 20 6f 6e |>.!.....C.5.: on|
00000080 3f 02 32 7f 00 00 00 00 43 30 84 0b 39 97 f9 1b |?.2.....C0..9...|
00000090 3e c6 43 c7 00 00 00 00 43 4d e4 e4 39 88 14 90 |>.C.....CM..9...|

Unfortunately, the common tools I use to explore AV formats don’t seem to work on this format. MediaInfo, FFProbe, Exiftool, all give me unknown file warnings. So I had to compile the SDIF software in order to get some details.

querysdif angry_cat.part.sdif 
Header info of file angry_cat.part.sdif:

Format version: 3
Types version: 1

Ascii chunks of file angry_cat.part.sdif:

1NVT
{
StreamID 0;
Date Thu_Aug__3_21.32.45_2000_;
TableName SinusoidalTracks;
WrittenBy Pm_Version_1.2.2;
}

Data in file angry_cat.part.sdif (9504872 bytes):
1933 1TRC frames in stream 0 between time 0.000000 and 5.794875 containing
1933 1TRC matrices with 45 --400 rows, 4 -- 4 columns
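
The opening frame itself is simple enough to read without the SDIF tools. Here is a minimal sketch based on the dumps above: the “SDIF” signature, a big-endian frame size of 8, then the specification and types versions that querysdif reported.

import struct

def sdif_header(path):
    with open(path, "rb") as f:
        sig, size, spec, types = struct.unpack(">4sIII", f.read(16))
    if sig != b"SDIF":
        raise ValueError("not an SDIF file")
    return {"frame_size": size, "spec_version": spec, "types_version": types}

print(sdif_header("angry_cat.part.sdif"))      # {'frame_size': 8, 'spec_version': 3, 'types_version': 1}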

An interesting thing is that an SDIF file can be represented in text form as well.

sdiftotext test.sdif 
SDIF


SDFC

1TRC 1 1 0
1TRC 0x0004 0 4

1TRC 1 1 0.005
1TRC 0x0004 10 4
1 4774.72 0 0
2 8632.52 0 0
3 2996.14 0 0
4 221.58 0 0
5 1943.02 0 0
6 123.951 0 0
7 6705.04 0 0
8 4304.97 0 0
9 3554.29 0 0
10 23.7822 0 0

1TRC 1 1 0.01
1TRC 0x0004 10 4
1 4774.72 0.0353114 2.06098
2 8632.52 0.00442518 0.68795
3 2996.14 0.0238517 -1.42295
4 221.58 0.0089712 -2.44141
5 1943.02 0.00768914 2.64629
6 123.951 0.0397061 -0.17527
7 6705.04 0.0245643 -0.168753
8 4304.97 0.00894803 1.45553
9 3554.29 0.0265175 2.57231
10 23.7822 0.0419019 -2.17731

1TRC 1 1 0.2
1TRC 0x0004 10 4
1 2284.56 0.02781 2.47054
2 4222.62 0.0151738 1.55309
3 31.1554 0.00421461 -0.657285
4 310.99 0.0122306 1.25794
5 215.192 0.0174093 1.25468
6 6253.69 0.000894192 2.21334
7 8533.32 0.0296167 2.07209
8 8044.77 0.0423002 2.54088
9 6087.45 0.0264733 -2.05523
10 7052.7 0.0287347 0.426339

1TRC 1 1 0.205
1TRC 0x0004 10 4
1 2284.56 0 0
2 4222.62 0 0
3 31.1554 0 0
4 310.99 0 0
5 215.192 0 0
6 6253.69 0 0
7 8533.32 0 0
8 8044.77 0 0
9 6087.45 0 0
10 7052.7 0 0

1TRC 1 1 0.21
1TRC 0x0004 0 4

ENDC
ENDF

An interesting format for sure. But wait, there is more!

My initial interest in this format came when I was given access to a set of MUBU files. I was unclear on how they were created at first, and it took me down a long path of learning about SDIF and the Max software from Cycling ’74 and IRCAM. MUBU turns out to be a toolbox for Max which adds more analysis features.

MUBU stands for MUlti-BUffer, which helps overcome some limitations. It is actually a container using the SDIF standard. Let’s take a look.

hexdump -C test.mubu | head
00000000 53 44 49 46 00 00 00 08 00 00 00 03 00 00 00 01 |SDIF............|
00000010 31 4e 56 54 00 00 00 78 ff ef ff ff ff ff ff ff |1NVT...x........|
00000020 ff ff ff fd 00 00 00 01 31 4e 56 54 00 00 03 01 |........1NVT....|
00000030 00 00 00 53 00 00 00 01 4d 75 42 75 2e 43 6f 6e |...S....MuBu.Con|
00000040 74 61 69 6e 65 72 2e 4e 75 6d 54 72 61 63 6b 73 |tainer.NumTracks|
00000050 09 31 0a 4d 75 42 75 2e 43 6f 6e 74 61 69 6e 65 |.1.MuBu.Containe|
00000060 72 2e 56 65 72 73 69 6f 6e 09 31 2e 35 0a 4d 75 |r.Version.1.5.Mu|
00000070 42 75 2e 43 6f 6e 74 61 69 6e 65 72 2e 4e 75 6d |Bu.Container.Num|
00000080 42 75 66 66 65 72 73 09 31 0a 00 00 00 00 00 00 |Buffers.1.......|
00000090 31 4e 56 54 00 00 00 38 ff ef ff ff ff ff ff ff |1NVT...8........|

A MUBU file has the same SDIF frame header, but it also includes a “1NVT” frame, which is a Name Value Table. This is where the MuBu container is referenced. The MuBu file has its own structure.

If I query the MuBu file like I did the SDIF, I get the following:

querysdif test.mubu
Header info of file test.mubu:

Format version: 3
Types version: 1

Ascii chunks of file test.mubu:

1NVT
{
MuBu.Container.NumTracks 1;
MuBu.Container.Version 1.5;
MuBu.Container.NumBuffers 1;
}
1NVT
{
MuBu.Buffer.Index 0;
}
1NVT
{
MuBu.Track.MxRows 2;
AudioFile 1;
MuBu.Track.NonNumType 0;
MuBu.Track.MaxSize 93515;
meta_ISFT Lavf60.16.100;
MuBu.Track.Name mytrack;
MuBu.Track.BufferIndex 0;
MuBu.Track.SampleRate 48000;
FileName Wilhelm_Scream.wav;
MuBu.Track.MxVarRows 0;
MuBu.Track.MxCols 1;
meta_MetaDataSource WAV;
MuBu.Track.EndTime 1623.5;
FilePath /;
MuBu.Track.SampleOffset 0;
MuBu.Track.TimeTags 0;
MuBu.Track.Size 77929;
MuBu.Track.Index 0;
}

1TYP
{
1MTD M000 {unnamed}
1FTD M000
{
M000 Track-0-MatrixData;
}
}

Data in file test.mubu (3741392 bytes):
77929 M000 frames in stream 0 between time 0.000000 and 1.623500 containing
77929 M000 matrices with 2 -- 2 rows, 1 -- 1 columns

The MuBu file contains one audio track and one buffer. This is a simple test file, but MuBu files can be quite large, with multiple tracks.
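
Telling a MuBu container apart from a plain SDIF file can be sketched by checking the SDIF signature and then scanning the opening bytes for the MuBu.Container entries seen in the first 1NVT table; this is an assumption drawn from my small set of samples.

def is_mubu(path, window=4096):
    with open(path, "rb") as f:
        head = f.read(window)
    # A MuBu container is an SDIF file whose first name-value table names MuBu.Container.
    return head.startswith(b"SDIF") and b"MuBu.Container" in head

for name in ("test.sdif", "test.mubu"):
    print(name, "->", "MuBu container" if is_mubu(name) else "plain SDIF")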

Working with the Max software or OpenMusic is not something I found easy to understand. I am sure if I were more musically inclined, and with a little practice, I could make some of this work. For the time being, a signature to identify SDIF and MUBU will have to do. Check out the GitHub for my proposed signature and a couple of examples.

Sibelius

Music notation software is among the earliest software for desktop computers. SCORE appeared in 1987, Finale in 1988, Capella in 1992, and Sibelius in 1993. Many others came and went during this time. Music notation software was so much more than the typical word processing or desktop publishing system: specialized fonts were needed to display the music notation, and there are many other variables for different instruments, allowing individuals to create complicated compositions inexpensively.

Sibelius [SI] + [BAY] + [LEE] + [UHS] was originally developed for the Acorn system in 1986, then released on Windows and Macintosh in 1998-99. The software became very popular, and in 2006 it was purchased by the software giant AVID. The software was used enough to get a preservation assessment by the British Library in 2017 and a draft-status format description by the Library of Congress, written by the amazing Ashley!

Both reviews of the format emphasize the proprietary nature of the file format which has been used since the early versions. Aside from the early Acorn release, the Windows and Macintosh versions used a binary format with the SIB extension. They are actually quite easy to identify.

hexdump -C Sibelius-s01.sib | head
00000000  0f 53 49 42 45 4c 49 55  53 00 00 40 00 02 00 a4  |.SIBELIUS..@....|
00000010  f1 ed 00 00 00 30 00 00  00 02 00 00 00 01 00 00  |.....0..........|
00000020  00 2a 00 00 00 00 00 00  00 00 0f 53 49 42 45 4c  |.*.........SIBEL|
00000030  49 55 53 00 00 40 00 02  00 00 00 00 00 00 00 3a  |IUS..@.........:|
00000040  00 00 00 00 38 a1 28 06  b3 d2 2f 66 03 04 16 4e  |....8.(.../f...N|
00000050  5f 5c 8d f3 95 27 3e f1  2a 1b 68 de 08 81 e8 9a  |_\...'>.*.h.....|
00000060  ea 1c bf dd 54 0e 92 8d  4d be e3 34 ed 42 78 36  |....T...M..4.Bx6|
00000070  d2 e1 67 7b 8d f7 98 6a  3a 70 c4 8b 0b 08 7b 26  |..g{...j:p....{&|
00000080  f9 45 00 00 00 00 48 71  7c 4c 98 df 0b 38 7d 9d  |.E....Hq|L...8}.|
00000090  2a 2d 84 9c a4 39 0f 4d  da a2 cc 97 ad 3d b0 55  |*-...9.M.....=.U|

This is exactly how PRONOM and other identification methods determine they are Sibelius files. PRONOM has assigned the format fmt/696 and is looking for the hexadecimal bytes 0F534942454C495553.

The problem with this identification method is that all Sibelius files are identified as such, regardless of version. As mentioned by Ashley, the version of the software used is highly important, as new features were added all the time, making backwards compatibility difficult. Add in the fact that there were different releases for each version which would limit these features even more, and I can see how a musician could get very frustrated. If you created a score in Sibelius 5 and tried to open it in the Sibelius 5 Student version, you might find your composition lacking in many ways. The only way to avoid compatibility issues is to always open in the latest “Ultimate” version. Sibelius Ultimate can open all versions of the SIB format back to version 2. The software even has an export feature which allows you to export back to a previous version, stripping what is necessary to ensure compatibility.

Sibelius export to previous version

For those with a bunch of SIB files in their archives, how would you know which software version created the file? Well, let’s take a closer look at the bytes and see if we can find some patterns. Let it be known, I am not reverse engineering the format, just looking for patterns which will allow for proper identification!

I am not the first person to ask this question; many others want to know the versions of their SIB files. Thankfully, some have found clues as to which bytes hold the version information. It seems we can determine the version based on 4 bytes shortly after the SIBELIUS string, specifically bytes 10-13.

hexdump -C Sibelius2-s01.sib | head
00000000  0f 53 49 42 45 4c 49 55  53 00 00 08 00 22 00 47  |.SIBELIUS....".G|
00000010  98 4c 00 00 00 3a 00 00  00 00 4e 81 49 34 41 2c  |.L...:....N.I4A,|
00000020  fa 76 62 f9 71 53 a9 93  0f 54 1e 20 6c 63 61 4d  |.vb.qS...T. lcaM|
00000030  f7 b2 b0 a7 5d bd 82 3a  0d 86 02 8b f2 89 d2 a0  |....]..:........|
00000040  83 1f 8d e0 37 1b ed 1c  6a 8b 82 08 4b 6d 64 60  |....7...j...Kmd`|
00000050  71 59 e8 aa ef b1 3c df  5c 25 0a 9f 66 50 69 de  |qY....<.\%..fPi.|
00000060  2a d3 4e 2a cd 97 88 06  67 5f 50 64 0f 8f 86 2b  |*.N*....g_Pd...+|
00000070  08 0d 3f f7 80 26 e0 63  f6 7d 4e f8 e7 c0 3f fc  |..?..&.c.}N...?.|
00000080  7a 77 ea b3 4a b9 30 59  13 47 6e 09 0a 0b ae 3c  |zw..J.0Y.Gn....<|
00000090  c1 93 85 f6 41 f8 58 22  4b 92 35 3f b2 f5 3f 9d  |....A.X"K.5?..?.|

From what others have gathered, updated with more recent versions, I have come up with a list.

Version: Hex bytes 10-13
Sibelius 1.2: 00 00 00 0E
Sibelius 2.x: 00 08 xx xx
Sibelius 3.x: 00 0A xx xx
Sibelius 4.x: 00 1B xx xx
Sibelius 5.0: 00 2D 00 03
Sibelius 5.1: 00 2D 00 0D
Sibelius 5.2.x – 5.4: 00 2D 00 10
Sibelius 6.0.x: 00 36 00 01
Sibelius 6.1: 00 36 00 17
Sibelius 6.2: 00 36 00 1E
Sibelius 7.0: 00 39 00 0C
Sibelius 7.0.1 – 7.0.2: 00 39 00 0E
Sibelius 7.0.3: 00 39 00 13
Sibelius 7.1.0: 00 39 00 15
Sibelius 7.1.2 – 7.1.3: 00 39 00 16
Sibelius 7.5.x: 00 3D 00 0E
Sibelius 8.0.0 – 8.0.1: 00 3D 00 10
Sibelius 8.1.x: 00 3E 00 00
Sibelius 8.2: 00 3E 00 01
Sibelius 8.3: 00 3E 00 02
Sibelius 8.4.x: 00 3E 00 06
Sibelius 8.5.x: 00 3E 00 07
Sibelius 8.6.x, 8.7.0, 8.7.1: 00 3F 00 00
Sibelius 8.7.2, 2018.1, 2018.4.x, 2018.5, 2018.6, 2018.7: 00 3F 00 01
Sibelius 2018.11, 2018.12: 00 3F 00 02
Sibelius 2019.1: 00 3F 00 04
Sibelius 2019.4.x, 2019.5, 2019.7, 2019.9: 00 3F 00 06
Sibelius 2019.12: 00 3F 00 07
Sibelius 8.6-2019.12: 00 3F 00 0A
Sibelius 2020.1: 00 3F 00 0B
Sibelius 2020.3, 2020.6: 00 40 00 01
Sibelius 2020.9: 00 40 00 02
Sibelius 2022.5: 00 40 00 03
Sibelius 2022.11: 00 41 00 02
Sibelius 2022.12: 00 42 00 00
Sibelius 2023.3: 00 42 00 01
Sibelius 2023.8: 00 43 00 07
Sibelius 2024.3.1: 00 44 00 01

That is a lot of versions, and I feel there may be some gaps that still need to be identified. It appears the first two bytes are the major version and the second pair is the minor version, although a few of the major version bytes span multiple software versions. With this chart, one could be very specific in identifying which Sibelius version wrote a file, but for archiving purposes it seems we can group many of these, capturing just the major version. The export screenshot above seems to break down the significant changes and group similar formats together, the biggest group being 8.6 through 2019.12. A comparison of the “Student” and “First” formats doesn’t show any obvious bytes which indicate as much, so for now they are all lumped together.
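
To make the table easier to apply, here is a minimal sketch of how the lookup could work. This is not part of any official tool; the magic bytes and the partial major-version mapping come straight from the table above, with the two “major” bytes treated as a single big-endian value purely for lookup purposes.

# Minimal sketch: report the Sibelius version bytes at offset 10-13.
# The mapping of the "major" pair (bytes 10-11) to a version family is
# partial and comes from the table above; anything else is unknown.
import sys

SIB_MAGIC = bytes.fromhex("0F534942454C495553")  # 0x0F followed by "SIBELIUS"

MAJOR_FAMILIES = {
    0x0000: "Sibelius 1.2",
    0x0008: "Sibelius 2.x",
    0x000A: "Sibelius 3.x",
    0x001B: "Sibelius 4.x",
    0x002D: "Sibelius 5.x",
    0x0036: "Sibelius 6.x",
    0x0039: "Sibelius 7.0 - 7.1",
    0x003D: "Sibelius 7.5 - 8.0",
    0x003E: "Sibelius 8.1 - 8.5",
    0x003F: "Sibelius 8.6 - 2020.1",
    0x0040: "Sibelius 2020.3 - 2022.5",
    0x0041: "Sibelius 2022.11",
    0x0042: "Sibelius 2022.12 - 2023.3",
    0x0043: "Sibelius 2023.8",
    0x0044: "Sibelius 2024.3.1",
}

def sib_version(path):
    with open(path, "rb") as f:
        header = f.read(14)
    if len(header) < 14 or not header.startswith(SIB_MAGIC):
        return "not a Sibelius file"
    major = int.from_bytes(header[10:12], "big")
    family = MAJOR_FAMILIES.get(major, "unknown Sibelius version")
    return f"{family} (bytes 10-13: {header[10:14].hex(' ').upper()})"

if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(name, "->", sib_version(name))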

There is one other similar format which needs to be mentioned. Sibelius Scorch was a product made to share scores online. It has since been replaced with Sibelius Cloud Publishing, but for a while it was the best way to share a score with others in a way that protected the original. I have no idea how they were made, but you could upload your score to sites like scorestreet.net and sibeliusmusic.com for sharing. Some SCO files appear to have a PDF embedded within them for proper printing.

hexdump -C smd_h_0000000000097761.sco | head
00000000  0f 43 43 53 43 4f 52 43  48 00 00 36 00 1e 00 c0  |.CCSCORCH..6....|
00000010  d4 55 00 00 00 30 00 00  00 01 00 00 00 01 00 00  |.U...0..........|
00000020  00 22 0f 43 43 53 43 4f  52 43 48 00 00 36 00 1e  |.".CCSCORCH..6..|
00000030  00 00 00 00 00 00 00 3a  00 00 00 00 03 56 11 b9  |.......:.....V..|
00000040  70 dc fe 90 50 48 30 df  eb 39 88 23 8e 88 78 bf  |p...PH0..9.#..x.|
00000050  da ab ab 5b e2 13 98 89  66 eb 94 67 8d 16 00 00  |...[....f..g....|
00000060  00 00 cf 6f 0c 67 85 ec  57 90 e5 c1 ea 8a eb 9f  |...o.g..W.......|
00000070  c8 13 d2 1d 75 bd a5 9f  eb b9 ef 1d 25 79 45 2c  |....u.......%yE,|
00000080  05 bb 74 41 e8 8f 27 6a  01 07 d0 f5 3b 17 ce 87  |..tA..'j....;...|
00000090  7b c2 82 d9 41 6b 82 2f  d8 b8 17 32 fa d3 59 05  |{...Ak./...2..Y.|

I am not sure of the best way to handle all the different versions within the PRONOM registry. I went ahead and made a few signatures based on the export dialog of Sibelius 2024. Even after combining a few together, that leaves us with 17 new PUIDs. Maybe further discussion can refine these down a bit more? Regardless, each file can be associated with a specific Sibelius version, making it easier to open and migrate if needed without fear of opening it in the wrong version. Take a look at some samples and my signatures on my GitHub page and let me know if there is a better way.

Shorten

I was recently going through some of my old CD-Rs and came across this fun 11-year-old memory.

I remember going to this 2003 Toad the Wet Sprocket concert in Salt Lake City with some friends. I had seen the band perform before, but this was the first time I was able to get a recording of the show. Normally, having a recording of a concert by a well-known band was a little shady, but some bands not only allow recording of their live concerts, they encourage it. There have been a few bands over the years with this philosophy; the one most have heard of is the Grateful Dead, and because of all the tape trading, the band’s numerous concerts will live on forever.

The scene of recording concerts is still alive and well, and if you are into recording and sharing, it is expected you share in a lossless audio format. The world of lossless audio is definitely a minority among those who listen to music daily. Most of us have been placated with the infinite playlists on services like Apple Music, Spotify, and Amazon Music. Most probably don’t care about owning music anymore, but for the few who consider themselves audiophiles, having a lossless audio file is the only choice.

When it comes to formats, there are a few lossless options to choose from, and they all come with some advantages as well as some downsides. WAV files contain the full PCM audio stream, and while internet bandwidth today can handle full uncompressed audio, it can still be beneficial to use some compression for archiving or sharing over the web.

The most common lossless format today is the Free Lossless Audio Codec, or FLAC, but there are also quite a few who like the Apple Lossless Audio Codec. Both offer many advantages, especially with metadata and cue sheets, and both can contain album cover art. But many years ago, another lossless format was most often used with bootleg recordings and audio sharing.

Shorten was one of the first lossless formats, developed by Tony Robinson in 1993 for SoftSound. It could cut the size of a typical 16-bit WAV file in half. It achieved this using Huffman coding, kind of the way a JPEG does, giving the patterns which occur most often the shortest codes. Today FLAC and ALAC have replaced this format and offer improved features and support. Many audio players have dropped support for Shorten, making it difficult to use this old format.

The Shorten format uses the .SHN extension. It is one of the formats listed on the Library of Congress Sustainability of Digital Formats site with the ID fdd000199, although a couple of links don’t appear to work, as the entry hasn’t been updated since 2011. Support for the format has ended, and many of the links found on various websites are now broken, usually referencing the etree wiki, much of which is archived on the Internet Archive.

Let’s take a look at what makes up a lossless compressed SHN file. A quick look at a sample header:

hexdump -C test.shn | head
00000000  61 6a 6b 67 02 fb b1 70  09 f9 25 59 52 a4 d1 a8  |ajkg...p..%YR...|
00000010  dd cf 85 5a 01 57 a0 d5  a8 b6 6b 6d d2 41 10 80  |...Z.W....km.A..|
00000020  40 20 10 18 04 0a 01 44  d6 40 20 11 0d 8c 0a 01  |@ .....D.@ .....|
00000030  04 80 44 20 16 4b 0d d2  c3 b8 f8 55 a0 11 80 59  |..D .K.....U...Y|
00000040  98 56 1d b1 79 51 9f 39  f1 12 d2 d3 75 5c cd 08  |.V..yQ.9....u\..|
00000050  06 25 68 6b 52 5e 9f 4c  39 cd c1 32 c4 0d a9 b7  |.%hkR^.L9..2....|
00000060  69 34 56 f0 96 fa 46 89  a2 6e 8c ba d5 d0 58 de  |i4V...F..n....X.|
00000070  f5 44 5b aa 61 82 c7 85  88 37 d6 ee cb ab 4e 44  |.D[.a....7....ND|
00000080  91 19 b7 38 d4 20 ae 98  98 d1 2c 4a 4e 88 dd 3e  |...8. ....,JN..>|
00000090  36 68 1b 59 a8 7d 84 23  76 0a 84 21 a1 cd 80 8e  |6h.Y.}.#v..!....|

The first four bytes seem to be consistent among my samples. It makes me wonder if the ASCII values have something to do with the author, Anthony (Tony) J. Robinson. In the source code for the shorten software, the file shorten.h defines the ASCII string “ajkg” as the magic header for the SHN format, and the same value is found in current FFmpeg code. The common tools, however, don’t have much to say about these files.

mediainfo test.shn 
General
Complete name                            : test.shn
Format                                   : Shorten
Format version                           : 2
File size                                : 3.17 MiB

Audio
Format                                   : Shorten
Compression mode                         : Lossless

ffprobe -i test.shn
Input #0, shn, from 'test.shn':
  Duration: N/A, start: 0.000000, bitrate: N/A
  Stream #0:0: Audio: shorten, 44100 Hz, 2 channels, s16p

Using the older SHNTOOL, we can get more information.

shntool info test.shn  
-------------------------------------------------------------------------------
File name:                    test.shn
Handled by:                   shn format module
Length:                       0:32.23
WAVE format:                  0x0001 (Microsoft PCM)
Channels:                     2
Bits/sample:                  16
Samples/sec:                  44100
Average bytes/sec:            176400
Rate (calculated):            176400
Block align:                  4
Header size:                  44 bytes
Data size:                    5697720 bytes
Chunk size:                   5697756 bytes
Total size (chunk size + 8):  5697764 bytes
Actual file size:             3325489
File is compressed:           yes
Compression ratio:            0.5836
CD-quality properties:
  CD quality:                 yes
  Cut on sector boundary:     no
  Sector misalignment:        1176 bytes
  Long enough to be burned:   yes
WAVE properties:
  Non-canonical header:       no
  Extra RIFF chunks:          no
Possible problems:
  File contains ID3v2 tag:    no
  Data chunk block-aligned:   yes
  Inconsistent header:        no
  File probably truncated:    unknown
  Junk appended to file:      unknown
  Odd data size has pad byte: n/a
Extra shn-specific info:
  Seekable:                   yes
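
Since the magic and what looks like a version byte sit right at the start of the file, a quick triage check is easy to sketch. This is only a rough guess at the layout on my part, assuming the single byte after the “ajkg” magic is the format version, as the 02 in the sample above and MediaInfo’s “Format version: 2” suggest.

# Rough sketch: flag Shorten (SHN) files by the "ajkg" magic and report
# the byte that follows it, which appears to be the format version.
import sys

SHN_MAGIC = b"ajkg"

def probe_shn(path):
    with open(path, "rb") as f:
        header = f.read(5)
    if len(header) < 5 or not header.startswith(SHN_MAGIC):
        return "not a Shorten file"
    return f"Shorten audio, version byte {header[4]}"

if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(name, "->", probe_shn(name))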

Many Shorten audio files are still out there in archives and on file-sharing sites, so even though the format isn’t used to create new files, it will be around for a while. My GitHub has my signature proposal and a couple of samples.

Canvas

When it comes to design software, there have been many options over the years, many released with a lot of hype and others disappearing not long after release. Few lasted long enough without being gobbled up by big names such as Adobe. One which did is Canvas by Deneba Systems.

First released in 1987, it is still available over at Canvas GFX. It’s amazing it was never bought by one of the big names (Adobe, Corel, Aldus, etc.); it remained under Deneba Systems until 2003, when it was bought by ACD Systems, which kept the Deneba Canvas name for a time. The later versions were not popular with everyone, and Mac support was dropped, but the software continued. A while back I was looking through a few of my old ZIP disks and found some software my father used in the mid-1980s. He had a copy of Canvas version 2 for Macintosh. At that time I was more interested in playing games on our family’s Macintosh 128k than using design software.

Over the years I have come across many Canvas documents. With each version released, changes were made to the file format used to store the drawings and artwork, and the extensions changed along with it. Some versions are easily identifiable and others have confusing structures. Let’s look into it.

Version                Platform       Extension  Description
Canvas 1-3 & artWORKS  Macintosh      none       no strong pattern
Canvas 3.5             Mac & Windows  CVS        Similar to v1-3
Canvas 5               Mac & Windows  CV5        CANVAS5 string
Canvas 6-8             Mac & Windows  CNV        CANVAS6 string
Canvas 9-X             Mac & Windows  CVX        Similar to 6-8
Canvas Draw            Mac            CVD        Different than others
Canvas Image File                     CVI        DAD5PROX

The first three versions of Canvas were Macintosh only, and in those early days there was no extension, just a Type/Creator pair telling the Finder how to open them. Deneba Systems used the Creator codes DAD2, DAD5, through DADX.

The first versions are quite frustrating. I have gathered samples from versions 2, 3, and 3.5, and from artWORKS version 1. Even with numerous samples, there are no patterns I can discern from them. I even reached out to the current Canvas X technical support for answers. They wanted to be helpful, but their answer didn’t offer much:

With “CVS” or ‘drw2’ for mac, the header contains ranges inside a structure, and other data like if it was compressed. When we see if it’s a valid file we check the ranges. There is no easy way to determine what hex values would be written because of flipping, Intel vs (PPC or 68K). Unfortunately, the research needed to identify the Hex value will require the original code for version 3.5 which we do not have access to easily. Canvas 3.5 code is 16 bit… this would also be an issue.

Let’s take a look at a couple of samples:

hexdump -C Canvas2.1-Sample | head
00000000  00 00 03 06 00 00 3d 9c  00 00 00 2a 00 00 00 0a  |......=....*....|
00000010  00 00 00 76 00 00 00 36  00 00 00 2e 00 00 00 1e  |...v...6........|
00000020  00 00 00 12 00 00 00 42  00 00 00 1a 00 00 00 82  |.......B........|
00000030  00 00 00 3c 00 66 00 01  00 00 3d 9c 00 48 00 00  |...<.f....=..H..|
00000040  40 02 90 00 00 00 00 00  00 00 00 00 00 00 00 00  |@...............|
00000050  00 01 00 00 01 00 00 00  00 20 00 40 00 60 00 80  |......... .@.`..|
00000060  00 c0 01 40 01 80 01 c0  02 40 02 80 00 00 00 00  |...@.....@......|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 05  |................|
00000080  00 00 00 00 00 01 00 10  00 00 00 01 00 03 3f fc  |..............?.|
00000090  80 00 00 00 00 00 00 00  00 07 00 01 00 01 00 0b  |................|

hexdump -C Canvas2-s02 | head
00000000  00 00 03 b2 00 00 07 ec  00 00 00 2a 00 00 00 0a  |...........*....|
00000010  00 00 00 76 00 00 00 36  00 00 00 2e 00 00 00 1e  |...v...6........|
00000020  00 00 00 12 00 00 00 42  00 00 00 1a 00 00 00 82  |.......B........|
00000030  00 00 00 3c 00 66 00 01  00 00 07 ec 00 48 00 00  |...<.f.......H..|
00000040  40 02 90 00 00 00 00 00  00 00 00 00 00 00 00 00  |@...............|
00000050  00 01 01 00 01 00 00 00  00 20 00 40 00 60 00 80  |......... .@.`..|
00000060  00 c0 01 40 01 80 01 c0  02 40 02 80 00 00 00 00  |...@.....@......|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 05  |................|
00000080  00 00 00 00 00 01 00 10  00 00 00 01 00 03 3f fc  |..............?.|
00000090  80 00 00 00 00 00 00 00  00 07 00 01 00 01 00 0b  |................|

hexdump -C Canvas3.04 | head
00000000  00 00 02 5a 00 00 00 1c  00 00 00 2a 00 00 00 0a  |...Z.......*....|
00000010  00 00 00 76 00 00 00 36  00 00 00 2e 00 00 00 1e  |...v...6........|
00000020  00 00 00 12 00 00 00 42  00 00 00 1a 00 00 00 82  |.......B........|
00000030  00 00 00 3c 00 68 00 02  00 00 00 1c 00 48 00 00  |...<.h.......H..|
00000040  40 02 90 00 00 00 00 00  00 00 00 00 00 00 00 00  |@...............|
00000050  00 01 01 00 01 03 00 00  00 20 00 40 00 60 00 80  |......... .@.`..|
00000060  00 c0 01 40 01 80 01 c0  02 40 02 80 00 00 00 00  |...@.....@......|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000080  00 01 00 00 00 01 00 10  00 00 00 01 00 03 3f fc  |..............?.|
00000090  80 00 00 00 00 00 00 00  00 07 00 01 00 01 00 0b  |................|

hexdump -C Canvas5-3.5-Sample1.CVS | head
00000000  00 00 01 58 00 00 01 30  00 00 00 2a 00 00 00 00  |...X...0...*....|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000030  00 00 00 00 00 69 00 02  00 00 01 30 00 48 00 00  |.....i.....0.H..|
00000040  40 02 90 00 00 00 00 00  00 00 00 00 00 00 00 00  |@...............|
00000050  00 01 01 01 00 00 00 00  00 20 00 40 00 60 00 80  |......... .@.`..|
00000060  00 c0 01 40 01 80 01 c0  02 40 02 80 00 00 00 00  |...@.....@......|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000080  00 01 00 00 00 01 00 10  00 00 00 01 00 03 3f fc  |..............?.|
00000090  80 00 00 00 00 00 00 00  00 07 00 01 00 01 00 01  |................|

hexdump -C C3-5-S01.CVS | head
00000000  78 11 00 00 10 00 00 00  2a 00 00 00 0a 00 00 00  |x.......*.......|
00000010  26 00 00 00 26 00 00 00  26 00 00 00 26 00 00 00  |&...&...&...&...|
00000020  96 00 00 00 2a 00 00 00  2e 00 00 00 32 00 00 00  |....*.......2...|
00000030  00 00 00 00 01 6b 01 00  50 14 00 00 28 00 00 00  |.....k..P...(...|
00000040  6e 00 00 00 5b 00 00 00  01 00 04 00 00 00 00 00  |n...[...........|
00000050  e8 13 00 00 12 0b 00 00  12 0b 00 00 00 00 00 00  |................|
00000060  00 00 00 00 00 00 00 00  00 00 80 00 00 80 00 00  |................|
00000070  00 80 80 00 80 00 00 00  80 00 80 00 80 80 00 00  |................|
00000080  c0 c0 c0 00 80 80 80 00  00 00 ff 00 00 ff 00 00  |................|
00000090  00 ff ff 00 ff 00 00 00  ff 00 ff 00 ff ff 00 00  |................|

In the version 2 & 3 samples you can see some patterns which I thought would allow for proper identification, but looking at more samples I found differences. One pattern I hoped might be consistent was the hex run “002000400060008000C00140018001C002400280”, but there are some samples which don’t match it. If the file is truly compressed, it will be hard to know which values would be consistent among all files. I have over 8,000 samples and a signature that only excludes around 20, so it will have to do for now.
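
Since the only lead for versions 1 through 3 is that repeating byte run, one way to test a candidate pattern is to batch-scan a folder of samples and count the exceptions. Here is a rough sketch of that kind of survey; the 512-byte search window and the flat folder of samples are assumptions on my part, not how the actual signature testing was done.

# Rough survey sketch: count how many sample files contain the candidate
# byte run somewhere in their first 512 bytes, and list the exceptions.
from pathlib import Path
import sys

PATTERN = bytes.fromhex("002000400060008000C00140018001C002400280")

def survey(folder, window=512):
    hits, misses = 0, []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as f:
            head = f.read(window)
        if PATTERN in head:
            hits += 1
        else:
            misses.append(path.name)
    print(f"{hits} files match, {len(misses)} do not")
    for name in misses[:20]:  # show the first few exceptions
        print("  miss:", name)

if __name__ == "__main__":
    survey(sys.argv[1])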

When we get to version 5, we start to see more identifiable headers, though there is still some oddness with certain samples. With an ASCII string like “CANVAS5”, it should be easy, right? Not so fast: in version 5 you can compress the file structure, which removes the easily identifiable “CANVAS5” string. Some compressed files do have a small string at the tail end, but others do not.

hexdump -C Canvas5-Sample1.CV5 | head
00000000  02 00 00 80 00 00 00 00  00 00 00 4e 96 00 00 4e  |...........N...N|
00000010  96 18 02 00 00 00 0e a8  da 43 41 4e 56 41 53 35  |.........CANVAS5|
00000020  00 01 00 00 00 00 00 05  03 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 00 21 00 00  00 21 00 00 00 79 00 00  |.....!...!...y..|
00000040  00 03 00 00 01 6b 00 00  00 03 00 00 00 01 ff ff  |.....k..........|
00000050  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|

hexdump -C Canvas5-Sample3-cmp.CV5 | head
00000000  02 00 00 80 00 00 00 00  08 00 00 80 00 00 00 03  |................|
00000010  5c ff ff ff ff 00 00 40  22 00 00 03 50 10 00 89  |\......@"...P...|
00000020  07 60 bd 0f f0 00 00 10  03 04 10 56 00 20 05 00  |.`.........V. ..|
00000030  e0 18 02 10 35 04 30 4e  05 30 72 07 f0 a8 0d a1  |....5.0N.0r.....|
00000040  17 11 81 19 05 50 5c 00  60 0f 00 10 80 02 90 80  |.....P\.`.......|
00000050  03 f0 56 05 50 55 05 b0  75 12 51 29 05 e0 55 05  |..V.PU..u.Q)..U.|

hexdump -C Canvas5-Sample3-cmp.CV5 | tail
00001ff0  00 00 00 01 08 a5 ab c0  00 00 00 00 3f 89 2c 58  |............?.,X|
00002000  00 00 00 00 08 a5 ab 80  00 00 00 00 ff d4 11 e4  |................|
00002010  00 00 00 00 08 a5 ab 90  00 02 3e d8 ff d3 12 cc  |..........>.....|
00002020  00 00 00 00 00 00 00 00  00 02 3e d8 00 01 00 09  |..........>.....|
00002030  00 00 00 00 00 00 00 00  00 00 00 00 08 a5 ab f8  |................|
00002040  00 00 00 00 43 4e 56 35                           |....CNV5|

Canvas 6 uses a new extension but has a similar file structure, again with compression as an option. Some of the compressed files made on Windows have the string reversed, “5VNC”, so many compressed Canvas 5 files look identical to compressed Canvas 6 files, complicating identification.

hexdump -C Canvas6-Sample.CNV | head
00000000  01 00 80 00 00 90 07 cd  07 00 80 00 00 00 80 00  |................|
00000010  00 17 01 00 00 59 f5 0e  00 43 41 4e 56 41 53 36  |.....Y...CANVAS6|
00000020  00 01 00 00 00 00 06 00  00 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 00 21 7a 00  00 00 7a 00 00 00 03 00  |.....!z...z.....|
00000040  00 00 6e 01 00 00 03 00  00 00 01 00 00 00 ff ff  |..n.............|
00000050  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|

hexdump -C Canvas6-Sample1-c.CNV | head
00000000  01 00 80 00 00 58 ea 2b  00 c2 1d 00 00 d0 09 00  |.....X.+........|
00000010  00 00 00 0f 2e 00 00 0b  07 00 00 09 c4 10 00 01  |................|
00000020  00 00 03 00 20 04 00 70  ff 00 80 05 00 c0 06 06  |.... ..p........|
00000030  50 20 03 00 0f 06 10 6b  00 a0 12 01 00 48 07 20  |P .....k.....H. |
00000040  6d 07 30 40 06 40 11 06  00 0b 05 00 10 00 10 71  |m.0@.@.........q|
00000050  01 40 21 00 00 59 01 00  0f 05 10 00 00 e1 14 00  |.@!..Y..........|

hexdump -C Canvas6-Sample1-c.CNV | tail
000016a0  00 00 00 12 f6 00 00 c0  f0 12 00 3c d0 80 7c 58  |...........<..|X|
000016b0  2f 14 00 00 00 00 00 bc  f4 8d 00 0f 00 00 00 00  |/...............|
000016c0  f1 12 00 7f 00 00 00 f8  2e 14 00 bc f4 8d 00 1c  |................|
000016d0  f2 12 00 04 f3 12 00 fc  d1 80 7c 09 04 00 00 00  |..........|.....|
000016e0  00 00 40 00 f2 12 00 ff  ff ff ff 00 f1 12 00 1c  |..@.............|
000016f0  f1 12 00 bc f4 8d 00 00  00 00 40 35 56 4e 43     |..........@5VNC|

While most have the “CANVAS6” string near the beginning, quite a few are missing the CNV5/5VNC string at the end. Instead, many have the string “%SI-0200” near the end, which I use in my signature suggestion. This structure remained the same from version 6 to 8.

hexdump -C Canvas8-S01.CNV | head
00000000  02 00 00 80 00 00 12 b8  80 00 00 11 19 00 00 11  |................|
00000010  19 18 02 00 00 00 0e f5  59 43 41 4e 56 41 53 36  |........YCANVAS6|
00000020  00 01 00 00 00 00 00 08  01 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 00 21 00 00  00 00 00 00 00 00 00 00  |.....!..........|
00000040  00 03 00 00 00 00 00 00  00 03 00 00 00 01 00 00  |................|
00000050  00 01 ff ff ff ff 00 00  00 02 00 00 00 02 00 00  |................|

But… there are plenty without these strings, with just the “%SI-0200” near the end.

hexdump -C TELEGRPH.CNV | head
00000000  02 00 00 80 00 00 00 00  08 00 00 80 00 00 00 3d  |...............=|
00000010  f2 ff ff ff ff 00 00 75  76 00 00 3d e6 10 00 ff  |.......uv..=....|
00000020  00 00 b3 0d 90 a9 03 b0  8a 07 f0 98 07 60 80 08  |.............`..|
00000030  d0 35 01 c0 58 01 e0 59  04 80 b8 03 90 38 02 f0  |.5..X..Y.....8..|
00000040  e2 00 20 0b 03 70 1d 03  20 36 0f 30 00 01 80 09  |.. ..p.. 6.0....|

hexdump -C TELEGRPH.CNV | tail
00006850  2b 2c f9 ae 30 00 00 00  20 00 00 00 01 00 00 00  |+,..0... .......|
00006860  0f 00 00 00 10 00 00 00  1e 00 00 00 07 00 00 00  |................|
00006870  64 65 6e 65 62 61 00 00  00 00 01 4c 25 53 49 2d  |deneba.....L%SI-|
00006880  30 32 30 30 6d 61 63 00  00 00 00 00 00 00 00 00  |0200mac.........|
00006890  00 00 00 00                                       |....|

In version 9 and later, the extension changes to CVX, but the format is similar, with the “CANVAS6” string at a slightly different offset. It is still used by the current version of Canvas X.

hexdump -C Canvas9-Sample1.cvx | head
00000000  00 00 00 00 00 00 00 00  00 00 02 00 00 80 00 07  |................|
00000010  d1 84 d0 00 00 80 00 00  00 80 00 18 02 00 00 00  |................|
00000020  0f b7 ef 43 41 4e 56 41  53 36 00 01 00 00 00 00  |...CANVAS6......|
00000030  00 09 00 00 00 03 34 00  00 00 04 00 00 00 00 00  |......4.........|
00000040  00 00 00 3c 42 45 47 49  4e 5f 50 52 45 56 49 45  |...<BEGIN_PREVIE|
00000050  57 5f 54 41 47 3e 21 00  00 00 75 00 00 00 79 00  |W_TAG>!...u...y.|
00000060  00 00 03 00 00 01 6b 00  00 00 03 00 00 00 01 ff  |......k.........|
00000070  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|

hexdump -C Canvas9-Sample1-compressed.cvx | tail
00004090  00 00 e0 20 00 57 80 00  00 00 00 00 0a 13 00 09  |... .W..........|
000040a0  00 00 04 00 00 00 00 01  00 00 00 00 bf ff e0 80  |................|
000040b0  bf ff e0 40 01 8c 5e 00  02 4a 22 d0 00 00 01 60  |...@..^..J"....`|
000040c0  bf ff e0 40 00 5c 08 18  00 00 00 00 00 0d 84 80  |...@.\..........|
000040d0  43 61 6e 76 61 73 39 2d  53 61 6d 70 6c 65 31 2d  |Canvas9-Sample1-|
000040e0  63 6f 6d 70 72 65 73 73  65 64 2e 63 76 78 00 18  |compressed.cvx..|
000040f0  bf ff e0 70 0a 12 6a a0  02 43 22 b4 00 0c aa 9c  |...p..j..C".....|
00004100  bf ff e0 80 00 00 00 01  00 00 00 00 00 0d 84 80  |................|
00004110  bf ff e0 b0 43 4e 56 35                           |....CNV5|

hexdump -C CanvasX2019-S01.cvx | head
00000000  00 00 00 00 00 00 00 00  00 00 01 00 80 00 00 00  |................|
00000010  6e ab 03 00 80 00 00 00  80 00 00 17 01 00 00 ef  |n...............|
00000020  b7 0f 00 43 41 4e 56 41  53 36 00 01 00 00 00 00  |...CANVAS6......|
00000030  09 00 00 4d 01 00 00 eb  4c 00 00 41 00 00 00 31  |...M....L..A...1|
00000040  52 45 56 03 00 00 00 01  00 00 00 00 00 00 00 00  |REV.............|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

This collection of file formats is very hard to make sense of: some really consistent patterns across many samples, but with lots of exceptions. Super confusing. The software has had a long run, with the later years staying pretty stagnant in terms of new development. It is worth defining and creating signatures for the consistent patterns now; we can dial in the variants over time.

The signatures I have built miss about 23 files in versions 1-3 out of the ~9,000 samples I have, and for Canvas 5 only some of the compressed files are currently not identified. So far all my CNV and CVX files identify correctly, so this is probably good for now.

Canvas X dropped support for the Macintosh, but an entirely different product called Canvas X Draw was released, which does support the Macintosh. Here is what a CVD file looks like:

hexdump -C CanvasXDraw7-Sample1.cvd | head
00000000  25 43 61 6e 76 61 73 43  56 44 09 31 2e 30 25 bb  |%CanvasCVD.1.0%.|
00000010  54 48 65 61 64 65 72 00  00 00 00 00 00 00 00 00  |THeader.........|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000030  00 bb 52 4d 61 63 4f 53  56 65 72 73 69 6f 6e 20  |..RMacOSVersion |
00000040  31 30 2e 31 33 2e 36 20  28 42 75 69 6c 64 20 31  |10.13.6 (Build 1|
00000050  37 47 31 34 30 34 32 29  31 30 2e 32 33 30 34 08  |7G14042)10.2304.|
00000060  00 00 00 70 6c 61 74 66  6f 72 6d 0a 73 00 00 00  |...platform.s...|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000080  00 00 00 00 00 05 00 00  00 02 00 00 00 00 00 00  |................|
00000090  00 08 00 00 00 6f 73 0a  73 00 00 00 00 00 00 00  |.....os.s.......|

There is also the matter of a Canvas Image, which the User Guide calls proxy images. They are raster images used as placements within Canvas documents. They should be easy to identify.

hexdump -C Canvas5-Sample1.CVI | head
00000000  00 00 00 01 44 41 44 35  50 52 4f 58 00 00 09 99  |....DAD5PROX....|
00000010  00 00 00 11 00 00 00 2d  00 00 00 03 00 00 00 08  |.......-........|
00000020  00 48 00 00 00 00 00 06  00 03 00 08 00 00 00 11  |.H..............|
00000030  00 00 00 2d 00 03 00 03  00 48 00 00 00 48 00 00  |...-.....H...H..|
00000040  00 00 00 00 00 00 00 00  00 00 00 11 00 00 00 2d  |...............-|
00000050  00 00 00 02 00 00 00 08  00 00 00 01 00 00 00 11  |................|
00000060  00 00 00 2d ff ff ff ff  ff ff ff ff ff ff ff ff  |...-............|
00000070  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
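
Pulling the markers above together, here is a rough triage sketch of the logic, not the actual PRONOM signatures I am proposing: check for “CANVAS5” or “CANVAS6” near the start, fall back to the trailing CNV5/5VNC and “%SI-0200” strings for the compressed files, and catch the “%CanvasCVD” and “DAD5PROX” headers for the Draw and proxy-image files. The 64-byte and 4 KB windows are guesses on my part.

# Rough Canvas triage sketch based on the markers discussed above.
import sys

def probe_canvas(path):
    with open(path, "rb") as f:
        head = f.read(64)
        f.seek(0, 2)                      # jump to end of file
        size = f.tell()
        f.seek(max(0, size - 4096))
        tail = f.read()

    if head.startswith(b"%CanvasCVD"):
        return "Canvas X Draw (CVD)"
    if head[4:12] == b"DAD5PROX":
        return "Canvas proxy image (CVI)"
    if b"CANVAS5" in head:
        return "Canvas 5 (CV5)"
    if b"CANVAS6" in head:
        return "Canvas 6-X (CNV/CVX)"
    # compressed files lose the header string; fall back to trailing markers
    if tail.endswith(b"CNV5") or tail.endswith(b"5VNC"):
        return "Canvas 5/6 compressed (trailing CNV5/5VNC)"
    if b"%SI-0200" in tail:
        return "Canvas 6-8 (trailing %SI-0200)"
    return "no Canvas markers found"

if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(name, "->", probe_canvas(name))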

Phew, if you held on for this whole post, you must really like confusing file format structures. This format has been on my mind on and off for about 6 years. Hopefully these signatures will work for the vast majority of the Canvas files found in archives and on personal systems. As always, here is my GitHub with the signatures I am proposing and a few samples to get you confused.

MAGIX

There are probably many reasons why a software developer might want to create a proprietary format to store their files in. The software may require special features that don’t fit into an existing format. I would hope a developer would try to use existing formats, or better yet open formats, but for many reasons, which probably include profits, they often choose to reinvent the wheel.

MAGIX is a German company which started making software in 1994. In 2001 they released their first video editing software, called Movie Edit Pro. The software seems to have been well received and is still in use today.

Like most video editing software, project files are used to store all the edits and links to the video files. These are usually small, text-based files, with many applications using XML as the project format. Not MAGIX; they decided to go with a different, yet well-known, format for their project files.

hexdump -C MAGIX15-s01.MVP | head
00000000  52 49 46 46 6c 37 01 00  53 45 4b 44 4d 56 50 48  |RIFFl7..SEKDMVPH|
00000010  08 00 00 00 00 00 00 00  00 00 00 00 4c 49 53 54  |............LIST|
00000020  0c 16 01 00 4d 56 50 4c  4c 49 53 54 00 16 01 00  |....MVPLLIST....|
00000030  56 49 50 4c 53 56 49 50  0c 07 00 00 00 dc 05 00  |VIPLSVIP........|
00000040  00 00 00 00 20 00 00 00  0c 00 00 00 80 bb 00 00  |.... ...........|
00000050  10 00 00 00 29 6b 55 e2  53 f8 3d 40 00 00 f0 42  |....)kU.S.=@...B|
00000060  01 00 00 00 bd 04 ef fe  00 00 01 00 06 00 08 00  |................|
00000070  00 00 01 00 06 00 08 00  00 00 01 00 3f 00 00 00  |............?...|
00000080  28 00 00 00 04 00 04 00  01 00 00 00 00 00 00 00  |(...............|
00000090  00 00 00 00 00 00 00 00  bd 8f 32 01 d0 02 00 00  |..........2.....|

Yes, they used the RIFF container format for their projects. It seems an odd choice at first, especially for video production, although RIFF is actually well suited for it; AVI is another video format which uses the RIFF container. The MVP project file uses the ID SEKD with the format MVPH. Earlier versions of Movie Edit Pro used a different extension.

hexdump -C MAGIXv11-s01.MVD | head
00000000  52 49 46 46 38 57 00 00  53 45 4b 44 53 56 49 50  |RIFF8W..SEKDSVIP|
00000010  70 00 00 00 00 dc 05 00  00 00 00 00 04 00 00 00  |p...............|
00000020  02 00 00 00 80 bb 00 00  10 00 00 00 8e 23 d6 e2  |.............#..|
00000030  53 f8 3d 40 00 00 f0 42  01 00 00 00 bd 04 ef fe  |S.=@...B........|
00000040  00 00 01 00 00 00 06 00  00 00 04 00 00 00 06 00  |................|
00000050  00 00 04 00 3f 00 00 00  28 00 00 00 04 00 04 00  |....?...(.......|
00000060  01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000070  c8 1b 32 01 d0 02 00 00  e0 01 00 00 52 d7 da fb  |..2.........R...|
00000080  54 55 f5 3f 4c 49 53 54  04 00 00 00 70 68 79 73  |TU.?LIST....phys|
00000090  4c 49 53 54 d0 3d 00 00  74 72 6b 73 4c 49 53 54  |LIST.=..trksLIST|

The MVD format, used by an earlier version of Movie Edit Pro, is also a RIFF with the ID SEKD, but has a format of SVIP.

RIFFpad can break down the chunks we see in an MVP file, and each of the LIST chunks has its own subchunks as well. I assume this is how the editing software stores references to each video/audio track, and so on. I’ll give MAGIX credit for at least using an understandable container to store their projects.
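
For a quick look without RIFFpad, walking the top-level chunks is straightforward. Here is a minimal sketch assuming the standard little-endian RIFF layout (a 4-byte chunk id, a 32-bit size, then data padded to an even length); it does not descend into the LIST subchunks.

# Minimal RIFF walker sketch: print the form type and the top-level
# chunks of a file such as an MVP project.
import struct
import sys

def walk_riff(path):
    with open(path, "rb") as f:
        header = f.read(12)
        if len(header) < 12 or header[:4] != b"RIFF":
            print("not a RIFF file")
            return
        size, form = struct.unpack("<I4s", header[4:12])
        print("form type:", form.decode("latin-1"))
        end = 8 + size
        while f.tell() < end:
            chunk = f.read(8)
            if len(chunk) < 8:
                break
            chunk_id, chunk_size = struct.unpack("<4sI", chunk)
            print(f"  chunk {chunk_id.decode('latin-1')} ({chunk_size} bytes)")
            f.seek(chunk_size + (chunk_size & 1), 1)  # skip data plus pad byte

if __name__ == "__main__":
    walk_riff(sys.argv[1])

Run against the MVP sample above, it should report the SEKD form type and list the MVPH chunk followed by the LIST chunks RIFFpad shows.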

MAGIX has also used RIFF in many of its supporting formats. So far I have found mfx, afx, ifx, cfx, ctf, tfx, ufx, mmt, mmm, and hdp files, each with its own format:

hexdump -C 101_Loud.mfx | head
00000000  52 49 46 46 a8 6f 00 00  53 45 4b 44 4d 41 46 58  |RIFF.o..SEKDMAFX|
00000010  00 00 00 00 4c 49 53 54  94 6f 00 00 41 55 46 58  |....LIST.o..AUFX|
00000020  4c 49 53 54 88 6f 00 00  41 46 58 45 46 58 48 44  |LIST.o..AFXEFXHD|
00000030  20 00 00 00 00 00 25 0d  00 00 00 00 02 00 00 00  | .....%.........|
00000040  01 00 00 00 00 00 00 00  03 18 00 00 00 00 00 00  |................|
00000050  00 00 00 00 4c 49 53 54  54 6f 00 00 41 46 58 44  |....LISTTo..AFXD|
00000060  4c 49 53 54 50 6a 00 00  41 46 58 45 46 58 48 44  |LISTPj..AFXEFXHD|
00000070  20 00 00 00 00 00 25 0d  00 00 00 00 05 00 00 00  | .....%.........|
00000080  01 00 00 00 00 00 00 00  03 18 00 00 00 00 00 00  |................|
00000090  00 00 00 00 4c 49 53 54  1c 6a 00 00 41 46 58 44  |....LIST.j..AFXD|

I am not sure of the best way to manage all of these in terms of identification, as I am not sure what the purpose of each format is. Maybe for now I’ll make a generic signature to catch them all as a MAGIX file; a rough sketch of that idea follows the table below.

Extension   ID      Format
AFX         SEKD    SAFX
CFX         SEKD    SCFX
CTF         SEKD    SVIP
HDP         SEKD    SHDP
IFX         SEKD    SIFX
MFX         SEKD    MAFX
MMM         SEKD    SVIP
MMT         SEKD    SVIP
MVD         SEKD    SVIP
MVP         SEKD    MVPH
MXM         MXMD    mxmi
TFX         SEKD    STFX
UFX         SEKD    SVIP
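
And here is a sketch of that generic catch-all idea: read what the table calls the ID (bytes 8-11, the RIFF form type) and the format (bytes 12-15, the first chunk id) and match the pair against the table. The pairings are only what I have observed in my samples, so treat this as a starting point rather than a definitive list.

# Sketch of a generic MAGIX check using the ID/format pairs from the
# table above. Only pairings observed in my samples are listed; MXM is
# assumed to follow the same layout.
import sys

MAGIX_IDS = {
    (b"SEKD", b"SAFX"): "MAGIX AFX",
    (b"SEKD", b"SCFX"): "MAGIX CFX",
    (b"SEKD", b"SHDP"): "MAGIX HDP",
    (b"SEKD", b"SIFX"): "MAGIX IFX",
    (b"SEKD", b"MAFX"): "MAGIX MFX",
    (b"SEKD", b"STFX"): "MAGIX TFX",
    (b"SEKD", b"MVPH"): "MAGIX Movie Edit Pro project (MVP)",
    (b"SEKD", b"SVIP"): "MAGIX file (CTF/MMM/MMT/MVD/UFX)",
    (b"MXMD", b"mxmi"): "MAGIX MXM",
}

def probe_magix(path):
    with open(path, "rb") as f:
        header = f.read(16)
    if len(header) < 16 or header[:4] != b"RIFF":
        return "not a RIFF file"
    return MAGIX_IDS.get((header[8:12], header[12:16]), "RIFF, but not a known MAGIX id")

if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(name, "->", probe_magix(name))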

But, when it comes to their proprietary MAGIX Video format, I think they may have pushed things a little too far. Meet the MXV format:

hexdump -C MAGIXv11-s01.mxv | head
00000000  4d 58 52 49 46 46 36 34  9a cb 2b 00 00 00 00 00  |MXRIFF64..+.....|
00000010  4d 58 4a 56 49 44 36 34  4d 58 4a 56 48 32 36 34  |MXJVID64MXJVH264|
00000020  70 00 00 00 00 00 00 00  70 00 00 00 03 00 00 00  |p.......p.......|
00000030  42 93 2b 00 00 00 00 00  f0 00 00 00 00 00 00 00  |B.+.............|
00000040  7b 2e 00 00 4b 00 00 00  01 00 00 00 00 00 00 00  |{...K...........|
00000050  8e 23 d6 e2 53 f8 3d 40  80 02 00 00 e0 01 00 00  |.#..S.=@........|
00000060  80 02 00 00 e0 01 00 00  04 00 00 00 43 15 00 00  |............C...|
00000070  f0 00 00 00 00 00 00 00  28 19 00 00 00 00 00 00  |........(.......|
00000080  55 55 55 55 55 55 f5 3f  00 00 00 00 00 00 00 00  |UUUUUU.?........|
00000090  7f dd 05 00 00 00 00 00  4d 58 4a 56 48 44 36 34  |........MXJVHD64|

I am not sure what I am looking at. Is it a RIFF? Is it a RIFF variant like RF64? MAGIX describes the format as follows:

This is the MAGIX video format for quicker processing with MAGIX products. It offers very low loss of quality, but it cannot be played via conventional DVD players.

MAGIX Video Pro X6

A look around the internet doesn’t bring up much in reference to this format, just my recent page on the format wiki. A search for MXRIFF64 brings up nothing. But a closer look at other strings within the MXV file reveals we are probably looking at some sort of MPEG format.

I was able to locate a project on GitHub which claims to be able to demux the MXV format. The software is written in Go and appears to indicate the format is chunk-based, with most of the chunks figured out. So if you find yourself stuck with some MXV files and don’t want to use the latest from MAGIX, this might be the tool for you.

This demuxer also has an interesting file you can download. It is called a “GRAMMAR” file and can be loaded into hex viewers like Synalyze It! to show the parts of a file you load. It’s a great way to explore a format!

None of these formats are found in PRONOM. Project files are not usually kept in archives, but it would be good to know about the RIFF files if they do turn up. The video format is for sure something the archival world should know about. MediaInfo is currently not aware of this format either, but adding it seems like it might be an easy task.

As usual, you can see some samples and my proposed signatures on my GitHub.