Daisy

A single file can often be self-contained, holding everything needed to render it with the correct software, but more and more often files need other files to function properly. Sometimes these groups of dependent files sit within a container, such as a DOCX or ePub, but they can also be found all sitting nicely in a folder. I say nicely because the structure works, at least until the files are treated as individual files and renamed or moved around, breaking their dependence on each other.

Many Apple bundle files appear to be a single file when viewed on macOS, but show up as a folder on Windows or Linux. This can be very confusing. In other cases, such as the DAISY Digital Talking Book format, the format is simply a folder or disc with a few or many files within.

Current tools used to identify file formats, such as DROID, look at individual files, not groups of files to determine format. Each file within a folder may have a unique format, but when grouped with other specific formats they become something more. We will have to work on enhancing current tools if we want to avoid breaking these format types and losing their ability to render properly.

DAISY, or Digital Accessible Information System, is a type of digital book. The format was originally conceived in 1988 as a way to create talking books, giving those who are visually impaired the ability to listen to books. It wasn’t until 1996 that the DAISY Consortium was created to take the technology to those who needed it. The original version of the DAISY format in 1994 was proprietary, but once the consortium formed, it decided to adopt open standards for the format, and in 1998 the DAISY 2.0 standard was released. You can read more on the Library of Congress Format Description page.

Let’s take a look at a folder containing a DAISY 2.0 book.

ls -la "DAISY 2.02 export"
total 536
drwx------ 1 tyler staff 16384 Sep 25 22:06 .
drwx------ 1 tyler staff 16384 Sep 25 22:06 ..
-rwx------@ 1 tyler staff 1090 Sep 25 22:05 0002.smil
-rwx------ 1 tyler staff 228413 Sep 25 22:05 aud0001.mp3
-rwx------@ 1 tyler staff 672 Sep 25 22:05 master.smil
-rwx------ 1 tyler staff 1703 Sep 25 22:05 ncc.html

We can see three different formats in this folder: the obvious, well-known MP3 file and an HTML file, plus two files with the extension SMIL.

“Synchronized Multimedia Integration Language”, or SMIL, is a W3C XML standard used to describe multimedia presentations. It is now in its third version and is used in the DAISY DTB as well as other applications, but we will focus on DAISY. A SMIL file has this structure:

<?xml version="1.0"?>
<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 1.0//EN" "http://www.w3.org/TR/REC-smil/SMIL10.dtd">
<smil>
<head>
<meta name="dc:title" content="Obi Project" />
<meta name="dc:identifier" content="589c550e-303b-4c0d-9921-ae76d782fd53" />
<meta name="ncc:generator" content="Obi v5.0.0.0 with toolkit: UrakawaSDK.core v2.0.0.0 (http://urakawa.sf.net/obi)" />
<meta name="dc:format" content="Daisy 2.02" />
<meta name="ncc:timeInThisSmil" content="00:00:28" />
<layout>
<region id="textView" />
</layout>
</head>
<body>
<ref title="Testing" src="0002.smil" id="ms_0002" />
</body>
</smil>

This is a standard XML file with a link to a SMIL DTD and a root tag of <smil>. The format is recognized by PRONOM as fmt/205, although it is often identified as a standard XML file. It seems the signature was created with a small offset, which works with some SMIL files, but the gap allowed between the end of the XML declaration and the start of the <smil> tag is only 20-86 bytes, not enough to allow for different character sets and full DTD URLs. We will have to increase this gap in order to get all the SMIL files identified correctly.
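To make the gap problem concrete, here is a minimal Python sketch of the same idea: find the XML declaration, allow a bounded gap, then require the <smil> root tag. It is not the PRONOM signature itself. The name looks_like_smil and the 1024-byte MAX_GAP are my own assumptions for illustration, and the pattern only covers ASCII-compatible encodings such as UTF-8.

```python
import re

# Assumed, illustrative limit. The existing fmt/205 gap is described
# above as 20-86 bytes; 1024 is simply a roomier guess.
MAX_GAP = 1024

SMIL_RE = re.compile(
    rb"(?:\xef\xbb\xbf)?<\?xml[^>]*\?>"   # optional UTF-8 BOM, then the XML declaration
    + rb"[\s\S]{0,%d}?" % MAX_GAP          # bounded gap (DOCTYPE, comments, whitespace)
    + rb"<smil[\s>]"                       # the root tag
)


def looks_like_smil(head: bytes) -> bool:
    """Return True if the start of a file matches the sketch pattern."""
    return SMIL_RE.match(head) is not None
```

Passing the first few kilobytes of a file is enough, for example looks_like_smil(open(path, "rb").read(4096)). A UTF-16 file would need its own pattern, which is part of why different character sets need a wider gap.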

With this update all the files in a DAISY 2.0 book should be identified individually, but as a set of files they make up the DAISY 2.0 format. This format requires that the ncc.html file be present at the root of the folder or CD, so this file will aid in the manual identification of this format.
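That manual check can be roughed out in a few lines of Python. This is only a sketch under my own assumptions: the function name looks_like_daisy202 is hypothetical, matching file names case-insensitively is a guess at how discs behave in the wild, and requiring at least one SMIL file is an extra rule I added rather than something taken from the specification.

```python
from pathlib import Path


def looks_like_daisy202(folder) -> bool:
    """Guess whether a folder is the root of a DAISY 2.0 book.

    Looks only at file names in the top level of the folder.
    """
    names = {p.name.lower() for p in Path(folder).iterdir() if p.is_file()}
    has_ncc = "ncc.html" in names                        # required at the root
    has_smil = any(n.endswith(".smil") for n in names)   # assumed extra check
    return has_ncc and has_smil
```

A name-based check like this says nothing about whether the files inside are valid, so it works best alongside per-file identification.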

DAISY 3 was released in 2002 and standardized as ANSI/NISO Z39.86-2002. It has been revised a couple of times, with the current revision being 2012. This update adds more functionality to the format, with many new optional and required formats/files included in the folder. Here is a simple example:

ls -la "DAISY3 Export"
total 784
drwx------ 1 tyler staff 16384 Sep 25 22:06 .
drwx------ 1 tyler staff 16384 Sep 25 22:06 ..
-rwx------@ 1 tyler staff 979 Sep 25 22:05 0001.smil
-rwx------ 1 tyler staff 228413 Sep 25 22:05 aud0001.mp3
-rwx------ 1 tyler staff 1014 Sep 25 22:05 navigation.ncx
-rwx------ 1 tyler staff 1881 Sep 25 22:05 package.opf
-rwx------ 1 tyler staff 7838 Nov 2 2020 tpbnarrator.res
-rwx------ 1 tyler staff 117656 Nov 2 2020 tpbnarrator_res.mp3

The SMIL format is still included, along with MP3s, but we have some additional formats. The NCX or “Navigation Control File”, the OPF or “Package file”, and the RES or “Resource file” are a few of them. The NCX file is the first file accessed, as it lays out the navigation for the whole DTB. It is also XML:

cat "DAISY3 Export/navigation.ncx"
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE ncx PUBLIC "-//NISO//DTD ncx 2005-1//EN" "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">
<ncx
version="2005-1"
xml:lang="en-US" xmlns="http://www.daisy.org/z3986/2005/ncx/">

DROID recognizes this file only as a standard XML file. It should probably have its own identification like SMIL, and with a root tag of <ncx> that should be fairly easy to add.
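As a rough illustration of identifying by root tag, the sketch below uses Python’s standard XML parser to read only as far as the first element and compares it with the NCX namespace shown in the listing above. The helper names root_tag and is_ncx are hypothetical, and a real PRONOM signature would be a byte pattern rather than a parse.

```python
import xml.etree.ElementTree as ET

# Namespace and element name taken from the navigation.ncx listing above.
NCX_ROOT = "{http://www.daisy.org/z3986/2005/ncx/}ncx"


def root_tag(path):
    """Return the namespaced tag of the first element, or None if unparsable."""
    try:
        with open(path, "rb") as fh:
            for _event, elem in ET.iterparse(fh, events=("start",)):
                return elem.tag  # stop at the root element
    except ET.ParseError:
        pass
    return None


def is_ncx(path) -> bool:
    return root_tag(path) == NCX_ROOT
```

Including the namespace in the comparison helps avoid matching unrelated XML that happens to use an <ncx> element.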

The Package file, with the extension OPF, is actually a format used by the openebook group, not to be confused with a format used by the Open Preservation Foundation 🤣. The Open Packaging Format is used, and a DTB conforming to this standard must include exactly one Package File, which must be a valid XML 1.0 document conforming to the OEBF Publication Structure 1.2 package.

cat "DAISY3 Export/package.opf"
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE package PUBLIC "+//ISBN 0-9673008-1-9//DTD OEB 1.2 Package//EN" "http://openebook.org/dtds/oeb-1.2/oebpkg12.dtd">
<package
unique-identifier="uid" xmlns="http://openebook.org/namespaces/oeb-package/1.0/">
<metadata>
<dc-metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oebpackage="http://openebook.org/namespaces/oeb-package/1.0/">
<dc:Identifier
id="uid">589c550e-303b-4c0d-9921-ae76d782fd53</dc:Identifier>
<dc:Format>ANSI/NISO Z39.86-2005</dc:Format>
<dc:Title>Obi Project</dc:Title>
<dc:Publisher>N/A</dc:Publisher>
<dc:Language>en-US</dc:Language>
<dc:Creator>Creator name</dc:Creator>
<dc:Date>2024-09-25</dc:Date>
</dc-metadata>

The OPF format is also unknown to PRONOM, so these files are identified as standard XML as well. The root tag of “<package>” could be used elsewhere, so the signature may need to reference the OEB package information.
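One way to reference the OEB package information is to key on the DOCTYPE public identifier instead of the root tag. The Python sketch below does that for the three public identifiers that appear in the listings in this post. The function name classify_by_doctype and the label strings are hypothetical, and this is an illustration of the approach, not a PRONOM signature.

```python
import re

# Public identifiers copied from the DOCTYPE lines shown in this post.
# The labels on the right are my own wording.
PUBLIC_IDS = {
    b"+//ISBN 0-9673008-1-9//DTD OEB 1.2 Package//EN": "OEB 1.2 package file (OPF)",
    b"-//NISO//DTD ncx 2005-1//EN": "DAISY 3 navigation control file (NCX)",
    b"-//NISO//DTD resource 2005-1//EN": "DAISY 3 resource file (RES)",
}

# <!DOCTYPE name PUBLIC "public identifier" ...
DOCTYPE_RE = re.compile(rb"<!DOCTYPE\s+\S+\s+PUBLIC\s+([\"'])(.*?)\1")


def classify_by_doctype(head: bytes):
    """Return a label for a known public identifier, or None."""
    match = DOCTYPE_RE.search(head[:4096])
    if match is None:
        return None
    return PUBLIC_IDS.get(match.group(2))
```

A file with no DOCTYPE, or with one not in the table, simply comes back as None and falls through to generic XML.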

The RES Resource file is also a standard XML file and can be identified through its root tag of “<resources>” and its resources DOCTYPE.

cat "DAISY3 Export/tpbnarrator.res"
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE resources PUBLIC "-//NISO//DTD resource 2005-1//EN" "http://www.daisy.org/z3986/2005/resource-2005-1.dtd" []>
<resources xmlns="http://www.daisy.org/z3986/2005/resource/" version="2005-1">

<!-- SKIPPABLE NCX -->

<scope nsuri="http://www.daisy.org/z3986/2005/ncx/">
<nodeSet id="ns001" select="//smilCustomTest[@bookStruct='LINE_NUMBER']">
<resource xml:lang="en" id="r001">
<text>Row</text>
<audio src="tpbnarrator_res.mp3" clipBegin="0:00:02.379" clipEnd="0:00:03.416" />
</resource>
</nodeSet>

Adding these DAISY 3.0 formats will greatly improve the identification of this complex format. But we run into a problem with some of the software that generates these DAISY files: some tools include files that are not required by the format but are used by the software itself. These can include CSS files for formatting, additional XML and XSL files, DTDs, and, for DAISY files created by the PlexTalk software, additional project files.

ls -la MasterCD/AfterBuild 
total 7520
drwx------@ 1 tyler staff 16384 Sep 24 19:34 .
drwx------@ 1 tyler staff 16384 Sep 25 22:11 ..
-rwx------@ 1 tyler staff 6688 Sep 25 01:32 ImdPhrInfo.imph
-rwx------@ 1 tyler staff 3773 Sep 25 01:32 ImdTxtTabl.imtt
-rwx------@ 1 tyler staff 1276 Sep 25 01:32 Ncc.imdn
-rwx------@ 1 tyler staff 3716618 Sep 25 01:32 a000001.mp3
-rwx------@ 1 tyler staff 4352 Sep 25 01:32 ncc.html
-rwx------@ 1 tyler staff 1015 Sep 25 01:32 ptk000001.smil
-rwx------@ 1 tyler staff 938 Sep 25 01:32 ptk000002.smil

The ncc.html file is here, indicating a DAISY 2.0 format, along with an MP3 and SMIL files, but the folder also includes some additional formats.

In addition, when creating a project, four files with the extensions Ncc.imdn, ImdPhrInfo.imph, ImdTxtTabl.imtt, and METADATA.ini are automatically created. These files are called “Plextalk project files.” They store table of contents information, etc. (Plextalk project files generated by older versions of this product do not have METADATA.ini.)

http://www.plextalk.com/jp/dw_data/PRSStd/PLEX_RS_UM.html

These four files may not be crucial to the playing of the Daisy format, but they are important to the PlexTalk software.

hexdump -C ImdPhrInfo.imph | head
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000020 ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 f0 a3 0d 00 00 00 00 00 |................|
00000040 a3 06 00 00 a4 06 00 00 00 00 00 00 53 00 00 00 |............S...|
00000050 ff ff ff ff 01 00 00 00 03 00 00 00 00 00 00 00 |................|
00000060 00 00 00 00 00 00 00 00 c5 11 00 00 20 1a 00 00 |............ ...|
00000070 e5 2b 00 00 00 00 00 00 63 00 00 00 ff ff ff ff |.+......c.......|
00000080 02 00 00 00 04 00 00 00 00 00 00 00 00 00 00 00 |................|
00000090 00 00 00 00 e5 2b 00 00 d6 0b 00 00 bb 37 00 00 |.....+.......7..|

hexdump -C ImdTxtTabl.imtt | head
00000000 17 00 00 00 32 30 30 34 2f 30 35 2f 33 31 2f 31 |....2004/05/31/1|
00000010 36 3a 36 3a 34 37 2e 30 30 30 00 03 00 00 00 65 |6:6:47.000.....e|
00000020 6e 00 0b 00 00 00 69 73 6f 2d 38 38 35 39 2d 31 |n.....iso-8859-1|
00000030 00 0d 00 00 00 5a 3a 2f 42 6f 6f 6b 44 69 72 34 |.....Z:/BookDir4|
00000040 2f 00 0d 00 00 00 5a 3a 2f 42 6f 6f 6b 44 69 72 |/.....Z:/BookDir|
00000050 34 2f 00 0c 00 00 00 61 30 30 30 30 30 31 2e 6d |4/.....a000001.m|
00000060 70 33 00 0c 00 00 00 61 30 30 30 30 30 31 2e 6d |p3.....a000001.m|
*
00000980 70 33 00 08 00 00 00 48 65 61 64 69 6e 67 00 01 |p3.....Heading..|
00000990 00 00 00 00 08 00 00 00 48 65 61 64 69 6e 67 00 |........Heading.|

hexdump -C Ncc.imdn | head
00000000 01 ff 00 ff c4 00 00 00 3c 00 00 00 2c 00 00 00 |........<...,...|
00000010 14 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 00 00 00 49 6d 64 54 78 74 54 61 62 6c 2e 69 |....ImdTxtTabl.i|
00000030 6d 74 74 00 00 00 00 00 00 00 00 00 00 00 00 00 |mtt.............|
00000040 00 00 00 00 49 6d 64 50 68 72 49 6e 66 6f 2e 69 |....ImdPhrInfo.i|
00000050 6d 70 68 00 00 00 00 00 00 00 00 00 00 00 00 00 |mph.............|
00000060 00 00 00 00 04 00 00 00 00 fa 00 00 44 ac 00 00 |............D...|
00000070 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000080 00 00 00 00 01 00 00 00 08 00 00 00 12 00 00 00 |................|
00000090 03 00 00 00 00 00 00 00 01 00 00 00 ff ff ff ff |................|

I don’t have a METADATA.ini file to research, but I will be honest, these PlexTalk files will be hard to identify from their contents.

Looking at the IMPH file, there aren’t a lot of bytes that look like format magic bytes. But I do see some patterns. The first 40 bytes all seem to be the same.

00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 FFFFFFFF FFFFFFFF

But making a signature from only 00 and FF might clash with other formats. It does appear that the 4 bytes FFFFFFFF recur with 40 bytes between each occurrence (at offsets 0x24, 0x50 and 0x7C in the dump above). This pattern might be specific enough if we repeat it a couple of times.
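Here is a small Python sketch of that idea, based only on the one IMPH dump shown above: 32 bytes of 00, 8 bytes of FF, then FFFFFFFF repeating at a 44-byte stride starting at 0x24. The name looks_like_imph and the choice of three repeats are my own assumptions, and other samples may not follow the same layout.

```python
def looks_like_imph(data: bytes, repeats: int = 3) -> bool:
    """Heuristic check modelled on a single ImdPhrInfo.imph sample."""
    # Fixed 40-byte header: 32 x 00 followed by 8 x FF.
    if data[:32] != b"\x00" * 32 or data[32:40] != b"\xff" * 8:
        return False
    # FFFFFFFF at 0x24, 0x50, 0x7C: 4 marker bytes plus 40 bytes of payload.
    first, stride = 0x24, 44
    for i in range(repeats):
        offset = first + i * stride
        if data[offset:offset + 4] != b"\xff" * 4:
            return False
    return True
```

Requiring several repeats is what keeps this from matching any file that merely starts with zeros, though a very short IMPH file with fewer records would be rejected.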

The IMTT file is different. It appears to hold information on the name, character set and all the files in the Daisy package. The first 4 bytes in my 14 samples are either 17000000 or 18000000. Not knowing what the 17 or 18 refers to, I am hesitant to use it for identification. In between some of the data there are some consistent bytes, but at different offsets.


hexdump -C ImdTxtTabl.imtt | head
00000000 18 00 00 00 54 69 74 6c 65 00 35 39 2d 31 00 31 |....Title.59-1.1|
00000010 35 3a 35 34 3a 35 39 2e 32 36 30 00 03 00 00 00 |5:54:59.260.....|
00000020 65 6e 00 0b 00 00 00 69 73 6f 2d 38 38 35 39 2d |en.....iso-8859-|
00000030 31 00 01 00 00 00 00 01 00 00 00 00 01 00 00 00 |1...............|
00000040 00 01 00 00 00 00 01 00 00 00 00 01 00 00 00 00 |................|
00000050 01 00 00 00 00 01 00 00 00 00 0c 00 00 00 4d 61 |..............Ma|
00000060 72 69 6f 6e 20 53 79 6d 65 00 28 00 00 00 4d 69 |rion Syme.(...Mi|
00000070 6e 75 74 65 73 20 6f 66 20 74 68 65 20 43 6f 6d |nutes of the Com|
00000080 6d 69 74 74 65 65 20 4d 65 65 74 69 6e 67 20 32 |mittee Meeting 2|
00000090 34 30 35 30 34 00 08 00 00 00 48 65 61 64 69 6e |40504.....Headin|

hexdump -C ImdTxtTabl.imtt | head
00000000 17 00 00 00 32 30 30 34 2f 30 35 2f 33 31 2f 31 |....2004/05/31/1|
00000010 36 3a 36 3a 34 37 2e 30 30 30 00 03 00 00 00 65 |6:6:47.000.....e|
00000020 6e 00 0b 00 00 00 69 73 6f 2d 38 38 35 39 2d 31 |n.....iso-8859-1|
00000030 00 0d 00 00 00 5a 3a 2f 42 6f 6f 6b 44 69 72 34 |.....Z:/BookDir4|
00000040 2f 00 0d 00 00 00 5a 3a 2f 42 6f 6f 6b 44 69 72 |/.....Z:/BookDir|
00000050 34 2f 00 0c 00 00 00 61 30 30 30 30 30 31 2e 6d |4/.....a000001.m|
00000060 70 33 00 0c 00 00 00 61 30 30 30 30 30 31 2e 6d |p3.....a000001.m|

Not sure what any of it means, but might be good enough for a signature.
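One possible reading of the two dumps above, and it is only a guess from these samples: each field is a 4-byte little-endian length followed by that many bytes ending in a NUL. Under that reading 0x17 (23) covers “2004/05/31/16:6:47.000” plus its NUL, 0x18 (24) covers the first block of the other sample, and 03 and 0b cover “en” and “iso-8859-1”. The Python sketch below walks the fields that way. The names read_fields and looks_like_imtt are hypothetical, and the layout is unverified against any documentation.

```python
import struct


def read_fields(data: bytes, limit: int = 4):
    """Split data into length-prefixed fields (assumed layout, see above)."""
    fields, pos = [], 0
    while len(fields) < limit and pos + 4 <= len(data):
        (length,) = struct.unpack_from("<I", data, pos)  # little-endian uint32
        pos += 4
        if length == 0 or pos + length > len(data):
            break
        fields.append(data[pos:pos + length])
        pos += length
    return fields


def looks_like_imtt(data: bytes) -> bool:
    """Heuristic: three NUL-terminated fields, the last two printable ASCII."""
    fields = read_fields(data, limit=3)
    if len(fields) < 3 or not all(f.endswith(b"\x00") for f in fields):
        return False
    return all(
        len(f) > 1 and f[:-1].isascii() and f[:-1].decode("ascii").isprintable()
        for f in fields[1:]
    )
```

If this reading holds for more samples, the variable first length would explain why the consistent bytes land at different offsets from file to file.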

Now the IMDN files might be a little easier:

hexdump -C Ncc.imdn | head
00000000 01 ff 00 ff d4 00 00 00 3c 00 00 00 2c 00 00 00 |........<...,...|
00000010 14 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 00 00 00 49 6d 64 54 78 74 54 61 62 6c 2e 69 |....ImdTxtTabl.i|
00000030 6d 74 74 00 00 00 00 00 00 00 00 00 00 00 00 00 |mtt.............|
00000040 00 00 00 00 49 6d 64 50 68 72 49 6e 66 6f 2e 69 |....ImdPhrInfo.i|
00000050 6d 70 68 00 00 00 00 00 00 00 00 00 00 00 00 00 |mph.............|
00000060 00 00 00 00 04 00 00 00 00 7d 00 00 22 56 00 00 |.........}.."V..|
00000070 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000080 00 00 00 00 01 00 00 00 28 00 00 00 28 00 00 00 |........(...(...|
00000090 00 00 00 00 00 00 00 00 28 00 00 00 ff ff ff ff |........(.......|

This format directly names the two other project files. It should be easy to look for the two file names in the header. The NCC HTML file in Daisy 2.0 and the NCX XML file in Daisy 3.0 are directory files, so it makes sense this file would do the same.
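A minimal Python sketch of that check, using the offsets seen in the two Ncc.imdn dumps above: both start with 01 FF 00 FF, with the file names at 0x24 and 0x44. The function name looks_like_imdn is hypothetical, and treating the first four bytes as fixed is an assumption drawn from just these two dumps.

```python
IMDN_START = bytes.fromhex("01ff00ff")  # same in both dumps; assumed constant


def looks_like_imdn(data: bytes) -> bool:
    """Heuristic check for a PlexTalk Ncc.imdn file."""
    return (
        data[:4] == IMDN_START
        and data[0x24:0x34] == b"ImdTxtTabl.imtt\x00"   # name at offset 0x24
        and data[0x44:0x54] == b"ImdPhrInfo.imph\x00"   # name at offset 0x44
    )
```

If other versions of the software store the names at different offsets, searching for both names within the first couple hundred bytes would be a looser alternative.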

Not sure if these signatures will hold up over time, but they are a start. It would be nice if all the files we are given to preserve would have convenient static magic bytes, but alas, many do not and we have to guess.

These Daisy formats illustrate a problem in preservation that doesn’t quite have a good solution. Each of these files is individually unique and can be identified, but as a whole they represent another unique format. Tying formats together to record their dependence on each other will be no small task, but it will be necessary, not only to understand the format but also to avoid separating, renaming, or rearranging the files and breaking that interdependence.

I have added the update to SMIL and new signatures for the other formats to my GitHub repository. Feel free to test and change if you find additional samples or information.