NCH Software

December 6, 2024 by Thor

Recently I came across a piece of software which used dozens of extensions for a single file format.

I find it hard to understand why a software developer would choose to use a new extension for every variation. It's all the same format, but makes #digipres more complicated. #fileformats #goodgrief pic.twitter.com/rz0IpK2i0j
— Tyler Thorsted (@CHLThor) July 16, 2024

These T-Shirt Factory Deluxe files are a bit of an extreme case, probably a prank against all of us doing file format identification. If you know who made this decision, I would like to have a chat.

This is not the first time I have come across a format which seems to have been used for more than one software title. A while back I tried to find more information on a file format used with many tools created by MetaCreations. It was called “Composite File Management System”, and was used with Kai’s Power Tools, Bryce3D, Ray Dream, Poser, and others. I did a previous post about the format.

I came across another recently with a similar issue. There are also many different software titles with the same native format.

NCH Software is an Australian software company who produce a massive number of software titles covering many different needs. From Audio Editing to Business charts and from Accounting tools to a 3D model converter, they have it all. Their audio editing software WavePad is quite popular. My initial entry into their software world was for the specialized Dictation/Scribe software which produced a slightly proprietary audio format with the extension DCT. This format does not use the format many of the other titles use.

With the number of different titles, it probably makes sense they use the same file structure to make processing/programming more efficient. They appear to be mostly proprietary binary files.

hexdump -C Wavepad/Untitled2.wpp | head
00000000  6c 73 64 66 01 00 1a 00  00 00 07 00 00 00 00 00  |lsdf............|
00000010  ca 84 20 00 00 00 00 00  e9 03 00 00 a5 84 20 00  |.. ........... .|
00000020  00 00 00 00 d0 07 00 00  99 84 20 00 00 00 00 00  |.......... .....|
00000030  d1 07 06 00 24 00 00 00  00 00 00 00 2f 55 73 65  |....$......./Use|
00000040  72 73 2f 74 79 6c 65 72  2f 44 65 73 6b 74 6f 70  |rs/tyler/Desktop|
00000050  2f 55 6e 74 69 74 6c 65  64 5f 30 2e 77 61 76 00  |/Untitled_0.wav.|
00000060  dc 07 02 00 04 00 00 00  00 00 00 00 00 00 00 00  |................|
00000070  d2 07 03 00 08 00 00 00  00 00 00 00 00 00 00 00  |................|
00000080  00 00 00 00 d3 07 03 00  08 00 00 00 00 00 00 00  |................|
00000090  00 00 00 00 00 00 00 00  d4 07 03 00 08 00 00 00  |................|

hexdump -C Crescendo/examples/Grooving.cdo | head
00000000  6c 73 64 66 01 00 05 00  00 00 03 00 00 00 00 00  |lsdf............|
00000010  8a b5 00 00 00 00 00 00  00 10 00 00 65 05 00 00  |............e...|
00000020  00 00 00 00 01 11 04 00  04 00 00 00 00 00 00 00  |................|
00000030  00 00 00 41 02 11 02 00  04 00 00 00 00 00 00 00  |...A............|
00000040  05 00 00 00 03 11 04 00  04 00 00 00 00 00 00 00  |................|
00000050  00 00 52 43 04 11 04 00  04 00 00 00 00 00 00 00  |..RC............|
00000060  00 80 94 43 05 11 04 00  04 00 00 00 00 00 00 00  |...C............|
00000070  00 00 a0 41 06 11 02 00  04 00 00 00 00 00 00 00  |...A............|
00000080  01 00 00 00 07 11 04 00  04 00 00 00 00 00 00 00  |................|
00000090  00 00 00 00 08 11 04 00  04 00 00 00 00 00 00 00  |................|

hexdump -C Spin3D/bunny.3dp | head
00000000  6c 73 64 66 01 00 20 00  00 00 01 00 00 00 00 00  |lsdf.. .........|
00000010  ec bc 65 00 00 00 00 00  00 10 00 00 e0 bc 65 00  |..e...........e.|
00000020  00 00 00 00 00 12 00 00  38 bc 65 00 00 00 00 00  |........8.e.....|
00000030  01 12 07 00 8c 26 26 00  00 00 00 00 cc d1 27 3f  |.....&&.......'?|
00000040  1c b5 80 3f 3c f4 bd 3d  d9 79 27 3f de af 80 3f  |...?<..=.y'?...?|
00000050  bf 81 a9 3d ad fa 28 3f  10 e7 7d 3f 05 a8 a9 3d  |...=..(?..}?...=|
00000060  ec a4 1a 3f 56 29 49 3f  ab d0 c0 3d 3e 3c 1f 3f  |...?V)I?...=><.?|
00000070  5f ed 4c 3f 5a 48 c0 3d  04 59 1b 3f 48 53 49 3f  |_.L?ZH.=.Y.?HSI?|
00000080  42 e9 ab 3d 74 5d 1c 3f  05 6c 3b 3f f7 03 5e 3d  |B..=t].?.l;?..^=|
00000090  46 d2 1a 3f f6 d4 3e 3f  ef ac 5d 3d 94 db 1a 3f  |F..?..>?..]=...?|

hexdump -C Voxal/Geek.voxal | head
00000000  6c 73 64 66 01 00 0c 00  00 00 01 00 00 00 00 00  |lsdf............|
00000010  ea 01 00 00 00 00 00 00  ec 03 01 00 01 00 00 00  |................|
00000020  00 00 00 00 01 e8 03 00  00 a9 01 00 00 00 00 00  |................|
00000030  00 00 20 02 00 04 00 00  00 00 00 00 00 13 00 00  |.. .............|
00000040  00 00 10 00 00 39 00 00  00 00 00 00 00 00 10 00  |.....9..........|
00000050  00 0d 00 00 00 00 00 00  00 00 20 01 00 01 00 00  |.......... .....|
00000060  00 00 00 00 00 00 01 20  04 00 04 00 00 00 00 00  |....... ........|
00000070  00 00 c3 f5 40 41 02 20  02 00 04 00 00 00 00 00  |....@A. ........|
00000080  00 00 22 00 00 00 00 20  02 00 04 00 00 00 00 00  |..".... ........|
00000090  00 00 0e 00 00 00 00 10  00 00 29 00 00 00 00 00  |..........).....|

hexdump -C PhotoPad/test.ppp | head
00000000  6c 73 64 66 01 00 02 00  00 00 00 00 00 00 00 00  |lsdf............|
00000010  ee 3c 00 00 00 00 00 00  c9 00 01 00 01 00 00 00  |.<..............|
00000020  00 00 00 00 00 04 00 00  00 d5 3c 00 00 00 00 00  |..........<.....|
00000030  00 02 00 00 00 c9 3c 00  00 00 00 00 00 03 00 06  |......<.........|
00000040  00 0f 00 00 00 00 00 00  00 6f 72 69 67 69 6e 61  |.........origina|
00000050  6c 5f 69 6d 61 67 65 00  01 00 00 00 85 3c 00 00  |l_image......<..|
00000060  00 00 00 00 07 00 07 00  79 3c 00 00 00 00 00 00  |........y<......|
00000070  89 50 4e 47 0d 0a 1a 0a  00 00 00 0d 49 48 44 52  |.PNG........IHDR|
00000080  00 00 04 00 00 00 03 00  08 06 00 00 00 ba ba 15  |................|
00000090  0d 00 00 00 01 73 52 47  42 00 ae ce 1c e9 00 00  |.....sRGB.......|

Above are just a few of the titles which use the same structure. The LSDF string is the first 4 bytes and always the last 4 bytes. The next two bytes, 0100, seem consistent for all samples, but the two bytes after that seem to be unique to the software. So far I have found the following titles use the format.

Software Title	Name	Extension	Pattern
WavePad	WavePad Audio Editor Project File	WPP	6C736466 01001A00
Crescendo	Crescendo Score File	CDO	6C736466 01000500
Spin3D	NCH Software model format	3DP	6C736466 01002000
Voxal	Voxal Voices File	VOXAL	6C736466 01000C00
PhotoPad	PhotoPad Project File	PPP	6C736466 01000200
MixPad	MixPad Project	MPDP	6C736466 01000400
Disketch	Disketch Project	DEPROJ	6C736466 01000700
ClickCharts	ClickCharts Diagram	CCD	6C736466 01000A00
DreamPlan	DreamPlan File	DDP	6C736466 01001300
DrawPad	DrawPad File	DRP	6C736466 01001500
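The header layout described above can be read with a few lines of Python. This is only a sketch: the function name `read_lsdf_header` and the lookup table are my own, reading the two fields as little-endian 16-bit integers is an assumption based on the samples, and the code-to-title mapping covers only the files listed in the table.

```python
import struct

LSDF_MAGIC = b"lsdf"

# Codes observed in the samples above, read as little-endian uint16 (assumption).
KNOWN_CODES = {
    0x1A: "WavePad (WPP)",
    0x05: "Crescendo (CDO)",
    0x20: "Spin3D (3DP)",
    0x0C: "Voxal (VOXAL)",
    0x02: "PhotoPad (PPP)",
    0x04: "MixPad (MPDP)",
    0x07: "Disketch (DEPROJ)",
    0x0A: "ClickCharts (CCD)",
    0x13: "DreamPlan (DDP)",
    0x15: "DrawPad (DRP)",
}


def read_lsdf_header(data):
    """Return header details for an LSDF file, or None if the magic is absent."""
    if len(data) < 8 or data[:4] != LSDF_MAGIC:
        return None
    # Bytes 4-5 are 01 00 in every sample; bytes 6-7 vary per title.
    version, code = struct.unpack_from("<HH", data, 4)
    return {
        "version": version,
        "code": code,
        "title": KNOWN_CODES.get(code, "unknown"),
        "has_footer": data.endswith(LSDF_MAGIC),
    }
```

A code that is missing from the table comes back as "unknown" rather than raising, since the list of titles is certainly incomplete.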

Without downloading and installing their vast library of software it’s hard to know all the different titles which use the format. The rest of the file for each sample seems to be proprietary in a binary format, except a few with a PNG image mixed in.
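To check which samples carry an embedded PNG, one option is to scan for the 8-byte PNG signature, which is visible at offset 0x70 in the PhotoPad dump. A minimal sketch, with a function name of my own choosing; it only finds signatures and does not validate that a complete PNG follows.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def find_embedded_pngs(data):
    """Return the byte offsets of every PNG signature found in data."""
    offsets = []
    pos = data.find(PNG_SIGNATURE)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(PNG_SIGNATURE, pos + 1)
    return offsets
```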

The simplest sample I could find was a preset file for the Zulu DJ Software which uses the ECF extension. The ECF extension is common with a few of the titles, like effect chains for WavePad and MixPad.

hexdump -C Zulu/Untitled.ecf
00000000  6c 73 64 66 01 00 0c 00  00 00 01 00 00 00 00 00  |lsdf............|
00000010  6b 00 00 00 00 00 00 00  00 10 00 00 1a 00 00 00  |k...............|
00000020  00 00 00 00 00 00 01 00  01 00 00 00 00 00 00 00  |................|
00000030  00 01 00 01 00 01 00 00  00 00 00 00 00 00 00 30  |...............0|
00000040  02 00 04 00 00 00 00 00  00 00 01 00 00 00 00 20  |............... |
00000050  00 00 29 00 00 00 00 00  00 00 00 10 00 00 0d 00  |..).............|
00000060  00 00 00 00 00 00 00 20  01 00 01 00 00 00 00 00  |....... ........|
00000070  00 00 00 00 20 02 00 04  00 00 00 00 00 00 00 00  |.... ...........|
00000080  00 00 01 6c 73 64 66                              |...lsdf|

This header is identical to the header for the VOXAL format, so I am not sure if the second set of 4 bytes is directly connected to the software title, or if their purpose is something else.

The question that needs to be answered is how we might represent these formats in PRONOM if needed. We could create a unique signature for each title based on the magic header and footer and the second set of 4 bytes which may indicate the software. Or create a single generic signature to identify the basic format using the magic header and footer and adding all the extensions to the list, which would be lengthy. This would be the easiest and catch all formats related to NCH Software using this file format, but then additional characterization would need to happen to identify the specific software title needed to render the file.

The NCH Software company seems to churn out new software and versions quite frequently and a search for reviews of their software turns up some questionable results. Many might enjoy their software as they are easy to use and are free for home use. I had lots of trouble with a few of them as they wanted to mount network locations and disk images I had used recently, which seems sketchy. I would love to know if anyone uses their software and has any need to preserve these formats. I currently don’t, but found the common use of a file format intriguing. I also found no reference to the magic bytes they use, except for a few TrID entries. Marco always is a step ahead!

KODAK TIFF

November 29, 2024 by Thor

Years ago I bought my first digital camera. It was an Epson PhotoPC 3100z and I bought it because it could capture a digital image directly to a TIFF file. I don’t think most people would care about such a feature, but I thought it was awesome. Granted, it filled up the small 32MB compact flash card pretty quickly, so I had to upgrade to a 512MB card, which set me back.

TIFF images are pretty universal; they have a well-known structure and have been around for a very long time. I have written about TIFFs before, so I won’t go into too much detail about the format. The format is well respected in the preservation community, although one of the best websites documenting the various TIFF tags, Aware Systems, has gone dark this year; here is an archived version.

Many digital cameras, from the beginning to now, use the TIFF format to store RAW sensor data. Most use their own extension and follow well-established methods for storing the sensor data in an IFD with lots of common and custom tags. The DNG format is an open RAW format which uses the TIFF format to store sensor data, although many use SubIFDs and can be incompatible with some software.

The first Digital Camera was invented by a Kodak employee, Steve Sasson in 1975, well, he was the first to use a CCD sensor in a self contained unit. This led Kodak to push the technology forward and in 1991 released the Kodak DCS digital system which used Nikon cameras equipped with a digital sensor. These early digital cameras were quite expensive, they used early CF cards and SCSI connections. Kodak released a few models of the DCS series, first on Nikon bodies, then on some Canon bodies. These early cameras used the TIFF format to store the RAW sensor data. For some reason, they decided to use a proprietary method and compression while still using the TIF extension.

Kodak was responsible for many new image file formats. Not sure why they decided to use a common format like TIFF and still use the TIF extension, but make it proprietary. The RAW files created by the DCS series of cameras had to be opened with special plugins or software. If you tried to open the TIFFs with anything else, you would only see the small thumbnail image located at IFD0 instead of the full-size image hidden in a SubIFD1.
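To see where a generic viewer stops, it helps to list what IFD0 points to. The sketch below reads the TIFF header, walks the IFD0 entries, and returns any offsets stored under the standard SubIFDs tag (330). That the Kodak DCS files expose their full-size image through this particular tag is an assumption on my part, and the function name is hypothetical.

```python
import struct

SUBIFD_TAG = 330  # 0x014A, the standard SubIFDs tag


def list_subifd_offsets(data):
    """Return the SubIFD offsets referenced from IFD0 of a TIFF, if any."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None or struct.unpack_from(order + "H", data, 2)[0] != 42:
        raise ValueError("not a TIFF header")
    (ifd0,) = struct.unpack_from(order + "I", data, 4)
    (count,) = struct.unpack_from(order + "H", data, ifd0)
    for i in range(count):
        # Each IFD entry is 12 bytes: tag, type, value count, value or offset.
        tag, _type, n, value = struct.unpack_from(
            order + "HHII", data, ifd0 + 2 + 12 * i
        )
        if tag == SUBIFD_TAG:
            if n == 1:
                return [value]  # a single offset is stored inline
            # Several offsets are stored as an array the entry points to.
            return list(struct.unpack_from(f"{order}{n}I", data, value))
    return []
```

An empty list means IFD0 carries no SubIFDs tag, in which case the larger image would have to be located some other way.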

Finding samples of this format is particularly hard as they have the common TIF extension. The cameras are also pretty rare and finding one is difficult, especially in working condition. I was only aware of a couple of samples on the rawsamples.ch site, but that wasn’t enough to understand the format as the two files had different structures.

hexdump -C RAW_KODAK_DCS460D_FILEVERSION_3.TIF | head
00000000  49 49 2a 00 00 03 00 00  7c 01 00 00 00 00 00 00  |II*.....|.......|
00000010  4b 4f 44 41 4b 20 20 20  20 20 20 20 20 20 20 20  |KODAK           |
00000020  44 43 53 34 36 30 44 20  20 20 20 20 20 20 20 20  |DCS460D         |
00000030  46 49 4c 45 20 56 45 52  53 49 4f 4e 20 33 20 20  |FILE VERSION 3  |
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000050  30 35 31 39 39 38 20 20  20 20 20 20 20 20 20 20  |051998          |
00000060  34 36 30 2d 32 39 35 30  00 00 00 00 00 00 00 00  |460-2950........|
00000070  31 39 39 30 3a 30 31 3a  30 31 20 31 32 3a 30 32  |1990:01:01 12:02|
00000080  3a 30 37 00 5b 20 32 5d  0d 49 53 4f 3a 20 20 20  |:07.[ 2].ISO:   |
00000090  20 20 20 20 20 38 30 20  20 0d 41 70 65 72 74 75  |     80  .Apertu|

hexdump -C RAW_KODAK_DCS560C.TIF | head
00000000  4d 4d 00 2a 00 00 11 76  00 04 f7 50 00 00 00 00  |MM.*...v...P....|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000040  54 68 69 73 20 69 6d 61  67 65 20 66 69 6c 65 20  |This image file |
00000050  77 61 73 20 63 72 65 61  74 65 64 20 62 79 20 61  |was created by a|
00000060  20 4b 6f 64 61 6b 20 44  43 53 35 36 30 43 20 64  | Kodak DCS560C d|
00000070  69 67 69 74 61 6c 20 63  61 6d 65 72 61 2e 20 28  |igital camera. (|
00000080  6e 75 6c 6c 29 20 20 00  00 00 00 00 00 00 00 00  |null)  .........|
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

There is/was a website called https://raw.pixls.us/, but it has been offline since last June, the regular site still works, but the raw sub-domain is unreachable. Luckily the wayback machine had archived a few samples.

I also found a reference on an older website referring to a sample set maintained by Kodak for developers using the SDK, but also no longer available. You can find the old website also on the wayback machine.

With a few more samples to refer to, it is easier to understand the headers and put together a signature. There was an SDK, which seems to be difficult to locate today, but the manual does give us a little more info on the different models and their format.

So from the SDK statement, the samples I have in TIF, and others I have in the more recent DCR format, I can conclude the custom TIF format was used with the DCS 3xx, 4xx, 5xx, 6xx models and from 7xx on the DCR format was used as the camera RAW. Looking closer at the samples in TIF, we can see all the 4xx models used the “FILE VERSION 3” version of the format, while the others have the full statement in the header. Not 100% clear on which format came first, but the 4xx models are some of the earliest models.

At the time, there was only Kodak software that could properly “develop” the RAW file taken by these camera models. Today that has changed and the format has been added to many open source libraries such as libraw and rawspeed. Many other commercial products also claim to support the DCS models including Adobe Camera Raw, which seems to be able to open these TIF’s.

Distinguishing these RAW TIFFs is important to properly manage them over the long term. These images currently identify in the PRONOM repository as regular TIFFs, fmt/353, so we would need to create a signature which identifies the standard TIFF header, but also uses bytes unique to this format. In the few samples I have, the “VERSION 3” images all start with the little-endian header, “49492A00”, while the other samples start with the big-endian header, “4D4D002A”. That makes it a little easier for each signature.

For the “VERSION 3” format we could use a pattern such as 49492A00{12}4B4F44414B{11}(444353|454F53444353). This looks for the TIFF header, skips 12 bytes, looks for the word “KODAK”, skips 11 more bytes, and then looks for either “DCS” or “EOSDCS” right before the camera model number.

For the other format we also look for the TIFF header, but then find the whole string used in all the samples. 4D4D002A{60}5468697320696D6167652066696C652077617320637265617465642062792061204B6F64616B20444353{5}6469676974616C2063616D6572612E

This looks for the big-endian header, then the string, “This image file was created by a Kodak DCS”, skipping the model number, then the end of the string, “digital camera.” This should catch all the different models of this format.
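Both patterns can be tried out before they go anywhere near PRONOM by translating them into Python byte regexes. This is a sketch for testing against samples, not the signature file itself; the function name and the two labels it returns are my own.

```python
import re

# "FILE VERSION 3" files: little-endian TIFF header, 12 bytes skipped,
# "KODAK", 11 bytes skipped, then "DCS" or "EOSDCS".
VERSION3 = re.compile(
    re.escape(b"II*\x00") + rb".{12}KODAK.{11}(?:DCS|EOSDCS)",
    re.DOTALL,
)

# Other files: big-endian TIFF header, 60 bytes skipped, then the fixed
# sentence with the 5 bytes after "DCS" (model number and space) wildcarded.
LATER = re.compile(
    re.escape(b"MM\x00*")
    + rb".{60}This image file was created by a Kodak DCS.{5}digital camera\.",
    re.DOTALL,
)


def classify_kodak_tiff(data):
    """Return a label for a Kodak DCS RAW TIFF, or None for anything else."""
    # match() anchors at offset 0, like a beginning-of-file signature.
    if VERSION3.match(data):
        return "Kodak DCS TIFF (FILE VERSION 3)"
    if LATER.match(data):
        return "Kodak DCS TIFF (full statement header)"
    return None
```

The TIFF magic is passed through `re.escape` because the byte 0x2A is an asterisk, which would otherwise be read as a regex quantifier.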

You can find my proposed signature on my GitHub page. Since none of the samples belong to me, you can find them through some of the links above.

RealVideo

November 15, 2024 by Thor

For #WDPD24 and PRONOM Hackathon week this year, I wanted to find some older formats listed which did not have a signature. There is a list to choose from, but I wanted to find something I hadn’t worked on before. I came across two entries for Real Video:

PUID	Name	Extension
fmt/204	RealVideo Clip	rv
x-fmt/277	Real Video	rv

I was familiar with Real Media and Real Audio, but had yet to come across any RealVideo with the RV extension. I thought it would be easy to find some references and samples, but that was not the case. I assume PRONOM originally added these based on MIME types available.

Real or RealNetworks is/was an Internet media company who jumped on the rapidly growing World Wide Web in 1995 to become a leader in Internet Media Delivery. Their initial offerings mainly focused on audio streaming and they accomplished all of this by providing free players and web browser extensions to make it easy to serve up a website with streaming media everyone could enjoy. They later added video streaming optimized for the slower dial-up connections of the day. They used codecs based on common technology like H.263 and H.264, but used them to make their own proprietary codecs identified through FourCC codes, RV10-RV60.

I thought it would be easy to find a reference to the RV extension, but I quickly discovered it wasn’t. Looking at the Wikipedia page on RealVideo, I found no reference to the RV extension. RV is an abbreviation for RealVideo, right? Well, I ended up finding a reference in the RealAudio page under file extensions. OK, first clue to the existence of the RV extension. The page references RV as being used for video-only files by the flagship encoder (RealProducer).

RealProducer was the tool for creating the streaming audio and video formats that could then be used for your website or streaming platform. The RealProducer software came in a Basic version, which was free, and the Plus or Pro version, which was not free and provided more options. The first version of RealProducer to make video files was version 4. I was able to find a copy of the encoder and installed it under a Windows 95 emulator. To my surprise it only saved to the RealMedia RM file format. This format is well known and identified with PRONOM as x-fmt/190 also documented at the LoC.

This was the same with RealProducer 5, 7, 8, 9, and 10 that I was able to try. All made no mention of the RV extension. I was starting to feel this format didn’t exist or that some had decided to use the RV extension on their own. Searches on Google yielded a couple of results, mostly from users who had found a few files on their older discs and wanted to migrate them to something newer. I was able to find one example that a user had shared, but it had the same header as the RealMedia format. The clue was in the file.

hexdump -C ambush_abb.rv
00000000  2e 52 4d 46 00 00 00 12  00 01 00 00 00 00 00 00  |.RMF............|
00000010  00 07 50 52 4f 50 00 00  00 32 00 00 00 03 6e e8  |..PROP...2....n.|
00000020  00 03 6e e8 00 00 03 e0  00 00 01 b3 00 00 6a 6f  |..n...........jo|
00000030  00 06 80 fa 00 00 08 b5  00 ba 41 73 00 00 03 55  |..........As...U|
00000040  00 03 00 09 43 4f 4e 54  00 00 00 40 00 00 00 00  |....CONT...@....|
00000050  00 00 00 08 28 43 29 20  32 30 30 35 00 26 00 00  |....(C) 2005.&..|
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000270  00 09 61 75 64 69 6f 4d  6f 64 65 00 00 00 02 00  |..audioMode.....|
00000280  06 76 6f 69 63 65 00 00  00 00 2d 00 00 0d 43 72  |.voice....-...Cr|
00000290  65 61 74 69 6f 6e 20 44  61 74 65 00 00 00 02 00  |eation Date.....|
000002a0  13 39 2f 32 30 2f 32 30  30 36 20 31 34 3a 30 37  |.9/20/2006 14:07|
000002b0  3a 30 38 00 00 00 00 53  00 00 0c 47 65 6e 65 72  |:08....S...Gener|
000002c0  61 74 65 64 20 42 79 00  00 00 02 00 3a 52 65 61  |ated By.....:Rea|
000002d0  6c 50 72 6f 64 75 63 65  72 28 52 29 20 42 61 73  |lProducer(R) Bas|
000002e0  69 63 20 31 31 2e 30 20  66 6f 72 20 57 69 6e 64  |ic 11.0 for Wind|
000002f0  6f 77 73 2c 20 42 75 69  6c 64 20 31 31 2e 30 2e  |ows, Build 11.0.|
00000300  30 2e 32 30 30 39 00 00  00 00 31 00 00 11 4d 6f  |0.2009....1...Mo|
00000310  64 69 66 69 63 61 74 69  6f 6e 20 44 61 74 65 00  |dification Date.|
00000320  00 00 02 00 13 39 2f 32  30 2f 32 30 30 36 20 31  |.....9/20/2006 1|
00000330  34 3a 30 37 3a 30 38 00  00 00 00 1d 00 00 09 76  |4:07:08........v|
00000340  69 64 65 6f 4d 6f 64 65  00 00 00 02 00 07 6e 6f  |ideoMode......no|
00000350  72 6d 61 6c 00 44 41 54  41 00 ba 3e 1e 00 00 00  |rmal.DATA..>....|

RealProducer Basic 11 for Windows. The Wikipedia article did hint at this by saying “the latest version of RealProducer reverted to using .ra for audio only files and began using .rv for video files with or without audio.” Why would they use the RM extension for so long, then revert to a different extension with a later version? I found more in the User Manual for version 11.

• .rv – RealVideo
    RealProducer uses the .rv file extension if the input is video-only or video-with-audio. You can also select the .rm file extension for video content.
    Tip: Using the .rv file extension helps search engines identify the file as a RealVideo clip.

• .rm – RealAudio or RealVideo
    RealProducer chooses the .rm file extension if it cannot determine the content of the input clip. You can use .rm file extension for any RealAudio or RealVideo clip, except for variable bit-rate clips.

Ok, so a few things to learn from this. One is the RV extension was used as the default for version 11 as they wanted search engines to identify them as a RealVideo clip. Second thing we learned is there is no difference between the two placeholders in PRONOM, one being a RealVideo file and the other being a RealVideo Clip. We don’t need both.

Now, is there any difference between an RV and RM file?

hexdump -C Producer11-01.rv | head
00000000  2e 52 4d 46 00 00 00 12  00 01 00 00 00 00 00 00  |.RMF............|
00000010  00 07 50 52 4f 50 00 00  00 32 00 00 00 03 6e e8  |..PROP...2....n.|
00000020  00 03 6e e8 00 00 03 e0  00 00 01 c7 00 00 01 66  |..n............f|
00000030  00 00 1b 57 00 00 07 41  00 02 91 0a 00 00 03 5e  |...W...A.......^|
00000040  00 03 00 09 43 4f 4e 54  00 00 00 40 00 00 00 00  |....CONT...@....|
00000050  00 00 00 08 28 43 29 20  32 30 30 35 00 26 00 00  |....(C) 2005.&..|
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000080  00 00 00 00 4d 44 50 52  00 00 00 70 00 00 00 00  |....MDPR...p....|
00000090  00 02 c2 a4 00 02 c2 a4  00 00 03 e0 00 00 01 9f  |................|

hexdump -C Producer11-01.rm | head
00000000  2e 52 4d 46 00 00 00 12  00 01 00 00 00 00 00 00  |.RMF............|
00000010  00 07 50 52 4f 50 00 00  00 32 00 00 00 03 6e e8  |..PROP...2....n.|
00000020  00 03 6e e8 00 00 03 e0  00 00 01 a4 00 00 01 64  |..n............d|
00000030  00 00 1b 57 00 00 05 a4  00 02 5c 35 00 00 03 5e  |...W......\5...^|
00000040  00 03 00 09 43 4f 4e 54  00 00 00 40 00 00 00 00  |....CONT...@....|
00000050  00 00 00 08 28 43 29 20  32 30 30 35 00 26 00 00  |....(C) 2005.&..|
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000080  00 00 00 00 4d 44 50 52  00 00 00 70 00 00 00 00  |....MDPR...p....|
00000090  00 02 c2 a4 00 02 c2 a4  00 00 03 e0 00 00 01 a4  |................|

They both look very similar to me. Aside from a few bytes, they are practically identical. Let’s see what MediaInfo has to say.

mediainfo Producer11-01.rv
General
Complete name                            : Producer11-01.rv
Format                                   : RealMedia
File size                                : 164 KiB
Duration                                 : 6 s 999 ms
Overall bit rate                         : 225 kb/s
Frame rate                               : 24.000 FPS
Copyright                                : (C) 2005
FileExtension_Invalid                    : rm rmvb ra

Video
ID                                       : 0
Format                                   : RealVideo 4
Codec ID                                 : RV40
Codec ID/Info                            : Based on AVC (H.264), Real Player 9
Duration                                 : 6 s 999 ms
Bit rate                                 : 181 kb/s
Width                                    : 640 pixels
Height                                   : 424 pixels
Display aspect ratio                     : 3:2
Frame rate                               : 24.000 FPS
Bits/(Pixel*Frame)                       : 0.028
Stream size                              : 155 KiB (94%)

Audio
ID                                       : 1
Format                                   : Cooker
Codec ID                                 : cook
Codec ID/Info                            : Based on G.722.1, Real Player 6
Duration                                 : 7 s 429 ms
Bit rate                                 : 44.1 kb/s
Channel(s)                               : 2 channels
Sampling rate                            : 44.1 kHz
Bit depth                                : 16 bits
Stream size                              : 40.0 KiB (24%)

mediainfo Producer11-01.rm
General
Complete name                            : Producer11-01.rm
Format                                   : RealMedia
File size                                : 151 KiB
Duration                                 : 6 s 999 ms
Overall bit rate                         : 225 kb/s
Frame rate                               : 24.000 FPS
Copyright                                : (C) 2005

Video
ID                                       : 0
Format                                   : RealVideo 4
Codec ID                                 : RV40
Codec ID/Info                            : Based on AVC (H.264), Real Player 9
Duration                                 : 6 s 999 ms
Bit rate                                 : 181 kb/s
Width                                    : 640 pixels
Height                                   : 424 pixels
Display aspect ratio                     : 3:2
Frame rate                               : 24.000 FPS
Bits/(Pixel*Frame)                       : 0.028
Stream size                              : 155 KiB

Audio
ID                                       : 1
Format                                   : Cooker
Codec ID                                 : cook
Codec ID/Info                            : Based on G.722.1, Real Player 6
Bit rate                                 : 44.1 kb/s
Channel(s)                               : 2 channels
Sampling rate                            : 44.1 kHz
Bit depth                                : 16 bits

Other than the RV file having an invalid file extension, they both identify as a RealMedia file and have identical properties. So it seems the RV file is really no different than the RM file. I think the best course of action for PRONOM is to deprecate these two RV PUIDs and just add RV as an acceptable extension for the RealMedia format.
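The same comparison can be made at the container level. In the dumps above each chunk appears to start with a four-character tag followed by a big-endian 32-bit size that includes the tag and size fields: .RMF is 0x12 bytes long, PROP starts at 0x12 and is 0x32 long, and CONT follows at 0x44. Taking that layout as an assumption drawn from the samples, a short sketch (function names are mine) can list the chunks and compare two files:

```python
import struct


def walk_rm_chunks(data):
    """Return (tag, size) pairs for each top-level chunk found in data."""
    chunks = []
    pos = 0
    while pos + 8 <= len(data):
        tag = data[pos:pos + 4]
        (size,) = struct.unpack_from(">I", data, pos + 4)
        if size < 8:
            break  # too small to hold its own tag and size; stop walking
        chunks.append((tag.decode("latin-1"), size))
        pos += size
    return chunks


def same_structure(a, b):
    """True when two files carry the same sequence of chunk tags."""
    tags_a = [tag for tag, _ in walk_rm_chunks(a)]
    tags_b = [tag for tag, _ in walk_rm_chunks(b)]
    return tags_a == tags_b
```

Only the tag order is compared, since the sizes will differ whenever the encoded streams differ.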

To add to the evidence, here is the output from ffprobe:

Input #0, rm, from 'Producer11-01.rm':
  Metadata:
    copyright       : (C) 2005
    comment         : 
    ASMRuleBook     : #($Bandwidth >= 0),Stream1Bandwidth = 44100, Stream0Bandwidth = 180900;
    Audiences       : 256k DSL or Cable;
    audioMode       : music
    Creation Date   : 11/12/2024 20:28:55
    Generated By    : RealProducer(R) Plus 11.1 for Windows, Build 11.1.0.2676
    Modification Date: 11/12/2024 20:28:55
    videoMode       : normal
  Duration: 00:00:07.00, start: 0.000000, bitrate: 176 kb/s
  Stream #0:0: Video: rv40 (RV40 / 0x30345652), yuv420p, 640x424, 180 kb/s, 24 fps, 24 tbr, 1k tbn
  Stream #0:1: Audio: cook (cook / 0x6B6F6F63), 44100 Hz, stereo, fltp, 44 kb/s

Input #0, rm, from 'Producer11-01.rv':
  Metadata:
    copyright       : (C) 2005
    comment         : 
    ASMRuleBook     : #($Bandwidth >= 0),Stream1Bandwidth = 44100, Stream0Bandwidth = 180900;
    Audiences       : 256k DSL or Cable;
    audioMode       : music
    Creation Date   : 11/12/2024 20:28:16
    Generated By    : RealProducer(R) Plus 11.1 for Windows, Build 11.1.0.2676
    Modification Date: 11/12/2024 20:28:16
    videoMode       : normal
  Duration: 00:00:07.43, start: 0.000000, bitrate: 181 kb/s
  Stream #0:0: Video: rv40 (RV40 / 0x30345652), yuv420p, 640x424, 180 kb/s, 24 fps, 24 tbr, 1k tbn
  Stream #0:1: Audio: cook (cook / 0x6B6F6F63), 44100 Hz, stereo, fltp, 44 kb/s

But wait, there are a couple formats we could add which are related to RealProducer. RealProducer used a few other formats to manage projects and other metadata for streaming. They include:

.RP RealPix Image
.RT RealText
.RPAD RealProducer Audience File
.RPJF RealProducer Job File
.RPSD RealProducer Server Destination
.RMHD RealMediaHD file
.RAM Playlist
.RPM Embedded RAM

File Type	Extension	MIME Type
Ram	.ram	audio/x-pn-realaudio
Embedded Ram	.rpm	audio/x-pn-realaudio-plugin
SMIL	.smil and .smi	application/smil
RealAudio	.ra	audio/x-pn-realaudio
RealVideo	.rm	application/x-pn-realmedia
Flash	.swf	application/x-shockwave-flash
RealPix	.rp	image/vnd.rn-realpix
RealText	.rt	text/vnd.rn-realtext

https://web.archive.org/web/20120513203726/http://service.real.com/help/library/guides/production8/htmfiles/server.htm

Don’t get excited, the RealPix Image format really isn’t an image, it is simply an XML file with all the details of an image or group of images. Pretty boring. It was however a big thing in the day, even got a full guide written up for the process. “All information in the file occurs between an opening <imfl> tag and a closing </imfl> tag. This is the only tag that uses an end tag.” This format was the topic of discussion as malicious code could be in the RP file and executed just by having someone load your webpage. IMFL is obviously an acronym, but none of the documents I could find tells me what it stands for, so I did what everyone does now, I asked ChatGPT.

The RealPix format by RealNetworks, which was used for interactive multimedia content, indeed utilized IMFL as its tagged format. IMFL stands for “Interleaved Media File Language.” This markup was particularly designed to handle multimedia presentations, allowing the synchronization of images, audio, and video in a slideshow-style format. It used XML-like syntax where elements like <imfl>, <head>, and <fadein/> defined media objects, transitions, and their timing. Key components included attributes for positioning, color, and animation effects, making RealPix a flexible format for creating multimedia sequences compatible with RealPlayer.

For technical details, the RealPix format closely resembles SMIL (Synchronized Multimedia Integration Language) and supports strict tag closure and case sensitivity. This means all tags and attribute names must be lowercase, and attributes must be in double quotes, as seen in SMIL and RealSystem G2 markup, RealNetworks’ broader multimedia framework.

When I asked for a source, it could not give me one. So not sure if it is the correct answer, but it seems to fit. Here are some samples of RP, RT and SMIL files.

For RealText with the RT extension, we find a similar tagged text. This format is used to provide text presentations to go along with Images, Audio, or Video. The tagged text then describes when and how the text is displayed. This is all done in a player window, therefore the root tag of these RT documents starts and ends with <window>. I guess these could be considered a subtitle format for streaming media.

The SMIL file is interesting. It is a known standard, but in many cases it does not have an XML declaration and is therefore not identified by the current PRONOM signature. SMIL files are used to link everything together. I might suggest a variant of the SMIL signature without the XML declaration so these files are identified correctly.

<smil>
 <body>
  <par>
   <textstream src="rtsp://realserver.company.com/mary.rt"/>
   <video src="rtsp://realserver.company.com/mary.rm"/>
  </par>
 </body>
</smil>

The .RPAD RealProducer Audience File, .RPJF RealProducer Job File, and .RPSD RealProducer Server Destination are all XML files for managing some of the configuration found in the RealProducer software.

cat 56k\ Dial-up.rpad
<?xml version="1.0" encoding="UTF-8"?>
<audience xmlns="http://ns.real.com/tools/audience.2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  xsi:schemaLocation="http://ns.real.com/tools/audience.2.0 http://ns.real.com/tools/audience.2.0.xsd">
  <avgBitrate type="uint">34000</avgBitrate>
  <maxBitrate type="uint">68000</maxBitrate>
  <streams>

cat RealProducer11-01.rpjf
<?xml version="1.0" encoding="UTF-8"?>
<job xmlns="http://ns.real.com/tools/job.2.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://ns.real.com/tools/job.2.0 http://ns.real.com/tools/job.2.0.xsd">
  <enableTwoPass type="bool">true</enableTwoPass>
  <clipInfo>

cat Multicast\ Push\ Server.rpsd
<?xml version="1.0" encoding="UTF-8"?>
<destination xsi:type="pushServer" xmlns="http://ns.real.com/tools/server.2.0"   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  xsi:schemaLocation="http://ns.real.com/tools/server.2.0 http://ns.real.com/tools/server.2.0.xsd">
  <pluginName type="string">rn-server-rbs</pluginName>

Those three formats should be easy enough to identify, especially if we look for the namespace URLs.
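Here is a minimal Python sketch of the namespace idea. It assumes the three namespace URIs seen in the samples above stay the same across RealProducer versions, which has not been checked here. The function name, the labels, and the 1024-byte window are made up for illustration.

```python
# Default-namespace URIs copied from the three samples above.
# Assumption: they are stable across RealProducer versions.
NAMESPACES = {
    b"http://ns.real.com/tools/audience.2.0": "RealProducer Audience File (RPAD)",
    b"http://ns.real.com/tools/job.2.0": "RealProducer Job File (RPJF)",
    b"http://ns.real.com/tools/server.2.0": "RealProducer Server Destination (RPSD)",
}


def identify_realproducer(data: bytes, window: int = 1024):
    """Return a label if a known xmlns value appears near the start of the file."""
    head = data[:window]
    for uri, label in NAMESPACES.items():
        # Match the default namespace attribute, not the schemaLocation value.
        if b'xmlns="' + uri + b'"' in head:
            return label
    return None
```

Matching on the `xmlns` attribute rather than the root tag avoids false hits on unrelated XML that happens to use `<job>` or `<destination>` as a root element.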

The RAM and RPM formats are simply text files with a URL. You can find some samples here and here.

RM and RV files use the same format as the RMVB file; RMVB simply uses a variable bitrate. Later on a new format was used to improve the quality of video. This format has the extension RMHD, referring to RealMedia HD. Let’s take a look.

hexdump -C DSC_0009.rmhd | head
00000000  2e 52 4d 50 00 00 00 12  00 01 00 00 00 00 00 00  |.RMP............|
00000010  00 07 50 52 4f 50 00 00  00 36 00 02 00 04 f7 33  |..PROP...6.....3|
00000020  00 04 f7 33 00 00 11 bd  00 00 02 5d 00 00 01 d2  |...3.......]....|
00000030  00 00 1b 2e 00 00 00 00  00 00 00 00 00 04 65 68  |..............eh|
00000040  00 00 01 6f 00 02 00 03  43 4f 4e 54 00 00 00 12  |...o....CONT....|
00000050  00 00 00 00 00 00 00 00  00 00 4d 44 50 52 00 00  |..........MDPR..|
00000060  00 76 00 00 00 00 00 03  24 64 00 03 24 64 00 00  |.v......$d..$d..|
00000070  11 bd 00 00 04 2a 00 00  00 00 00 00 00 00 00 00  |.....*..........|
00000080  1b 2e 0c 56 69 64 65 6f  20 53 74 72 65 61 6d 14  |...Video Stream.|
00000090  76 69 64 65 6f 2f 78 2d  70 6e 2d 72 65 61 6c 76  |video/x-pn-realv|

The format looks very similar, but has the magic header of .RMP instead of .RMF. MediaInfo and FFProbe are unaware of the format. The software mentions an RV11 codec, which is confusing as the codecs went from RV10 to RV60.
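As a small illustration, the two headers can be told apart from the first four bytes alone. This is only a sketch: the labels are informal, and it assumes every RMHD file starts with .RMP the way this one sample does.

```python
# First four bytes of the file mapped to an informal label.
# Assumption: all RMHD files share the .RMP header seen in the sample above.
REALMEDIA_MAGIC = {
    b".RMF": "RealMedia (RM, RV, RMVB)",
    b".RMP": "RealMedia HD (RMHD)",
}


def realmedia_variant(header: bytes):
    """Return a label for the RealMedia family member, or None if unknown."""
    return REALMEDIA_MAGIC.get(header[:4])
```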

Phew, that was a lot considering the two formats I tried to research came up the same as an existing format. There are probably others I have missed. I did see a reference to an RMX format which seems to be an encrypted RM file. The header is the same so it will identify as a RealMedia file, but with the wrong extension. Let me know if you come across any. I have some samples of the formats mentioned here, plus a proposal of new signatures on my GitHub repository.

PAR

November 8, 2024 by Thor

Some file formats have a unique extension. Some use three-character extensions which are so well known that it’s uncommon for other software to reuse them. Take the extension PDF for example: pretty sure no one else will use it. Other extensions get reused by many different software titles; there are plenty of titles which use the DOC extension.

Part of defining a file format I come across is also defining the other formats which use the same extension or the same basic patterns. I want the format I am researching to be identified correctly, but I also don’t want other formats to be falsely identified as it.

When using the DROID tool, if a file can’t be identified using a signature, the tool will then check whether the extension matches any format within the PRONOM registry. If it finds one, it reports that format with the identification method “Extension”. This can be confusing and dangerous.
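The fallback can be modelled in a few lines of Python. This is a toy model, not DROID’s actual code: real PRONOM signatures use offsets and byte-sequence patterns rather than a simple prefix, and the `signatures` and `extensions` arguments are hypothetical stand-ins for the registry.

```python
from pathlib import PurePath


def identify(path, data, signatures, extensions):
    """Toy model of signature-first, extension-second identification.

    signatures: list of (leading bytes, format name) pairs (hypothetical).
    extensions: dict mapping a lowercase extension to format names (hypothetical).
    Returns (list of format names, identification method).
    """
    hits = [name for magic, name in signatures if data.startswith(magic)]
    if hits:
        return hits, "Signature"
    # No signature matched, so fall back to whatever the extension suggests.
    ext = PurePath(path).suffix.lower().lstrip(".")
    hits = list(extensions.get(ext, []))
    if hits:
        return hits, "Extension"
    return [], None
```

With a registry holding only a PAR2 signature and an extension entry for `par`, any unrelated file that happens to be named `something.PAR` comes back as a parity file by “Extension”, which is exactly the confusing result described above.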

The topic of a format came up recently in reference to the extension PAR. Let’s take a look at what we know about files with the extension PAR. Using the handy tool at digipres.org, we can see there are many formats using the PAR extension.

Apparently many people like to use the extension with their software. One might think their files with the PAR extension have to be in this list, and they would be wrong in that assumption. The PRONOM registry has no records of any format using the PAR extension. Hopefully we can add a few to help with proper identification instead of using the extension only.

A PArchive or Parity Volume Set is a group of file formats used in error correction and data integrity. Only the first version used the PAR extension; it is now obsolete, with version 2 being the last stable version.

hexdump -C archive.par | head
00000000  50 41 52 00 00 00 00 00  00 00 01 00 00 09 00 02  |PAR.............|
00000010  8f d0 ce 2e 21 db 3b e5  41 d5 18 be d3 0e 52 f0  |....!.;.A.....R.|
00000020  de b6 b3 9f 53 09 ff ba  16 6b ca d2 48 a6 ca 45  |....S....k..H..E|
00000030  00 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00  |................|
00000040  60 00 00 00 00 00 00 00  4e 00 00 00 00 00 00 00  |`.......N.......|
00000050  ae 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060  4e 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00  |N...............|
00000070  45 16 01 00 00 00 00 00  76 da 44 2b 43 5f b5 bd  |E.......v.D+C_..|
00000080  08 7b d2 b0 2e 16 7d 86  46 75 7b 79 f0 36 75 3b  |.{....}.Fu{y.6u;|
00000090  a1 14 22 f3 0c 77 85 3c  70 00 61 00 72 00 2d 00  |.."..w.<p.a.r.-.|

hexdump -C Testing.docx.par2 | head
00000000  50 41 52 32 00 50 4b 54  84 00 00 00 00 00 00 00  |PAR2.PKT........|
00000010  76 1f e0 a4 5a 32 e0 84  d9 e9 32 32 06 9f 03 ff  |v...Z2....22....|
00000020  71 48 73 d5 59 c6 ae 7c  c7 21 3d ba 8d e5 ea 04  |qHs.Y..|.!=.....|
00000030  50 41 52 20 32 2e 30 00  46 69 6c 65 44 65 73 63  |PAR 2.0.FileDesc|
00000040  5d 74 b5 3d 64 ae 1f d8  ae 41 f1 8c 2f 7a cc c1  |]t.=d....A../z..|
00000050  27 9b bc 61 46 21 4d 37  a3 c7 f2 07 b4 b8 df 81  |'..aF!M7........|

Pretty straightforward. The only thing that would have made it easier is if the first version used “PAR1”, but be glad they didn’t as that signature is used by another!

hexdump -C null_list.parquet | head
00000000  50 41 52 31 15 00 15 18  15 18 2c 15 02 15 00 15  |PAR1......,.....|
00000010  06 15 06 00 00 02 00 00  00 02 00 02 00 00 00 02  |................|
00000020  01 26 42 1c 15 02 19 25  00 06 19 38 09 65 6d 70  |.&B....%...8.emp|
00000030  74 79 6c 69 73 74 04 6c  69 73 74 04 69 74 65 6d  |tylist.list.item|
00000040  15 00 16 02 16 3a 16 3a  26 08 3c 36 02 00 00 00  |.....:.:&.<6....|
00000050  15 02 19 4c 48 0c 61 72  72 6f 77 5f 73 63 68 65  |...LH.arrow_sche|
00000060  6d 61 15 02 00 35 02 18  09 65 6d 70 74 79 6c 69  |ma...5...emptyli|
00000070  73 74 15 02 15 06 4c 3c  00 00 00 35 04 18 04 6c  |st....L<...5...l|
00000080  69 73 74 15 02 00 15 02  25 02 18 04 69 74 65 6d  |ist.....%...item|
00000090  6c bc 00 00 00 16 02 19  1c 19 1c 26 42 1c 15 02  |l..........&B...|

Apache Parquet is a more modern format used to store column-oriented data. At least they used a unique file extension!

Another common bit of software which uses the PAR extension is Solid Edge by Siemens. They use the PAR extension for their 3D part format. For some reason this format still uses the OLE compound object container.

7z l tinyscrew.par 

Path = tinyscrew.par
Type = Compound
Physical Size = 86528
Extension = compound
Cluster Size = 512
Sector Size = 64

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
                    .....        31964        32256  PSMcluster0
                    .....           12           64  Versions
2001-12-19 15:44:14 D....                            Display
2001-12-19 15:44:14 D....                            ACIS
                    .....         8462         8704  ACIS/Solid1.sab
                    .....          238          256  PSMroots
2001-12-19 15:44:14 D....                            Display/Cache0
2001-12-19 15:44:14 D....                            Display/Styles
                    .....         1725         1728  Display/Styles/Library0
                    .....           12           64  Display/Styles/DefaultStyles
                    .....           88          128  Display/Cache0/Info
                    .....         4248         4608  Display/Cache0/L1-T1
                    .....            8           64  JSitesList
2001-12-19 15:44:14 D....                            PARASOLID
                    .....         3389         3392  PARASOLID/STREAM434.D_B
                    .....        10402        10752  PARASOLID/STREAM434.P_B
                    .....            4           64  DocVersion2
                    .....          199          256  PSMclustertable
                    .....            8           64  PSMuserroots
                    .....          512          512  JVisibleData
2001-12-19 15:44:14 D....                            PSMspacemap
                    .....           66          128  PSMspacemap/0x00002000
                    .....         6090         6144  PSMspacemap/0x00000000
                    .....          174          192  PSMspacemap/0x00004000
                    .....         4716         5120  PSMtypetable
                    .....            8           64  FamilyMembers
                    .....            8           64  BuildVersions
                    .....          150          192  PartsLiteData
                    .....          596          640  [5]C3teagxwOttdbfkuIaamtae3Ie
                    .....          476          512  [5]SummaryInformation
                    .....           12           64  PSMsegmenttable
                    .....           96          128  MSConvertedPropertyset
                    .....          148          192  [5]K4teagxwOttdbfkuIaamtae3Ie
                    .....          280          320  [5]DocumentSummaryInformation
                    .....          116          128  [5]SszbwomgY1udb2whAaq5u2jwCg
                    .....          264          320  [5]Rfunnyd1AvtdbfkuIaamtae3Ie
                    .....          140          192  Dynamic Attributes Metadata
                    .....          458          512  Unclustered Dynamic Attributes
------------------- ----- ------------ ------------  ------------------------
2001-12-19 15:44:14              75069        77824  32 files, 6 folders

We will have to use a container signature to correctly identify this format. ASM and DFT are also Solid Edge formats which use the same OLE container. Hopefully there are some unique features we can use to identify them.

One other file format which uses the PAR extension is not listed in any of the registries. Not in PRONOM, TrID, Wikidata, or others. I came across it while researching another format, DVD Studio Pro. On a Macintosh computer running the now discontinued DVD Studio Pro, one could save their DVD mastering project as a “file” which used the DSPPROJ extension. I use the term file loosely here as it wasn’t actually a file; it was a folder with an extension which macOS would interpret as a single file. These are the package formats Apple used and still uses quite frequently. Moving this folder to another system results in a folder of content.

tree sample.dspproj 
/sample.dspproj
└── Contents
    ├── PkgInfo
    └── Resources
        ├── Audio
        ├── MPEG
        ├── Menu
        ├── ModuleDataB
        ├── ObjectDataB
        ├── Openers.plist
        ├── Overlay
        ├── Picture
        ├── Render Data
        │   ├── C4272B0100797459.M2V
        │   └── PAR
        │       └── C4272B0100797459.M2V.par
        ├── Styles
        ├── Temp
        ├── Templates
        └── Thumbnails

14 directories, 6 files

This PAR extension is explained in the DVD Studio Pro manual:

About the Parse Files
To use an asset in a project, DVD Studio Pro needs to know some general information about it, such as its length, type, and integrity. Video assets encoded within DVD Studio Pro can include this information in the encoded files, or can create separate files for it. Assets encoded by Compressor outside of DVD Studio Pro can include this information if you select the “Add DVD Studio Pro meta-data” option in the Extras pane of the Encoder settings.
Assets encoded with other encoders, or with the “Add DVD Studio Pro meta-data” option disabled when using Compressor, must be parsed before DVD Studio Pro can use them. Parsing creates a small file, with the same name as the video asset and a “.par” extension that contains the required information. The parse file can take from several seconds to several minutes to create, depending on the size of the asset file.

hexdump -C E4712E541A60E300.M2V.par | head
00000000  56 50 41 52 00 00 00 20  00 00 00 00 00 01 e2 40  |VPAR... .......@|
00000010  00 00 00 00 00 c6 19 7c  2f 55 73 65 72 73 2f 74  |.......|/Users/t|
00000020  79 6c 65 72 2f 44 6f 63  75 6d 65 6e 74 73 2f 46  |yler/Documents/F|
00000030  69 6e 61 6c 20 52 65 6e  64 65 72 20 66 6f 72 20  |inal Render for |
00000040  44 56 44 20 56 51 42 2f  56 61 72 73 69 74 79 51  |DVD VQB/VarsityQ|
00000050  42 20 44 56 44 2f 56 61  72 73 69 74 79 51 42 2d  |B DVD/VarsityQB-|
00000060  44 69 73 63 32 2e 64 73  70 70 72 6f 6a 2f 43 6f  |Disc2.dspproj/Co|
00000070  6e 74 65 6e 74 73 2f 52  65 73 6f 75 72 63 65 73  |ntents/Resources|
00000080  2f 52 65 6e 64 65 72 20  44 61 74 61 2f 45 34 37  |/Render Data/E47|
00000090  31 32 45 35 34 31 41 36  30 45 33 30 30 2e 4d 32  |12E541A60E300.M2|

Parity, Parts, and Parse files, oh my.
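The leading bytes from the dumps so far are enough to separate most of these with a simple lookup. The sketch below uses the bytes exactly as they appear in the hexdumps above, plus the standard OLE2 compound file header (D0 CF 11 E0 A1 B1 1A E1), which is not shown in this post. An OLE2 match only says “compound file”; telling a Solid Edge part from any other OLE2 document would still need a container-level check. Names and labels are illustrative.

```python
# Leading bytes taken from the hexdumps above. The OLE2 header is an addition
# that is not shown in the post; it identifies the container, not Solid Edge.
PAR_MAGIC = [
    (b"PAR2\x00PKT", "Parity Volume Set, version 2 (PAR2)"),
    (b"PAR1", "Apache Parquet"),
    (b"PAR\x00\x00\x00\x00\x00", "Parity Volume Set, version 1 (PAR)"),
    (b"VPAR", "DVD Studio Pro parse file"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (possibly Solid Edge)"),
]


def classify_par(header: bytes):
    """Return a label for the first matching leading-byte pattern, else None."""
    for magic, label in PAR_MAGIC:
        if header.startswith(magic):
            return label
    return None
```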

If you thought we were done, you would be wrong! Let’s look at yet another PAR format.

hexdump -C MESSROH.PAR | head
00000000  08 69 64 73 32 30 30 30  30 d0 4e 01 51 46 42 00  |.ids20000.N.QFB.|
00000010  98 d0 4e 01 80 01 58 01  b6 b9 f7 bf 82 30 00 00  |..N...X......0..|
00000020  dc 08 00 00 60 51 f2 bf  82 30 01 59 ff ff ff ff  |....`Q...0.Y....|
00000030  a4 d0 4e 01 28 3e f2 bf  78 63 a4 01 dc 08 00 0b  |..N.(>..xc......|
00000040  5a 45 52 4f 2d 4f 46 46  53 45 54 01 18 0e ac 01  |ZERO-OFFSET.....|
00000050  d4 d0 4e 01 00 ac 43 00  18 0e ac 01 d4 d0 4e 01  |..N...C.......N.|
00000060  51 46 42 00 ec d0 4e 01  d4 00 4e 01 b6 b9 f7 bf  |QFB...N...N.....|
00000070  5c 4c 75 81 5c 81 00 00  45 07 41 00 c0 0a 00 01  |\Lu.\...E.A.....|
00000080  cd d0 41 00 d5 d0 41 00  5c 81 00 00 dc 0a a4 01  |..A...A.\.......|
00000090  5b 5d 42 00 cc d0 4e 01  72 5d 42 00 7a 5d 42 00  |[]B...N.r]B.z]B.|

hexdump -C DUMMYDAT.PAR | head 
00000000  08 73 65 69 73 6d 69 63  31 00 00 00 00 00 00 00  |.seismic1.......|
00000010  00 00 00 00 00 01 58 00  00 00 00 00 00 00 00 00  |......X.........|
00000020  00 00 00 00 00 00 00 00  00 00 01 59 00 00 00 00  |...........Y....|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 0a  |................|
00000040  41 4b 55 53 54 49 4b 4c  4f 47 00 00 00 00 00 00  |AKUSTIKLOG......|
00000050  00 00 00 00 02 2f 2f 00  08 41 47 43 2d 47 41 49  |.....//..AGC-GAI|
00000060  4e 00 00 00 00 00 00 00  00 00 00 00 00 32 00 00  |N............2..|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

This PAR format is called “Reflexw data-format”. It is a raw header format that is always paired with a DAT file; together they store geophysical wave data from devices such as GPR. Reflexw is software made by Sandmeier geophysical research.

The PAR file samples I have don’t seem to have a consistent header, as each has a unique set of bytes, but all of them have some similar bytes later in the file at around the 0x1D8 (472) offset:

000001d0  00 00 a0 3d 00 00 a0 41  00 00 00 00 00 00 00 00  |...=...A........|
000001e0  0a d7 23 3c 00 00 80 3f  00 00 00 00 00 00 00 00  |..#<...?........|
000001f0  00 00 00 00 cc cc dc 40  00 00 00 00 00 00 00 00  |.......@........|
00000200  00 00 80 3f 00 00 00 00  00 00 00 00 00 00 00 00  |...?............|
00000210  00 00 00 00 00 00 00 00  17 b7 d1 38 00 00 00 00  |...........8....|

It seems this sequence of bytes is the only consistent one among all my samples. I have no idea what they mean or reference. The specification does indicate some bytes which should lead to proper identification, but the integer used for the “HeaderMarker” is only the 4-byte value “00 00 00 01”, which won’t be enough to cleanly identify the format. I would love to hear what others can see from the spec. You can find some sample files here.
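One possible reading, offered as a guess rather than anything confirmed by the specification: if the non-zero groups in that region are decoded as little-endian 32-bit floats, several come out as round numbers such as 20.0 and 1.0. The Python sketch below decodes the five rows shown above. What the values would represent is unknown.

```python
import struct

# Rows 0x1d0 through 0x210, copied from the dump above.
REGION = bytes.fromhex(
    "0000a03d0000a0410000000000000000"
    "0ad7233c0000803f0000000000000000"
    "00000000ccccdc400000000000000000"
    "0000803f000000000000000000000000"
    "000000000000000017b7d13800000000"
)


def floats_le(region: bytes, base: int = 0x1D0):
    """Return (offset, value) for every non-zero little-endian float32 slot.

    Assumption: the region is an array of 4-byte floats. Nothing in the
    source confirms this; it is only one way to read the bytes.
    """
    found = []
    for i in range(0, len(region) - 3, 4):
        (value,) = struct.unpack_from("<f", region, i)
        if value != 0.0:
            found.append((base + i, value))
    return found
```

If the same decoding gives plausible values across more samples, a signature could anchor on the position of these fields rather than on their exact bytes.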

So we have some Parity files, Parts files, Parse files, Parquet files, and a Header file. I am sure others will be found and added to this lot. Hopefully the PAR files you run across will match one of these patterns! I am still working on a signature proposal. Stay tuned!

Daisy

October 4, 2024 by Thor

A single file can often be self contained, having all that is needed to render itself with the correct software, but more and more often files need other files to function properly. Sometimes these groups of dependent files are within a container, such as a DOCX or ePub, but can also be found all sitting nicely in a folder. I say nicely, partly because the structure works, that is until they are treated as individual files and renamed or moved around breaking that interdependence on each other.

In the case of many Apple bundle files, they appear to be a single file when used on macOS, but a folder on Windows or Linux. This can be very confusing. In other cases, such as the DAISY Digital Talking Book format, it is simply a folder or disc with a few or many files within.

Current tools used to identify file formats, such as DROID, look at individual files, not groups of files to determine format. Each file within a folder may have a unique format, but when grouped with other specific formats they become something more. We will have to work on enhancing current tools if we want to avoid breaking these format types and losing their ability to render properly.

DAISY, or Digital Accessible Information System, is a type of digital book. The format was originally conceived in 1988 as a method to create a talking book, designed to give those who are visually impaired the ability to listen to books. It wasn’t until 1996 that the DAISY Consortium was created in order to take the technology to those who needed it. The original version of the DAISY format in 1994 was proprietary, but once they formed the consortium, they decided to adopt open standards for the format, and in 1998 the DAISY 2.0 standard was released. You can read more on the Library of Congress Format Description page.

Let’s take a look at a folder containing a DAISY 2.0 book.

ls -la "DAISY 2.02 export"
total 536
drwx------  1 tyler  staff   16384 Sep 25 22:06 .
drwx------  1 tyler  staff   16384 Sep 25 22:06 ..
-rwx------@ 1 tyler  staff    1090 Sep 25 22:05 0002.smil
-rwx------  1 tyler  staff  228413 Sep 25 22:05 aud0001.mp3
-rwx------@ 1 tyler  staff     672 Sep 25 22:05 master.smil
-rwx------  1 tyler  staff    1703 Sep 25 22:05 ncc.html

We can see three different formats in this folder: the obvious, well-known MP3 file, an HTML file, and two files with the extension SMIL.

“Synchronized Multimedia Integration Language” or SMIL is a W3C XML standard used to describe multimedia presentations, and it is in its third version. It is used in the DAISY DTB as well as other applications, but we will focus on DAISY. A SMIL file has this structure:

<?xml version="1.0"?>
<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 1.0//EN" "http://www.w3.org/TR/REC-smil/SMIL10.dtd">
<smil>
  <head>
    <meta name="dc:title" content="Obi Project" />
    <meta name="dc:identifier" content="589c550e-303b-4c0d-9921-ae76d782fd53" />
    <meta name="ncc:generator" content="Obi v5.0.0.0 with toolkit: UrakawaSDK.core v2.0.0.0 (http://urakawa.sf.net/obi)" />
    <meta name="dc:format" content="Daisy 2.02" />
    <meta name="ncc:timeInThisSmil" content="00:00:28" />
    <layout>
      <region id="textView" />
    </layout>
  </head>
  <body>
    <ref title="Testing" src="0002.smil" id="ms_0002" />
  </body>
</smil>

A standard XML file with a link to a SMIL DTD and a root tag of <smil>. This format is recognized by PRONOM as fmt/205, although it is often identified as a standard XML file. It seems the signature was created with a small offset which works with some SMIL files, but the allowed gap between the end of the XML declaration and the start of the <smil> tag is only 20-86 bytes, not enough to allow for different character sets and full DTD URLs. We will have to increase this gap in order to get all the SMIL files identified correctly.
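To see how a file compares against that window, the gap can be measured directly. The Python sketch below counts the bytes between the end of the XML declaration and the `<smil` root tag. How the PRONOM signature itself counts the gap may differ slightly, so treat the number as an approximation. For the sample above, saved with single-byte line endings, the result is larger than 86.

```python
import re


def smil_gap(data: bytes):
    """Bytes between the end of the XML declaration and the <smil root tag.

    Returns None when there is no XML declaration or no <smil> root.
    """
    decl = re.match(rb"\s*<\?xml[^>]*\?>", data)
    if decl is None:
        return None
    rest = data[decl.end():]
    # "<smil" followed by whitespace or ">" so "<!DOCTYPE smil" is not matched.
    root = re.search(rb"<smil[\s>]", rest)
    if root is None:
        return None
    return root.start()
```

Running this over a folder of SMIL samples and taking the maximum would give a data-driven upper bound for the revised signature.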

With this update all the files in a DAISY 2.0 book should be identified individually, but as a set of files they make up the DAISY 2.0 format. This format requires the ncc.html file be present at the root of the folder or CD, so this file will aid in the manual identification of this format.

DAISY 3 was released in 2002 and standardized as ANSI/NISO Z39.86-2002. It has been revised a couple of times, with the current revision being 2012. This update adds more functionality to the format, with many new optional and required formats/files included in the folder. Here is a simple example:

ls -la "DAISY3 Export"
total 784
drwx------  1 tyler  staff   16384 Sep 25 22:06 .
drwx------  1 tyler  staff   16384 Sep 25 22:06 ..
-rwx------@ 1 tyler  staff     979 Sep 25 22:05 0001.smil
-rwx------  1 tyler  staff  228413 Sep 25 22:05 aud0001.mp3
-rwx------  1 tyler  staff    1014 Sep 25 22:05 navigation.ncx
-rwx------  1 tyler  staff    1881 Sep 25 22:05 package.opf
-rwx------  1 tyler  staff    7838 Nov  2  2020 tpbnarrator.res
-rwx------  1 tyler  staff  117656 Nov  2  2020 tpbnarrator_res.mp3

The SMIL format is still included, along with MP3s, but we have some additional formats. The NCX or “Navigation Control File”, the OPF or “Package file”, and the RES or “Resource file” are a few of them. The NCX file is the first file accessed as it lays out the navigation for the whole DTB. It is also XML:

cat "DAISY3 Export/navigation.ncx"
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE ncx PUBLIC "-//NISO//DTD ncx 2005-1//EN" "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">
<ncx
	version="2005-1"
	xml:lang="en-US" xmlns="http://www.daisy.org/z3986/2005/ncx/">

This file is only recognized by DROID as a standard XML file. It probably should have its own identification like SMIL, and with a root tag of <ncx> that should be fairly easy to add.

The Package file, with the extension OPF, is actually a format used by the openebook group, not to be confused with a format used by the Open Preservation Foundation 🤣. It follows the Open Packaging Format, and a DTB conforming to this standard must include exactly one Package File, which must be a valid XML 1.0 document conforming to the OEBF Publication Structure 1.2 package.

cat "DAISY3 Export/package.opf"
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE package PUBLIC "+//ISBN 0-9673008-1-9//DTD OEB 1.2 Package//EN" "http://openebook.org/dtds/oeb-1.2/oebpkg12.dtd">
<package
	unique-identifier="uid" xmlns="http://openebook.org/namespaces/oeb-package/1.0/">
	<metadata>
		<dc-metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oebpackage="http://openebook.org/namespaces/oeb-package/1.0/">
			<dc:Identifier
				id="uid">589c550e-303b-4c0d-9921-ae76d782fd53</dc:Identifier>
			<dc:Format>ANSI/NISO Z39.86-2005</dc:Format>
			<dc:Title>Obi Project</dc:Title>
			<dc:Publisher>N/A</dc:Publisher>
			<dc:Language>en-US</dc:Language>
			<dc:Creator>Creator name</dc:Creator>
			<dc:Date>2024-09-25</dc:Date>
		</dc-metadata>

The OPF format is also unknown to PRONOM, so these files are identified as standard XML as well. The root tag of “<package>” could be used elsewhere, so the signature may need to reference the OEB package information.

The RES Resource file is also a standard XML and can be identified through its root tag of “<resources>” and resources DOCTYPE.

cat "DAISY3 Export/tpbnarrator.res"
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE resources PUBLIC "-//NISO//DTD resource 2005-1//EN" "http://www.daisy.org/z3986/2005/resource-2005-1.dtd" []>
<resources xmlns="http://www.daisy.org/z3986/2005/resource/" version="2005-1">
  
  <!-- SKIPPABLE NCX -->
  
  <scope nsuri="http://www.daisy.org/z3986/2005/ncx/">
    <nodeSet id="ns001" select="//smilCustomTest[@bookStruct='LINE_NUMBER']">
      <resource xml:lang="en" id="r001">
        <text>Row</text>
        <audio src="tpbnarrator_res.mp3" clipBegin="0:00:02.379" clipEnd="0:00:03.416" />
      </resource>
    </nodeSet>
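All three DAISY 3 XML files above carry a DOCTYPE with a distinctive public identifier, so one way to separate them from generic XML is to match the root name together with a fragment of that identifier. The Python sketch below does this with a regular expression instead of a full XML parser. The fragments come from the three samples above; the function name, the labels, and the 2048-byte window are illustrative choices, and other producers may write different DOCTYPEs.

```python
import re

# Root name and public identifier from a DOCTYPE declaration.
DOCTYPE_RE = re.compile(rb'<!DOCTYPE\s+([\w.-]+)\s+PUBLIC\s+"([^"]*)"')

# (root element, fragment of the public identifier, label), from the samples above.
DAISY3_DOCTYPES = [
    (b"ncx", b"//NISO//DTD ncx", "DAISY 3 Navigation Control File (NCX)"),
    (b"package", b"//DTD OEB 1.2 Package", "OEB 1.2 Package File (OPF)"),
    (b"resources", b"//NISO//DTD resource", "DAISY 3 Resource File (RES)"),
]


def identify_by_doctype(data: bytes, window: int = 2048):
    """Return a label when the DOCTYPE root and public ID match a known pair."""
    match = DOCTYPE_RE.search(data[:window])
    if match is None:
        return None
    root, public_id = match.groups()
    for want_root, fragment, label in DAISY3_DOCTYPES:
        if root == want_root and fragment in public_id:
            return label
    return None
```

Requiring both the root name and the identifier fragment addresses the concern that a bare `<package>` root could appear in unrelated XML.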

Now, adding these DAISY 3.0 formats will greatly improve the identification of this complex format. But we run into a problem with some of the software which generates these DAISY files: some titles include files not required by the format, for use by the software itself. This can include CSS files for formatting, additional XML and XSL files, DTDs, and, for DAISY files created by the PlexTalk software, additional project files.

ls -la MasterCD/AfterBuild 
total 7520
drwx------@ 1 tyler  staff    16384 Sep 24 19:34 .
drwx------@ 1 tyler  staff    16384 Sep 25 22:11 ..
-rwx------@ 1 tyler  staff     6688 Sep 25 01:32 ImdPhrInfo.imph
-rwx------@ 1 tyler  staff     3773 Sep 25 01:32 ImdTxtTabl.imtt
-rwx------@ 1 tyler  staff     1276 Sep 25 01:32 Ncc.imdn
-rwx------@ 1 tyler  staff  3716618 Sep 25 01:32 a000001.mp3
-rwx------@ 1 tyler  staff     4352 Sep 25 01:32 ncc.html
-rwx------@ 1 tyler  staff     1015 Sep 25 01:32 ptk000001.smil
-rwx------@ 1 tyler  staff      938 Sep 25 01:32 ptk000002.smil

The ncc.html file is here, indicating a DAISY 2.0 format, along with an MP3 and SMIL files, but the folder also includes some additional formats.

In addition, when creating a project, four files with the extensions Ncc.imdn, ImdPhrInfo.imph, ImdTxtTabl.imtt, and METADATA.ini are automatically created. These files are called “Plextalk project files.” They store table of contents information, etc. (Plextalk project files generated by older versions of this product do not have METADATA.ini.)
http://www.plextalk.com/jp/dw_data/PRSStd/PLEX_RS_UM.html

These four files may not be crucial to the playing of the Daisy format, but they are important to the PlexTalk software.

hexdump -C ImdPhrInfo.imph | head
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000020  ff ff ff ff ff ff ff ff  00 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  f0 a3 0d 00 00 00 00 00  |................|
00000040  a3 06 00 00 a4 06 00 00  00 00 00 00 53 00 00 00  |............S...|
00000050  ff ff ff ff 01 00 00 00  03 00 00 00 00 00 00 00  |................|
00000060  00 00 00 00 00 00 00 00  c5 11 00 00 20 1a 00 00  |............ ...|
00000070  e5 2b 00 00 00 00 00 00  63 00 00 00 ff ff ff ff  |.+......c.......|
00000080  02 00 00 00 04 00 00 00  00 00 00 00 00 00 00 00  |................|
00000090  00 00 00 00 e5 2b 00 00  d6 0b 00 00 bb 37 00 00  |.....+.......7..|

hexdump -C ImdTxtTabl.imtt | head 
00000000  17 00 00 00 32 30 30 34  2f 30 35 2f 33 31 2f 31  |....2004/05/31/1|
00000010  36 3a 36 3a 34 37 2e 30  30 30 00 03 00 00 00 65  |6:6:47.000.....e|
00000020  6e 00 0b 00 00 00 69 73  6f 2d 38 38 35 39 2d 31  |n.....iso-8859-1|
00000030  00 0d 00 00 00 5a 3a 2f  42 6f 6f 6b 44 69 72 34  |.....Z:/BookDir4|
00000040  2f 00 0d 00 00 00 5a 3a  2f 42 6f 6f 6b 44 69 72  |/.....Z:/BookDir|
00000050  34 2f 00 0c 00 00 00 61  30 30 30 30 30 31 2e 6d  |4/.....a000001.m|
00000060  70 33 00 0c 00 00 00 61  30 30 30 30 30 31 2e 6d  |p3.....a000001.m|
*
00000980  70 33 00 08 00 00 00 48  65 61 64 69 6e 67 00 01  |p3.....Heading..|
00000990  00 00 00 00 08 00 00 00  48 65 61 64 69 6e 67 00  |........Heading.|

hexdump -C Ncc.imdn | head       
00000000  01 ff 00 ff c4 00 00 00  3c 00 00 00 2c 00 00 00  |........<...,...|
00000010  14 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  00 00 00 00 49 6d 64 54  78 74 54 61 62 6c 2e 69  |....ImdTxtTabl.i|
00000030  6d 74 74 00 00 00 00 00  00 00 00 00 00 00 00 00  |mtt.............|
00000040  00 00 00 00 49 6d 64 50  68 72 49 6e 66 6f 2e 69  |....ImdPhrInfo.i|
00000050  6d 70 68 00 00 00 00 00  00 00 00 00 00 00 00 00  |mph.............|
00000060  00 00 00 00 04 00 00 00  00 fa 00 00 44 ac 00 00  |............D...|
00000070  01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000080  00 00 00 00 01 00 00 00  08 00 00 00 12 00 00 00  |................|
00000090  03 00 00 00 00 00 00 00  01 00 00 00 ff ff ff ff  |................|

I don’t have a METADATA.ini file to research, but I will be honest, these PlexTalk files will be hard to identify from their contents.

Looking at the IMPH file, there aren’t many bytes which might serve as magic bytes, but I do see some patterns. The first 40 bytes all seem to be the same.

00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 FFFFFFFF FFFFFFFF

But making a signature from only 00 and FF might clash with other formats. It does appear that the 4 bytes FFFFFFFF recur with 40 bytes between each occurrence. That pattern might be precise enough if we repeat it a couple of times.
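A rough Python sketch of that idea, based only on the single IMPH dump above. In that dump the 40-byte prefix is followed by FFFFFFFF at 0x50 and again at 0x7C, so the start-to-start stride is 44 bytes. The default offsets below are taken from that one file and are an assumption; they may not hold for other samples.

```python
# 32 zero bytes followed by 8 FF bytes, as in the dump above.
IMPH_PREFIX = bytes(32) + b"\xff" * 8
MARKER = b"\xff\xff\xff\xff"


def looks_like_imph(data: bytes, repeats: int = 2, first: int = 0x50, stride: int = 44):
    """Check the fixed 40-byte prefix, then `repeats` markers spaced `stride` apart.

    Assumption: the offsets seen in one sample (0x50, 0x7C) generalize.
    """
    if not data.startswith(IMPH_PREFIX):
        return False
    for n in range(repeats):
        start = first + n * stride
        if data[start:start + 4] != MARKER:
            return False
    return True
```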

The IMTT file is different. It appears to hold information on the name, character set, and all the files in the Daisy package. My 14 samples all start with either 17000000 or 18000000. Not knowing what the 17 or 18 refers to, I am hesitant to use it for identification. In between some of the data there are some consistent bytes, but at different offsets.


hexdump -C ImdTxtTabl.imtt | head
00000000  18 00 00 00 54 69 74 6c  65 00 35 39 2d 31 00 31  |....Title.59-1.1|
00000010  35 3a 35 34 3a 35 39 2e  32 36 30 00 03 00 00 00  |5:54:59.260.....|
00000020  65 6e 00 0b 00 00 00 69  73 6f 2d 38 38 35 39 2d  |en.....iso-8859-|
00000030  31 00 01 00 00 00 00 01  00 00 00 00 01 00 00 00  |1...............|
00000040  00 01 00 00 00 00 01 00  00 00 00 01 00 00 00 00  |................|
00000050  01 00 00 00 00 01 00 00  00 00 0c 00 00 00 4d 61  |..............Ma|
00000060  72 69 6f 6e 20 53 79 6d  65 00 28 00 00 00 4d 69  |rion Syme.(...Mi|
00000070  6e 75 74 65 73 20 6f 66  20 74 68 65 20 43 6f 6d  |nutes of the Com|
00000080  6d 69 74 74 65 65 20 4d  65 65 74 69 6e 67 20 32  |mittee Meeting 2|
00000090  34 30 35 30 34 00 08 00  00 00 48 65 61 64 69 6e  |40504.....Headin|

hexdump -C ImdTxtTabl.imtt | head
00000000  17 00 00 00 32 30 30 34  2f 30 35 2f 33 31 2f 31  |....2004/05/31/1|
00000010  36 3a 36 3a 34 37 2e 30  30 30 00 03 00 00 00 65  |6:6:47.000.....e|
00000020  6e 00 0b 00 00 00 69 73  6f 2d 38 38 35 39 2d 31  |n.....iso-8859-1|
00000030  00 0d 00 00 00 5a 3a 2f  42 6f 6f 6b 44 69 72 34  |.....Z:/BookDir4|
00000040  2f 00 0d 00 00 00 5a 3a  2f 42 6f 6f 6b 44 69 72  |/.....Z:/BookDir|
00000050  34 2f 00 0c 00 00 00 61  30 30 30 30 30 31 2e 6d  |4/.....a000001.m|
00000060  70 33 00 0c 00 00 00 61  30 30 30 30 30 31 2e 6d  |p3.....a000001.m|

Not sure what any of it means, but might be good enough for a signature.
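One possible reading of the leading value, based only on the two dumps above: 0x17 is 23 and 0x18 is 24, and in both dumps that equals the number of bytes in the first field including its trailing NUL. The pattern appears to continue (03 00 00 00 before “en”, 0b 00 00 00 before “iso-8859-1”), which would make the file a run of length-prefixed, NUL-terminated strings. This is inferred from the dumps, not from any PlexTalk documentation, so the Python sketch below is a guess at the layout.

```python
import struct


def read_fields(data: bytes, limit: int = 8):
    """Read up to `limit` fields of the form <uint32 LE length><bytes incl. NUL>.

    Assumption: the length-prefixed layout is inferred from two hexdumps only.
    Parsing stops at a zero length or when a field runs past the end of the data.
    """
    fields = []
    pos = 0
    while pos + 4 <= len(data) and len(fields) < limit:
        (length,) = struct.unpack_from("<I", data, pos)
        end = pos + 4 + length
        if length == 0 or end > len(data):
            break
        fields.append(data[pos + 4:end].rstrip(b"\x00"))
        pos = end
    return fields
```

If this layout holds across more samples, a signature could test for a small length value followed by a NUL at exactly that distance, instead of relying on the literal 17 or 18.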

Now the IMDN files might be a little easier:

hexdump -C Ncc.imdn | head
00000000  01 ff 00 ff d4 00 00 00  3c 00 00 00 2c 00 00 00  |........<...,...|
00000010  14 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  00 00 00 00 49 6d 64 54  78 74 54 61 62 6c 2e 69  |....ImdTxtTabl.i|
00000030  6d 74 74 00 00 00 00 00  00 00 00 00 00 00 00 00  |mtt.............|
00000040  00 00 00 00 49 6d 64 50  68 72 49 6e 66 6f 2e 69  |....ImdPhrInfo.i|
00000050  6d 70 68 00 00 00 00 00  00 00 00 00 00 00 00 00  |mph.............|
00000060  00 00 00 00 04 00 00 00  00 7d 00 00 22 56 00 00  |.........}.."V..|
00000070  01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000080  00 00 00 00 01 00 00 00  28 00 00 00 28 00 00 00  |........(...(...|
00000090  00 00 00 00 00 00 00 00  28 00 00 00 ff ff ff ff  |........(.......|

This format directly names the two other formats. Should be easy to look for the two file names in the header. The NCC html file in Daisy 2.0 and the NCX xml file in Daisy 3.0 are directory files so it makes sense this file would do the same.
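As a rough sketch of that idea in Python: look for both sibling file names near the start of the file. The 0x60 byte window comes from the single dump above and the function name is hypothetical, so treat both as assumptions to test against more samples.

```python
# File names as they appear in the IMDN header dump above.
IMTT_NAME = b"ImdTxtTabl.imtt\x00"
IMPH_NAME = b"ImdPhrInfo.imph\x00"


def looks_like_imdn(data: bytes) -> bool:
    """Heuristic: both sibling file names appear in the first 0x60 bytes."""
    head = data[:0x60]
    return IMTT_NAME in head and IMPH_NAME in head


# Rebuild the first 0x64 bytes of the dump: 20 header bytes, zero padding,
# then the two names at offsets 0x24 and 0x44.
sample = (
    bytes.fromhex("01ff00ffd40000003c0000002c00000014000000")
    + bytes(16)
    + b"ImdTxtTabl.imtt" + bytes(17)
    + b"ImdPhrInfo.imph" + bytes(17)
)
print(looks_like_imdn(sample))
```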

Not sure if these signatures will hold up over time, but they are a start. It would be nice if all the files we are given to preserve would have convenient static magic bytes, but alas, many do not and we have to guess.

These Daisy formats illustrate a problem in preservation that doesn’t quite have a good solution. Each of these files is individually unique and can be identified, but as a whole they represent another unique format. Tying formats together to record their interdependence will be no small task, but it will be necessary, not only to understand the format, but also to avoid breaking that interdependence by separating, renaming, or rearranging the files.

I have added the update to SMIL and new signatures for the other formats to my GitHub repository. Feel free to test and change if you find additional samples or information.

HFE

September 27, 2024 by Thor 2 Comments

Last week I had the pleasure of attending the 20th annual iPres conference on Digital Preservation in Ghent, Belgium. I enjoyed hearing from many of my respected colleagues on many aspects of preservation including one of my favorite topics, floppy disks. There were tutorials, lightning talks, and even a workshop, presented by Leontien Talboom, Elizabeth Kata, Chris Knowles, and myself. We titled the workshop “A Guide to Imaging Obscure Floppy Disk Formats”. The workshop grew out of a mutual interest in imaging Wang 5.25in word processor disks, but expanded to include imaging of Amstrad 3in disks, 240K Brother Typewriter Disks, and Macintosh 400/800k disks.

I brought my hand-soldered FluxEngine board and others brought their Greaseweazle boards to show off how imaging obscure and uncommon disks can be done on a budget.

Image taken during the workshop on a Mavica FD200 floppy disk camera.

During the conference we talked a bit about the different types of hardware that can be used and the difference between a disk image and a flux image. There seems to be quite the exhaustive list of different types of file formats, some specific to a platform and others more generic. I recently did a blog post on the formats used by the Applesauce software, which have some unique features.

There are many disk image types which should be researched and added to PRONOM and other format description sites, but today let’s take a look at a generic format used by many tools.

The HxC Floppy Emulator file format, with the extension HFE, is a popular format used with floppy drive emulators. There is a lot of variation in what these image formats include: some are simply a raw sector representation of the binary data on a disk, while others contain the complete flux readings from a floppy disk. The HFE format contains a little more than a raw image, including a header, a track lookup table, and the bitstreams for each track, all with the purpose of emulating the physical media. The HFE format contains only a single pass over the data, whereas other formats may contain multiple readings of each track to get more complete data, which can be helpful for damaged or purposely copy-protected disks. You can read more on Ashley’s blog and in the Library of Congress format description.

When using the HxC Floppy Emulator software, you can open and save to many different formats. The main one is their native HFE format, which comes in 5 versions.

hexdump -C test01.hfe | head
00000000  48 58 43 50 49 43 46 45  00 53 02 00 e8 01 00 00  |HXCPICFE.S......|
00000010  07 01 01 00 ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000020  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|

Above is a hexdump of the main SDCard HxC Floppy Emulator file format. The format specification shows the 8 byte header “HXCPICFE”. This is a very unique pattern and should be all we need to make a robust signature for the format, but we do need to take into account the other HFE “versions” and see if they might clash or need to be identified separately.

hexdump -C test02-a2.hfe | head 
00000000  48 58 43 50 49 43 46 45  00 53 02 00 d0 03 00 00  |HXCPICFE.S......|
00000010  07 01 01 00 ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000020  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|

The “A2” version of the format has the same header but some different bytes further into the file.

hexdump -C test03-rev2.hfe | head
00000000  48 58 43 50 49 43 46 45  01 53 02 00 00 00 00 00  |HXCPICFE.S......|
00000010  07 01 01 00 ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000020  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|

The “Rev 2” version also has the same header. But if you look at the 9th byte you can see the value changed from 00 to 01. According to the specification, this is the revision byte.

hexdump -C test04-rev3.hfe | head 
00000000  48 58 43 48 46 45 56 33  00 53 02 00 e8 01 00 00  |HXCHFEV3.S......|
00000010  07 01 01 00 ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000020  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|

With “Rev 3” we see a change in the header with “HXCHFEV3” which appears to be referred to as HFEv3.

hexdump -C test05-stream.hfe | head 
00000000  48 78 43 5f 53 74 72 65  61 6d 5f 49 6d 61 67 65  |HxC_Stream_Image|
00000010  00 00 00 00 00 00 00 00  00 18 00 00 00 02 00 00  |................|
00000020  00 1a 00 00 53 00 00 00  02 00 00 00 40 9c 00 00  |....S.......@...|
00000030  07 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

This last format seems to be a special HxC stream image.

It seems the best option is to make three signatures to identify the three main headers. Additional software can be used to further parse the disk image. If you would like to see some sample images, you can download a bunch here. You can also take a look at my GitHub repository to see additional samples and a proposed set of signatures.
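To make the three-header idea concrete, here is a small Python sketch that sorts a file into one of the three groups using only the bytes seen in the dumps above. The labels and function name are my own shorthand, not official names, and the revision byte is read only for the HXCPICFE header, where the dumps show it changing.

```python
# Magic bytes taken from the five hexdumps above; labels are informal.
HFE_MAGICS = {
    b"HXCPICFE": "HFE",
    b"HXCHFEV3": "HFEv3",
    b"HxC_Stream_Image": "HxC stream image",
}


def identify_hfe(data: bytes):
    """Return (label, revision) or None if no known header matches.

    revision is the 9th byte for HXCPICFE files and None otherwise.
    """
    for magic, label in HFE_MAGICS.items():
        if data.startswith(magic):
            revision = None
            if magic == b"HXCPICFE" and len(data) > 8:
                revision = data[8]
            return label, revision
    return None


# First 16 bytes of test01.hfe and test03-rev2.hfe from the dumps above.
print(identify_hfe(bytes.fromhex("48584350494346450053020000e8010000"[:32])))
print(identify_hfe(bytes.fromhex("48584350494346450153020000000000")))
```

Further parsing of track tables and bitstreams is left to dedicated tools. This only covers the identification step.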

ATRAC

September 6, 2024 by Thor 3 Comments

The year was 2001 and I found myself in need of an audio player and recorder. I had been burning CDs for a few years, and making mixed CDs was fun and convenient, but I needed more flexibility. After some research I decided on a device that was super popular outside the United States, but was gaining some loyal fans.

This MZ-G750 MiniDisc device could record in a standard high quality mode through RCA, optical digital cable, and an optional microphone in mini-plug. This model also had the LP2 and LP4 modes which compressed higher, but could record up to 320 minutes on one MD disc.

Sony accomplished this by using a proprietary compression codec called ATRAC, or Adaptive TRansform Acoustic Coding. This compression format was used with the MiniDisc and other Sony devices like the flash memory Walkmans sold later.

I recorded and stored a lot of music on the few discs I purchased over the next year, but as you may have surmised, the iPod came out later that year. I waited a bit but eventually purchased the updated 10GB model, and the MiniDisc was only used to make a few recordings over the next little while.

As good as the MiniDisc is, the model I owned could record in a digital format, but lacked the connections to transfer the audio to a computer unless you used the optical cable and captured in real time to a computer with an optical input. This was by design, even when they put USB ports on later models, the software only allowed sending audio to the MiniDisc, but not back from the device.

A few years back I heard of some work the community has done to bring MiniDiscs back from the shadows. Now there is a thriving market and some models can cost a pretty penny. With that came some great tools and the ability to copy from the device back to the computer. The only problem: my device lacks a USB port. I kept my eye out for a “good” deal on a NetMD MiniDisc device. It took some time, but I am happy to report I am now the proud owner of an MZ-N420D.

With a new USB capable NetMD in hand, let’s take a look at the different ATRAC formats!

The most common ATRAC formats are the ATRAC3 versions which generally have the extension OMA or OMG. But let’s start with ATRAC1, the format used on my earlier MiniDisc device when captured in Standard Mode. Using the amazing https://webmd.pro/ tool, I was able to connect my new device and “archive” my disc.

hexdump -C Test1.aea | head
00000000  00 08 00 00 54 65 73 74  31 00 00 00 00 00 00 00  |....Test1.......|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000100  00 00 00 00 1e 01 00 00  02 00 00 00 00 00 00 00  |................|
00000110  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000b50  0c a0 45 57 54 44 32 35  41 44 22 34 32 24 13 23  |..EWTD25AD"42$.#|
00000b60  32 23 22 12 11 11 11 11  76 18 69 75 f8 63 69 a7  |2#".....v.iu.ci.|
00000b70  a4 5d 46 22 45 36 1f 59  55 9d 41 55 19 51 45 17  |.]F"E6.YU.AU.QE.|
00000b80  45 14 55 38 c2 cb 2c b2  88 26 fd b2 17 b3 f0 0f  |E.U8..,..&......|

ffprobe -i Test1.aea
[aea @ 0x7fc5e6c04fc0] Estimating duration from bitrate, this may be inaccurate
Input #0, aea, from 'Test1.aea':
  Duration: 00:00:01.63, bitrate: 302 kb/s
  Stream #0:0: Audio: atrac1, 44100 Hz, stereo, fltp, 292 kb/s

ATRAC1 files can have the AEA extension, which ffmpeg can decode, but MediaInfo doesn’t appear to have added support. The decoder source describes the magic numbers for the ATRAC1 format this way: “Magic is ‘00 08 00 00‘ in little-endian”. This pattern matches my files, but the recent addition PRONOM fmt/1968 doesn’t match all the samples I have.

The magic numbers are too simple to be the only pattern used in a signature. The track title follows the magic numbers but is not static. Then there are quite a few zero bytes, like a lot. All the samples I have seem to have some data around offset 260, then more zero bytes until somewhere in the 2400 to 2800 byte offset range. I scanned all the samples I have through Tridscan, and it looks like the only bytes in common are the magic header, lots of zeros, and a few strings.

	<GlobalStrings>
		<String>ED33</String>
		<String>EUD3</String>
		<String>FTDC</String>
		<String>T322</String>
		<String>TC32</String>
		<String>TC43</String>
		<String>UC22</String>
		<String>UED3</String>
		<String>VD33</String>
		<String>VETC</String>
		<String>WEDD</String>
	</GlobalStrings>

The ffmpeg libavformat code does tell us that at byte 264 there will be a 01 or 02, which indicates the number of channels. 44.1 kHz is assumed and the bitrate is calculated from a constant and the number of channels, so there is not much else to use for identifying common patterns. More testing needed.
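Pulling those few facts together, here is a hedged Python sketch of what a check for AEA could look like: the 4 magic bytes plus a plausible channel count at offset 264. Reading the title as NUL-terminated text starting at offset 4 is my assumption from the dump above, as are the function name and the UTF-8 decoding.

```python
AEA_MAGIC = b"\x00\x08\x00\x00"
CHANNEL_OFFSET = 264  # byte described above as holding 01 or 02


def parse_aea_header(data: bytes):
    """Return {'title', 'channels'} if the header looks like AEA, else None."""
    if len(data) <= CHANNEL_OFFSET or not data.startswith(AEA_MAGIC):
        return None
    channels = data[CHANNEL_OFFSET]
    if channels not in (1, 2):
        return None
    # Assumption: the title is NUL-terminated text right after the magic.
    raw_title = data[4:260].split(b"\x00", 1)[0]
    return {"title": raw_title.decode("utf-8", "replace"), "channels": channels}


# Rebuild the start of Test1.aea from the dump: magic, title, zero padding
# up to offset 0x100, then the 12 bytes shown at 0x100.
sample = (AEA_MAGIC + b"Test1").ljust(256, b"\x00") + bytes.fromhex(
    "000000001e01000002000000"
)
print(parse_aea_header(sample))
```

With only 4 magic bytes and one channel byte to lean on, this is still a weak test and would benefit from whatever additional structure more samples reveal.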

ATRAC3 is what allowed my original MiniDisc to record in LP2 and LP4, extending the recording time. This format was also how some DRM was added to the device and computer, allowing some checking-in and checking-out of files while controlling their use. This was done with desktop software from Sony, originally in the form of the title SonicStage, later incorporating OpenMG to manage the DRM. I used SonicStage to encode some audio into OMG and OMA formats.

OpenMG format files

These are audio files which have been converted to ATRAC3 format and encrypted in OpenMG format, which is the copyright protection technology for audio contents specific to OpenMG (with the extension .omg).

hexdump -C 01-Untitled.omg | head
00000000  30 80 30 80 06 07 66 6f  70 65 6e 4d 47 02 02 03  |0.0...fopenMG...|
00000010  eb 04 14 01 0f 50 00 00  04 00 00 00 ba d0 90 49  |.....P.........I|
00000020  3d 7f 61 7b 91 c4 30 06  02 67 01 02 02 3f 00 06  |=.a{..0..g...?..|
00000030  02 68 01 02 04 00 59 47  80 02 01 00 02 03 02 03  |.h....YG........|
00000040  a0 02 02 01 80 02 01 00  00 00 04 08 f5 94 79 c9  |..............y.|
00000050  6b 78 75 22 04 84 00 59  5e 30 83 0b 71 39 e3 e8  |kxu"...Y^0..q9..|
00000060  27 29 00 00 00 00 00 00  00 00 26 e2 65 d0 de e0  |')........&.e...|
00000070  69 19 73 45 1c c4 3b 36  8d 02 3b 72 bd eb 84 df  |i.sE..;6..;r....|
00000080  cd 20 4e 43 d3 e3 23 8a  3f 9e df 80 f1 86 d1 aa  |. NC..#.?.......|
00000090  2b 93 bf 09 59 0d d6 8f  78 5d 45 3a 9f d8 79 8b  |+...Y...x]E:..y.|

ffprobe -i /01-Untitled.omg 
[oma @ 0x7fed2440e980] Format oma detected only with low score of 1, misdetection possible!
[oma @ 0x7fed2440e980] Couldn't find the EA3 header !
/01-Untitled.omg: Invalid data found when processing input

The good news is there appears to be a standard header for the OMG format, but ffmpeg assumes they are OMA files. Turns out OMG was the original form of the format, but was replaced with OMA starting with SonicStage v2.1.

hexdump -C 01-Untitled.oma | head
00000000  65 61 33 03 00 00 00 00  17 76 54 49 54 32 00 00  |ea3......vTIT2..|
00000010  00 17 00 00 02 00 55 00  6e 00 74 00 69 00 74 00  |......U.n.t.i.t.|
00000020  6c 00 65 00 64 00 28 00  31 00 29 54 41 4c 42 00  |l.e.d.(.1.)TALB.|
00000030  00 00 11 00 00 02 00 55  00 6e 00 74 00 69 00 74  |.......U.n.t.i.t|
00000040  00 6c 00 65 00 64 54 58  58 58 00 00 00 17 00 00  |.l.e.dTXXX......|
00000050  02 00 4f 00 4d 00 47 00  5f 00 54 00 52 00 41 00  |..O.M.G._.T.R.A.|
00000060  43 00 4b 00 00 00 31 54  58 58 58 00 00 00 25 00  |C.K...1TXXX...%.|
00000070  00 02 00 4f 00 4d 00 47  00 5f 00 41 00 4c 00 42  |...O.M.G._.A.L.B|
00000080  00 4d 00 53 00 00 00 55  00 6e 00 74 00 69 00 74  |.M.S...U.n.t.i.t|
00000090  00 6c 00 65 00 64 54 58  58 58 00 00 00 23 00 00  |.l.e.dTXXX...#..|
*
00000c00  45 41 33 03 00 60 ff 80  00 00 00 00 01 0f 50 00  |EA3..`........P.|
00000c10  00 04 00 00 00 60 8a 07  e3 0a c9 91 63 46 c6 bc  |.....`......cF..|
00000c20  22 52 03 76 00 05 66 48  00 00 3b 86 00 00 00 00  |"R.v..fH..;.....|
00000c30  00 00 20 30 00 00 00 00  00 00 00 00 00 00 00 00  |.. 0............|
00000c40  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

ffprobe -i 01-Untitled.oma
Input #0, oma, from '01-Untitled.oma':
  Metadata:
    title           : Untitled(1)
    album           : Untitled
    OMG_TRACK       : 1
    OMG_ALBMS       : Untitled
    OMG_ASGTM       : 2366000
    OMG_TIT2S       : Untitled(1)
    TLEN            : 353000
  Duration: N/A, start: 0.000000, bitrate: N/A
  Stream #0:0: Audio: atrac3al ([34][0][0][0] / 0x0022), 44100 Hz, stereo, fltp

We learned from trying an OMG file in ffprobe that ffmpeg is looking for an EA3 header, which is found in this OMA file. Both of these formats should have a nice header to work from for a signature. In fact there has already been a request and signature submitted for the OMA format. My samples are slightly different, but it only takes a small tweak to the signature to work with all of them. Also, it seems the extension AA3 was used for a while before settling on OMA. OMA can have a few different types:

ffprobe -i 02-Untitled.oma 
[oma @ 0x7fbc7ef047c0] Estimating duration from bitrate, this may be inaccurate
Input #0, oma, from '/Star Trek/02-Untitled.oma':
  Metadata:
    title           : Untitled(2)
    album           : Star Trek
    OMG_TRACK       : 2
    OMG_ALBMS       : Star Trek
    OMG_ASGTM       : 2366000
    OMG_TIT2S       : Untitled(2)
    TLEN            : 27000
  Duration: 00:00:27.21, start: 0.000000, bitrate: 193 kb/s
  Stream #0:0: Audio: atrac3p ([1][0][0][0] / 0x0001), 44100 Hz, stereo, fltp, 192 kb/s

I’ll leave the technical properties to be handled by tools more suited for parsing the format like ffmpeg. Maybe MediaInfo could have the formats added, but until then, it might be best to simply identify the main format. I am also aware of some later additions to the ATRAC family, such as ATRAC3plus, ATRAC Advanced Lossless, and ATRAC9 (WAV RIFF). There are other extensions like AT3 out there which use the ATRAC codec, like Sony’s Playstation or PSP. I will have to keep my eyes out for the even more elusive Hi-MD MiniDisc devices to find out more. For now, take a look at some samples and my proposal for signatures on my GitHub.
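One detail in the OMA dump above is worth a quick sketch. The leading "ea3" block looks like an ID3v2-style tag, and if bytes 6 to 9 (00 00 17 76) are read as an ID3v2 "syncsafe" size (7 bits per byte), the tag ends at 10 + 3062 = 0xC00, exactly where the uppercase EA3 header appears. The Python below tests that reading and separately flags OMG files by the "openMG" text near the start. The function names are hypothetical and the syncsafe interpretation is my assumption from this one sample.

```python
def syncsafe(b: bytes) -> int:
    """Decode a 4-byte ID3v2-style size (7 significant bits per byte)."""
    return (b[0] << 21) | (b[1] << 14) | (b[2] << 7) | b[3]


def identify_openmg(data: bytes):
    """Return ('OMA', ea3_offset, ea3_found), ('OMG', None, False) or None."""
    if data[:3] == b"ea3" and len(data) >= 10:
        # Assumption: 10-byte tag header followed by a syncsafe-sized body.
        ea3_offset = 10 + syncsafe(data[6:10])
        found = data[ea3_offset:ea3_offset + 3] == b"EA3"
        return ("OMA", ea3_offset, found)
    if b"openMG" in data[:16]:
        return ("OMG", None, False)
    return None


# Header bytes from the OMA dump, zero filler for the tag body, then the
# EA3 bytes shown at offset 0xC00.
oma = b"ea3\x03\x00\x00\x00\x00\x17\x76" + bytes(3062) + b"EA3\x03\x00\x60\xff\x80"
omg = bytes.fromhex("308030800607666f70656e4d47020203")
print(identify_openmg(oma))
print(identify_openmg(omg))
```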

Worldox

August 23, 2024 by Thor 1 Comment

Most File Systems have unique ways for doing things, but also many things in common. On a Macintosh you might have some extended attributes, or that pesky hidden .DS_Store file no one really knows why it’s there. On Windows you may find a hidden thumbs.db file throwing off your file count. Hidden files are everywhere. Many have a real purpose, and that purpose may be insignificant or important in finding or giving context to other files.

While processing a collection from a USB drive the other day, I came across a few files I hadn’t seen before. They were hidden files nestled in with a few folders of PDFs. They have a unique name, so I figured it would be easy to find some documentation on them on the web. Turns out, there is very little.

-rwx------@ 1 tyler  staff    235 Aug 22 00:04 XNAME.CRS
-rwx------@ 1 tyler  staff    235 Aug 22 00:04 XNAME.LIB

The files were only a couple years old, so I figured there had to be some modern software which created them. A look inside the files with a hex editor didn’t provide much information.

hexdump -C XNAME.LIB 
00000000  22 80 21 36 00 00 00 00  00 00 00 00 00 00 00 00  |".!6............|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000090  00 00 00 00 00 00 4c 3c  55 6e 61 73 73 69 67 6e  |......L<Unassign|
000000a0  65 64 3e 00 00 00 00 00  00 00 00 00 00 00 00 00  |ed>.............|
000000b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

I was about to give up. There wasn’t much data, and since they were hidden files I assumed they were probably just some cached files with little value. But wanting to learn more, I did some more digging. At first I thought they might have something to do with Dropbox, as a user said they just showed up one day, but I later found they were probably created by document management software known as Worldox. I found a support page claiming these two files are part of a database.

XNAME.LIB	Contains document numbers (DOS names), extended names, and file security information.
XNAME.CRS	Contains custom profile field and version control information.

There is a key term in the definition of XNAME.LIB, “extended names”. I was curious what that meant and found Worldox has been around a while. The World Software Corporation has been around since 1988 and Worldox was released in 1993, but before that the company specialized in an interesting DOS software package called “Extend-A-Name” or “Extend-A-File”. The name gives away its purpose: it literally extends the name beyond the limited 8 characters you could use in DOS. I can remember trying to decide on a filename that would accurately describe my file so I knew what it was later on. Eight characters is not enough to explain the content of a file, especially if you have hundreds or thousands of files to manage.

Extend-a-file was software which bonded with another piece of software like WordPerfect and loaded itself in memory. Then when you went to create a file or locate a file within WordPerfect, Extend-a-File would take over and allow you to create a file with a traditional 8 Character name, but also a name much longer.

This extended name allowed you to describe the file’s content in much more detail, which also made it very easy to find previous documents.

Pretty slick. This software really would have made a big difference when managing a large number of files in the old DOS days. Ok, it adds extended names, but where is this information stored? That is where the XNAME files come into play.

hexdump -C XNAME.LIB | head
00000000  6d 92 15 59 47 47 15 00  00 00 00 00 00 00 00 00  |m..YGG..........|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000090  00 00 00 00 00 00 4c 10  20 4d 41 53 54 45 52 20  |......L. MASTER |
000000a0  4c 49 42 52 41 52 59 20  2d 20 41 6c 6c 20 46 69  |LIBRARY - All Fi|
000000b0  6c 65 73 20 4c 69 73 74  65 64 00 20 64 72 69 76  |les Listed. driv|
000000c0  65 20 43 20 00 58 4e 50  4c 55 53 2e 24 24 24 00  |e C .XNPLUS.$$$.|
000000d0  fd 05 fe 49 6e 73 75 66  66 69 63 69 65 6e 74 20  |...Insufficient |
000000e0  64 69 73 6b 20 73 70 61  63 65 20 46 54 68 69 73  |disk space FThis|
000000f0  20 69 73 20 61 20 74 65  73 74 20 6f 66 20 58 4e  | is a test of XN|
00000100  41 4d 45 20 20 20 20 20  20 20 20 20 20 20 20 20  |AME             |
00000110  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00000120  20 20 20 20 20 20 20 20  00 56 44 53 30 30 30 30  |        .VDS0000|
00000130  2e 44 4f 43 00 00 96 00  56 44 53 00 00 00 00 4f  |.DOC....VDS....O|
00000140  55 30 30 30 30 30 30 0a  09 1d 00 02 00 00 00 00  |U000000.........|
00000150  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

This XNAME.LIB was generated by a running copy of XNPLUS circa 1990, bonded with a copy of DisplayWrite 4. It adds much more information within the “Library”.

So it seems this method of storing the extended filenames and other metadata started in the Extend-a-File software and has been brought along and used in modern versions of the Worldox Document Management software. Its original purpose, extending an 8 character filename to around 60 characters, is largely no longer needed, as most systems now allow filenames of around 255 characters. I imagine there is more the software can add to these files, but I found that the samples I have really don’t have any information in them at all. The Worldox software seems to be marketed toward law firms and others who have a lot of documents to manage, but I have been unable to find a way to play with the software to see what can be embedded within the XNAME.LIB files.

There is also some discussion out there about whether to back up these two hidden files and what might happen if they are lost. Regardless, you may want to think twice before tossing them as I almost did. They could contain valuable information needed to give context.

I am not sure it is possible to have a good signature for identification of these files. The samples I have and others I found online, here, here, here, and here, just don’t have much data within them. In fact they are all exactly 235 bytes. The only consistent byte within them and the samples I generated from XNPLUS is “4C” at offset 150, but everything else seems arbitrary. Here is a sample I generated from XNPLUS if you want to take a closer look.
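For what it is worth, the one consistent byte can still be written down as a check. The Python sketch below tests for 4C at offset 150 and, as an extra assumption taken only from the two dumps shown above, that bytes 16 through 149 are all zero. It is far too weak to be a real signature, and the function name is just for illustration.

```python
MARKER_OFFSET = 150  # 0x96, where both dumps above show 4C


def looks_like_xname(data: bytes) -> bool:
    """Very weak heuristic for XNAME.LIB / XNAME.CRS style files."""
    if len(data) <= MARKER_OFFSET:
        return False
    # Assumption from the two dumps only: offsets 16..149 are zero filled.
    zero_run = not any(data[16:MARKER_OFFSET])
    return zero_run and data[MARKER_OFFSET] == 0x4C


# Rebuild the 235-byte sample from the first dump above.
sample = (
    bytes.fromhex("22802136") + bytes(146) + b"L<Unassigned>\x00"
).ljust(235, b"\x00")
print(len(sample), looks_like_xname(sample))
```

In practice the file name itself (XNAME.LIB or XNAME.CRS) is probably the stronger clue, with a byte check like this only as a sanity test.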

A2R / MOOF / WOZ

August 16, 2024 by Thor 2 Comments

There seems to be a never-ending, growing list of disk image formats. Many have features which are specific to the media and format. If you have ever imaged an older Macintosh floppy you know they are special. Add in copy protection, which many early Apple II floppies have, and you need special drives, hardware, and a special format to store the floppy data.

When imaging special media, especially unique media, it is best practice to image the floppies at the magnetic flux level.

Floppy disks contain magnetic fluctuations which are measured and recorded using specialized equipment. A popular method is using a Kryoflux board, floppy drive, and software. The software communicates with a custom controller board connected to a floppy drive through USB. If you are interested in the different controller boards, a good list has been compiled here.

A Kryoflux, FluxEngine, or Greaseweazle can all image specialized disks like a Macintosh 800k floppy, but the best controller board for them is an Applesauce setup, which is specifically designed for the task. With that task come a few specialty formats.

A file format which can store flux data is a bit different than a regular disk image format. The flux data contains all the low-level recordings which can then be interpreted into disk images much like the original floppy. In the case of an Applesauce flux image, it can contain all the small nuances of the original floppy, this includes recording any copy protection or other creative methods used by software vendors throughout the years. The format used for storing this flux data is the A2R format.

A2R is in its third iteration. Let’s take a look at the basics of the format.

hexdump -C Samplev3.a2r | head
00000000  41 32 52 33 ff 0a 0d 0a  49 4e 46 4f 25 00 00 00  |A2R3....INFO%...|
00000010  01 41 70 70 6c 65 73 61  75 63 65 20 76 31 2e 38  |.Applesauce v1.8|
00000020  38 2e 35 20 20 20 20 20  20 20 20 20 20 20 20 20  |8.5             |
00000030  20 02 01 01 00 52 57 43  50 e9 49 6e 01 01 24 f4  | ....RWCP.In..$.|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 43 01 00  |.............C..|
00000050  00 01 27 3a 25 00 91 d9  00 00 21 20 21 21 21 21  |..':%.....! !!!!|
00000060  1f 21 21 21 21 1f 24 5e  24 1f 21 21 20 21 24 5c  |.!!!!.$^$.!! !$\|
00000070  24 20 21 21 21 1f 24 5c  25 21 21 1f 21 21 23 5b  |$ !!!.$\%!!.!!#[|
00000080  25 20 21 21 21 1f 21 22  23 3f 41 3f 26 3e 43 3f  |% !!!.!"#?A?&>C?|
00000090  43 5f 41 27 3d 61 41 27  3d 61 3f 28 3e 61 3f 26  |C_A'=aA'=a?(>a?&|

hexdump -C Samplev2.a2r | head
00000000  41 32 52 32 ff 0a 0d 0a  49 4e 46 4f 24 00 00 00  |A2R2....INFO$...|
00000010  01 41 70 70 6c 65 73 61  75 63 65 20 76 31 2e 31  |.Applesauce v1.1|
00000020  2e 36 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |.6              |
00000030  20 02 01 01 53 54 52 4d  75 17 5d 01 00 01 e6 da  | ...STRMu.].....|
00000040  00 00 83 a9 12 00 12 1e  11 13 1e 13 1e 13 11 1f  |................|
00000050  21 1f 11 13 1c 14 1e 30  14 20 1e 14 1e 14 1c 14  |!......0. ......|
00000060  1c 13 11 20 21 1f 11 11  0f 13 1e 14 1c 14 2e 21  |... !..........!|
00000070  13 1e 13 1e 14 1e 11 11  20 21 1f 11 11 13 1e 1f  |........ !......|
00000080  13 20 30 21 11 11 0f 13  1e 13 11 30 1f 21 20 13  |. 0!.......0.! .|
00000090  11 30 1f 14 1e 30 14 1e  11 11 11 1e 13 11 1e 14  |.0...0..........|

The A2R format uses a chunk system to store the various pieces to the format. Earlier versions used a STRM Chunk to store all the raw flux data. Version 3 changed to a RWCP Chunk to store all the raw flux data. Applesauce uses a 2-pass imaging process, doing a rapid imaging to determine where on the media surface track data exists and then a second pass that captures longer durations for processing and error correction.
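The chunk layout is visible in both dumps: an 8-byte file header (A2R2 or A2R3 followed by FF 0A 0D 0A), then chunks made of a 4-byte ID, a 4-byte little-endian size, and that many bytes of data. A minimal Python sketch of walking those chunks, assuming that layout holds for the whole file (the function name is mine):

```python
import struct

A2R_MARKER = b"\xff\x0a\x0d\x0a"


def a2r_chunks(data: bytes):
    """Yield (chunk_id, data_offset, size) for each chunk in an A2R file."""
    if data[:3] != b"A2R" or data[4:8] != A2R_MARKER:
        raise ValueError("not an A2R header")
    pos = 8
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4].decode("ascii", "replace")
        (size,) = struct.unpack_from("<I", data, pos + 4)
        yield chunk_id, pos + 8, size
        pos += 8 + size


# Rebuild the first 61 bytes of Samplev3.a2r from the dump above.
info = b"\x01" + b"Applesauce v1.88.5".ljust(32) + b"\x02\x01\x01\x00"
sample = (
    b"A2R3" + A2R_MARKER
    + b"INFO" + struct.pack("<I", len(info)) + info
    + b"RWCP" + bytes.fromhex("e9496e01")
)
for chunk in a2r_chunks(sample):
    print(chunk)
```

On the version 2 sample the same walk would be expected to report STRM where version 3 reports RWCP, which matches the description above.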

Once the full raw flux data has been captured, that data can be interpreted as a disk image. The Applesauce software is able to make a regular disk image, a Disk Copy 4.2 file, which is well known and identified in PRONOM as fmt/625, but it can also create a couple of special disk image formats which allow for special nuances on an original disk.

The WOZ Disk Image format is an offshoot of the Applesauce project. Capturing highly accurate bit data is of no use if you don’t have a container to hold the data. The WOZ format was designed to be able to contain every possible Apple ][ disk structure and layout. It can be so accurate that even copy protected software can’t tell that it isn’t an original disk.

The WOZ format has become very popular in the Apple II community and is ideal for emulating all the old games and software titles popular in the early 1980s. You may have guessed where the name comes from. The Internet Archive has a large collection of WOZ disks in their WOZ-a-Day collection. The file format of a WOZ disk image is also a chunk based format similar to the A2R format, and it has two versions. Let’s take a look.

hexdump -C WOZ 1.0/Blazing Paddles (Baudville).woz | head
00000000  57 4f 5a 31 ff 0a 0d 0a  f6 f5 92 d6 49 4e 46 4f  |WOZ1........INFO|
00000010  3c 00 00 00 01 01 00 01  01 41 70 70 6c 65 73 61  |<........Applesa|
00000020  75 63 65 20 76 30 2e 32  36 20 20 20 20 20 20 20  |uce v0.26       |
00000030  20 20 20 20 20 20 20 20  20 00 00 00 00 00 00 00  |         .......|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000050  54 4d 41 50 a0 00 00 00  00 00 ff 01 01 01 ff 02  |TMAP............|
00000060  02 02 ff 03 03 03 ff 04  04 04 ff 05 05 05 ff 06  |................|
00000070  06 06 ff 07 07 07 ff 08  08 08 ff 09 09 09 ff 0a  |................|
00000080  0a 0a ff 0b 0b 0b ff 0c  0c 0c ff 0d 0d 0d ff 0e  |................|
00000090  0e 0e ff 0f 0f 0f ff 10  10 10 ff 11 11 11 ff 12  |................|

hexdump -C WOZ 2.0/Blazing Paddles (Baudville).woz | head
00000000  57 4f 5a 32 ff 0a 0d 0a  21 da c2 c8 49 4e 46 4f  |WOZ2....!...INFO|
00000010  3c 00 00 00 02 01 00 01  01 41 70 70 6c 65 73 61  |<........Applesa|
00000020  75 63 65 20 76 31 2e 31  20 20 20 20 20 20 20 20  |uce v1.1        |
00000030  20 20 20 20 20 20 20 20  20 01 01 20 00 00 00 00  |         .. ....|
00000040  0d 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000050  54 4d 41 50 a0 00 00 00  00 00 ff 01 01 01 ff 02  |TMAP............|
00000060  02 02 ff 03 03 03 ff 04  04 04 ff 05 05 05 ff 06  |................|
00000070  06 06 ff 07 07 07 ff 08  08 08 ff 09 09 09 ff 0a  |................|
00000080  0a 0a ff 0b 0b 0b ff 0c  0c 0c ff 0d 0d 0d ff 0e  |................|
00000090  0e 0e ff 0f 0f 0f ff 10  10 10 ff 11 11 11 ff 12  |................|

Unlike a common disk image, a WOZ image contains more than the bits on the disk. It also contains a mapping of all the tracks and the associated data, which is how it can even capture copy protection usually only possible with a physical disk. The ‘TMAP’ chunk contains a track map and the ‘TRKS’ chunk contains all the data.

What the WOZ is for the Apple II, MOOF was made for the Macintosh. You may wonder what is with the funny name, but there is a long history around “Clarus the Dogcow”. I’m sure this factoid will help you impress your friends or win at trivia night. Again, the purpose of the special format for Macintosh disks is to allow for emulating disks, even with copy protection. You can also find quite the collection of old Macintosh software in the MOOF format on the Internet Archive, even emulate your favorite game, such as Dark Castle, which I played for hours as a kid. Also a chunk based format, let’s take a look at the header.

hexdump -C Dark Castle v1.0 - Disk 1.moof | head
00000000  4d 4f 4f 46 ff 0a 0d 0a  b5 75 f9 4e 49 4e 46 4f  |MOOF.....u.NINFO|
00000010  3c 00 00 00 01 01 00 01  10 41 70 70 6c 65 73 61  |<........Applesa|
00000020  75 63 65 20 76 31 2e 37  33 20 20 20 20 20 20 20  |uce v1.73       |
00000030  20 20 20 20 20 20 20 20  20 00 13 00 00 00 00 00  |         .......|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000050  54 4d 41 50 a0 00 00 00  00 ff 01 ff 02 ff 03 ff  |TMAP............|
00000060  04 ff 05 ff 06 ff 07 ff  08 ff 09 ff 0a ff 0b ff  |................|
00000070  0c ff 0d ff 0e ff 0f ff  10 ff 11 ff 12 ff 13 ff  |................|
00000080  14 ff 15 ff 16 ff 17 ff  18 ff 19 ff 1a ff 1b ff  |................|
00000090  1c ff 1d ff 1e ff 1f ff  20 ff 21 ff 22 ff 23 ff  |........ .!.".#.|

All three formats created for imaging and emulating Apple and Macintosh software are well documented and open. They are also well suited for preservation, as they can contain extensive metadata in the INFO chunk, which gives provenance information on the source of the files. The Applesauce software even has a camera to photograph the disk itself for archiving. All of this makes these formats great for preservation and emulation. Take a look at my proposal for a signature on my GitHub.

Binder

August 9, 2024 by Thor

Microsoft is never in short supply of file formats. They have made many changes over the years and introduced lots of products, some lasting longer than others. The list is quite long.

One such product was called Office Binder. Introduced with Office 95, it was a companion application for combining a number of OLE objects together in one “Binder”. It was meant to be the digital version of the office binder one often uses for presentations or proposals.

You could add sections and include Word documents, images, PowerPoint presentations, Excel spreadsheets, basically any OLE object. Of course, a Binder file itself was an OLE compound object. Binder files had the extension OBD, and templates used OBT. The PRONOM registry has PUIDs for the different Binder versions, but there are some issues.

PUID	Format Name	Format Version	Extension
fmt/237	Microsoft Office Binder File for Windows	95	obd
fmt/240	Microsoft Office Binder File for Windows	97-2000	obd
fmt/238	Microsoft Office Binder Template for Windows	95	obt
fmt/241	Microsoft Office Binder Template for Windows	97-2000	obt
fmt/239	Microsoft Office Binder Wizard for Windows	95	obz
fmt/242	Microsoft Office Binder Wizard for Windows	97-2000	obz

filename : 'Binder95-s01.obd'
filesize : 5120
modified : 2024-08-08T21:24:34-06:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/240'
    format  : 'Microsoft Office Binder File for Windows'
    version : '97-2000'
    mime    : 
    class   : 
    basis   : 'extension match obd; container name Binder with name only'

Turns out only one of the PRONOM PUIDs has an actual signature; the others are placeholders. So when I run Siegfried on an Office Binder 95 file, it comes back as fmt/240, which points to an Office Binder 97-2000 file. It’s a simple signature, looking for an internal file named “Binder”, which is present in all the Binder file types.

    <ContainerSignature Id="5500" ContainerType="OLE2">
      <Description>Microsoft Office Binder File for Windows 97-2000</Description>
      <Files>
        <File>
          <Path>Binder</Path>
        </File>
      </Files>
    </ContainerSignature>

Taking a look inside the Office 95 Binder file, we can see the “Binder” file.

Path = Binder95-s01.obd
Type = Compound
Physical Size = 5120
Extension = compound
Cluster Size = 512
Sector Size = 64

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
                    .....          316          320  [5]SummaryInformation
                    .....          144          192  Binder
                    .....          280          320  [5]DocumentSummaryInformation
------------------- ----- ------------ ------------  ------------------------
                                   740          832  3 files

hexdump -C Binder95-s01/Binder 
00000000  90 00 00 00 05 00 00 00  00 00 00 00 05 00 00 00  |................|
00000010  00 00 00 00 a1 6a 8a 8e  cc 55 ef 11 ab 06 00 0c  |.....j...U......|
00000020  29 b1 b4 d0 00 00 00 00  00 00 00 00 00 00 00 00  |)...............|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 40 86 61 a6  |............@.a.|
00000040  0b ea da 01 00 00 00 00  00 00 00 00 40 86 61 a6  |............@.a.|
00000050  0b ea da 01 09 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060  00 00 00 00 2c 00 00 00  00 00 00 00 01 00 00 00  |....,...........|
00000070  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000080  2c 00 00 00 2c 00 00 00  13 03 00 00 44 02 00 00  |,...,.......D...|

The bytes within a “Binder” file have some patterns, but nothing decipherable.
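One of those patterns can at least be probed. The same eight bytes appear twice in the dump, at offsets 0x3c and 0x4c. A guess, not backed by any Binder documentation, is that they are a Windows FILETIME: a 64-bit little-endian count of 100-nanosecond ticks since 1601-01-01 UTC. The sketch below only tests that guess; the helper name `filetime_to_datetime` is made up for illustration.

```python
import struct
from datetime import datetime, timedelta, timezone

# Eight bytes copied from offset 0x3c of the "Binder" stream dump above
# (the same bytes repeat at offset 0x4c).
RAW = bytes.fromhex("40 86 61 a6 0b ea da 01")


def filetime_to_datetime(raw: bytes) -> datetime:
    """Read 8 little-endian bytes as a Windows FILETIME (assumption)."""
    (ticks,) = struct.unpack("<Q", raw)
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    # One tick is 100 ns, so ten ticks make one microsecond.
    return epoch + timedelta(microseconds=ticks // 10)


print(filetime_to_datetime(RAW).isoformat())
```

For this sample the value decodes to 9 August 2024 (UTC), within a few seconds of the modified time in the Siegfried output above once the -06:00 offset is applied. That makes the timestamp reading plausible, though it says nothing about the rest of the stream.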

Microsoft Office Binder was only included in three versions of Office: 95, 97, and 2000. Let’s look at the other two versions.

Path = Binder97-s04.obd
Type = Compound
Physical Size = 5632
Extension = compound
Cluster Size = 512
Sector Size = 64

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
                    .....           28           64  HdrFtr
                    .....          144          192  Binder
                    .....          260          320  [5]SummaryInformation
                    .....          404          448  [5]DocumentSummaryInformation
------------------- ----- ------------ ------------  ------------------------
                                   836         1024  4 files

Path = Binder2K-S01.obd
Type = Compound
Physical Size = 5632
Extension = compound
Cluster Size = 512
Sector Size = 64

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
                    .....           28           64  HdrFtr
                    .....          144          192  Binder
                    .....          260          320  [5]SummaryInformation
                    .....          232          256  [5]DocumentSummaryInformation
------------------- ----- ------------ ------------  ------------------------
                                   664          832  4 files

It looks like versions 97 and 2000 have an extra file. The “HdrFtr” file seems to reference a Header and Footer, which according to documentation was a feature added in Office 97.

What’s new in Office Binder 97

Office Binder makes it possible for you to group all your documents, workbooks, and presentations for a project in one place. To get started with Office Binder 97, add a new or existing document to your binder. Use the new Office 97 features while you work in a binder……. Print headers and footers for a binder

We can use the “HdrFtr” file within the container to differentiate between the 95 version and the 97-2000 formats. A closer look at the DocumentSummaryInformation file might allow a more precise identification later. There doesn’t seem to be anything to distinguish an OBD file from an OBT template file, so those PUIDs may not be needed. The other format related to the Binder software has the OBZ extension. It is called a Wizard template file in some documentation, but I have been unable to find any type of “Wizard” functionality in the Office Binder apps to generate a file. The OBZ format seems to have something to do with macros in Visual Basic. Luckily there are a few examples available on Office install discs.
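Going back to the “HdrFtr” idea for a moment, the rule can be sketched in a few lines of Python. This is only an illustration of the logic, not a PRONOM signature: it assumes the top-level stream names have already been listed by some other tool (for example the 7-Zip listings shown earlier), and the function name `classify_binder` is made up.

```python
def classify_binder(stream_names):
    """Guess the Binder generation from top-level OLE2 stream names.

    Returns "95", "97-2000", or None when no "Binder" stream is present.
    """
    names = set(stream_names)
    if "Binder" not in names:
        return None  # not a Binder container by this rule
    if "HdrFtr" in names:
        return "97-2000"  # header/footer stream only seen in 97 and 2000
    return "95"


# Stream names copied from the listings above.
binder95 = ["[5]SummaryInformation", "Binder", "[5]DocumentSummaryInformation"]
binder97 = ["HdrFtr", "Binder", "[5]SummaryInformation",
            "[5]DocumentSummaryInformation"]

print(classify_binder(binder95), classify_binder(binder97))
```

The rule rests on three sample files, so a 97 or 2000 binder saved without a “HdrFtr” stream, if such a thing exists, would be reported as 95.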

Path = CLIENT.OBZ
Type = Compound
Physical Size = 364032
Extension = doc
Cluster Size = 512
Sector Size = 64

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
1995-07-05 17:25:15 D....                            7
1995-07-05 17:25:14 D....                            5
1995-07-05 17:25:13 D....                            4
                    .....          106          128  4/[1]CompObj
                    .....           20           64  4/[1]Ole
                    .....         8880         9216  4/WordDocument
                    .....           32           64  4/[3]View000
                    .....          492          512  4/[5]SummaryInformation
                    .....          236          256  4/[5]DocumentSummaryInformation
1995-07-05 17:25:14 D....                            6
                    .....        17760        17920  6/Book
                    .....           20           64  6/[1]Ole
                    .....            0            0  6/[3]View000
                    .....          102          128  6/[1]CompObj
                    .....         3260         3264  6/[5]SummaryInformation
                    .....          192          192  6/[5]DocumentSummaryInformation
                    .....          106          128  5/[1]CompObj
                    .....           20           64  5/[1]Ole
                    .....         8055         8192  5/WordDocument
                    .....           32           64  5/[3]View000
                    .....         7280         7680  5/[5]SummaryInformation
                    .....          220          256  5/[5]DocumentSummaryInformation
1995-07-05 17:25:16 D....                            9
1995-07-05 17:25:15 D....                            8
                    .....        13857        14336  8/Book
                    .....           20           64  8/[1]Ole
                    .....            0            0  8/[3]View000
                    .....          102          128  8/[1]CompObj
                    .....          188          192  8/[5]SummaryInformation
                    .....          196          256  8/[5]DocumentSummaryInformation
                    .....          854          896  Binder
1995-07-05 17:25:19 D....                            10
                    .....        80382        80384  10/Book
                    .....           20           64  10/[1]Ole
                    .....            0            0  10/[3]View000
                    .....          102          128  10/[1]CompObj
                    .....         4044         4096  10/[5]SummaryInformation
1995-07-05 17:25:19 D....                            10/_VBA_PROJECT
                    .....         9425         9728  10/_VBA_PROJECT/812f9922c6
                    .....        12302        12800  10/_VBA_PROJECT/7b2f9922a4
                    .....        36937        37376  10/_VBA_PROJECT/dir
                    .....         6609         6656  10/_VBA_PROJECT/7e2f9922b5
                    .....        23014        23040  10/_VBA_PROJECT/872f9922e8
                    .....         7995         8192  10/_VBA_PROJECT/842f9922d9
                    .....         5338         5632  10/_VBA_PROJECT/902f992333
                    .....        36119        36352  10/_VBA_PROJECT/8d2f99231e
                    .....        18129        18432  10/_VBA_PROJECT/932f992342
                    .....        13055        13312  10/_VBA_PROJECT/b42fbcaa59
                    .....          208          256  10/[5]DocumentSummaryInformation
                    .....         4228         4608  [5]SummaryInformation
                    .....          956          960  [5]DocumentSummaryInformation
                    .....          106          128  9/[1]CompObj
                    .....           20           64  9/[1]Ole
                    .....         5914         6144  9/WordDocument
                    .....            0            0  9/[3]View000
                    .....         1520         1536  9/[5]SummaryInformation
                    .....          220          256  9/[5]DocumentSummaryInformation
                    .....        16141        16384  7/Book
                    .....           20           64  7/[1]Ole
                    .....            0            0  7/[3]View000
                    .....          102          128  7/[1]CompObj
                    .....          188          192  7/[5]SummaryInformation
                    .....          192          192  7/[5]DocumentSummaryInformation
------------------- ----- ------------ ------------  ------------------------
1995-07-05 17:25:19             345316       351168  55 files, 8 folders

Sure enough, the OBZ file has a Visual Basic macro project (_VBA_PROJECT). Unfortunately, it is nested in an additional folder within the container, and that folder is named with a number which is likely to change from file to file. That will make identification in PRONOM much more difficult, as the signatures are not designed for variable names. It is possibly something we can investigate later.
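Outside of PRONOM, a script can tolerate the variable folder name by matching on a pattern instead of a fixed path. The sketch below is a hypothetical helper, not something Siegfried or DROID provides: it assumes the container paths are available as plain strings in the form shown in the listing above, and the name `storages_with_vba` is made up.

```python
import re

# A numbered top-level storage, then the VBA storage directly inside it.
VBA_PATH = re.compile(r"^(\d+)/_VBA_PROJECT(/|$)")


def storages_with_vba(paths):
    """Return the numbered storages that contain a _VBA_PROJECT storage."""
    found = []
    for path in paths:
        match = VBA_PATH.match(path)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found


# A few paths copied from the CLIENT.OBZ listing above.
sample = [
    "Binder",
    "10/Book",
    "10/_VBA_PROJECT",
    "10/_VBA_PROJECT/dir",
    "9/WordDocument",
    "7/Book",
]

print(storages_with_vba(sample))
```

A “Binder” stream together with at least one such storage would flag a likely OBZ file, although one sample is thin evidence that every OBZ carries a macro project.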

Microsoft Binder was only released in Office 95, 97, and 2000, but was supported in Office XP and 2003 through an UNBIND.EXE application which would simply separate all the different objects back out to the individual files.

The Microsoft Office Binder is not included in Office 2003. However, if a Binder file created in a previous version of Office contains information you want to access, you can use the Unbind tool to pull out the information and save it in the formats of the appropriate programs. In order to do this procedure, the Unbind tool must be installed.

As always, you can look at some sample files and my proposal for updated signatures on my GitHub page.