Receiving electronic media from an outside source can be an adventure. Often times you find yourself sorting the valuable files and separating them from the chaff. There can be hidden files, cache files, application files, drivers, and everything in between. Determining what formats are important can sometimes be difficult, especially if you don’t know the file format of some of the files.
I was recently working on a collection of files which had been produced through some audio software. When working with audio, a WAVE file is what is usually kept as they contain the actual audio data. With these files they came with a couple other formats. One of those formats was a bunch of SFK peak files. These files are meant to be temporary as they are generated from the WAVE file to make opening of audio data faster. They are important, but can easily be regenerated. One could argue they have historical value, but also they don’t contain anything that can be used by itself, so alone they don’t have much value.
The other format found with the WAVE files have a CDP extension. These came up as unknown when using DROID. It is not a common extension so finding the name of the software which created the files wasn’t too hard. Let’s take a look at one of them.
Huh, this is a RIFF file. RIFF is most commonly used as the container used for WAVE and AVI files. You can read more about the RIFF format on a previous post. The RIFF container format can be used for all sorts of things. Looking at the internals we can see a few unique list chunk’s.
Lots of references to other files, specifically WAVE files. But not a lot of actual data. That is because this format turns out to be just a project format for some software called “CD Architect“. Sonic Foundry was an audio software developer for a few years before they sold their catalog to Sony in 2003. In looking at the manual for CD Architect version 5.2, it explains the CDP Project format.
CD Architect software handles the organization of your CD using a small project file (CDP) that saves information about source file locations, edits, cuts, and insertion points. This project file is not a multimedia file, but is instead used to create the CD when editing is finished.
Looking at another CDP file from the collection, I noticed something different.
That’s odd, the RIFF format is always uppercase ASCII, this is lowercase. Also the important RIFF form, which was “SFPJ” in the other sample, is missing. This is not a valid RIFF format.
But further down in the file I can see the same list chunks. Did they take RIFF format and make a proprietary version of their own? I think they may have. It seems the first example was from CD Architect version 4 and these other files are from CD Architect version 5. That complicates things. Sony stopped developing CD Architect after version 5.2d and maintained it for a few years before selling many of their titles to MAGIX Software. As far as I know there was never any new versions released. The software was very popular, as it had some really nice audio mastering features and was easy to use. Many were upset when the software was abandoned.
Creating a signature for both version 4 and version 5 CDP files will be pretty straightforward. I feel knowing what you have in a collection you are processing is the first step in making informed decisions. Wether or not you keep the project files are up for debate. Some may only want the final audio created from a CD Architect project, while others may want to see the way the audio was put together and mixed. Either way, the more you know…..
One more thing. CD Architect would default to saving a CDP project file, but could also save a “CD Image file”. This process actually would save the project to a full WAVE file with some extras baked in.
An image file is essentially a wave file with volume, crossfades, effects, mixes, and track information embedded. Burning an image file will reduce the risk of buffer underruns (especially if you have a complex project or are using a slow computer) since no audio processing is required.
Interesting, normally when working with track information in a single WAVE file you would need a companion CUE Sheet in order to reference the track layout of the Audio CD. So I am curious how they do all of this. Lets take a look at a “CD Image”.
mediainfo CDArch52d-s02.wav General Complete name : CDArch52d-s02.wav Format : Wave Format settings : PcmWaveformat File size : 5.05 MiB Duration : 30 s 0 ms Overall bit rate mode : Constant Overall bit rate : 1 411 kb/s Conformance errors : 2 RIFF : Yes General compliance : File size 5292434 is less than expected size 5292823 (offset 0x8) WAVE : Yes General compliance : Element size 5292811 is more than maximal permitted size 5292422 (offset 0xC)
Audio Format : PCM Format settings : Little / Signed Codec ID : 1 Duration : 30 s 0 ms Bit rate mode : Constant Bit rate : 1 411.2 kb/s Channel(s) : 2 channels Sampling rate : 44.1 kHz Bit depth : 16 bits Stream size : 5.05 MiB (100%)
Already seeing some issues with the format, but all the important bits are there. JHOVE doesn’t like them much either.
JHOVE is giving me two issues. The major error is the file appears truncated according to both MediaInfo and JHOVE. The InfoMessage which is less of an issue but more of a heads up that the WAVE file has an extra LIST type. “PQLS”, which was also in the CPD RIFF file we looked at earlier. So it seems by making a “CD Image” of a project embeds the project chunk data into the WAVE container. Identification is not an issue as these WAVE’s follow the standard pattern and therefore identify correctly, but one might want to be aware through further characterization these WAVE’s have some not so obvious extra data.
My attempts to find any samples from version 3 of CD Architect have failed. Until then, my proposal is to add version 4 & 5 to PRONOM with the signature on my Github page. There you will find a few samples as well.
For #WDPD24 and PRONOM Hackathon week this year, I want to find some older formats listed which did not have a signature. There is a list to choose from, but I wanted to find something I hadn’t worked on before. I came across two entries for Real Video:
PUID
Name
Extension
fmt/204
RealVideo Clip
rv
x-fmt/277
Real Video
rv
I was familiar with Real Media and Real Audio, but had yet to come across any RealVideo with the RV extension. I thought it would be easy to find some references and samples, but that was not the case. I assume PRONOM originally added these based on MIME types available.
Real or RealNetworks is/was an Internet media company who jumped on the rapidly growing World Wide Web in 1995 to become a leader in Internet Media Delivery. Their initial offerings mainly focused on audio streaming and they accomplished all of this by providing free players and web browser extensions to make it easy to serve up a website with streaming media everyone could enjoy. Later adding video streaming optimized for the slower dialup and connections of the day. They used codecs based on common technology like H.263 and H.264, but used then to make their own proprietary codecs identified through FourCC codes, RV10-RV60.
So thought it would be easy to find a reference to the RV extension, I quickly discovered it wasn’t. Looking at the Wikipedia page on RealVideo, I found no reference to the RV extension. RV is an abbreviation for RealVideo, right? Well, I ended up finding a reference in the RealAudio page under file extensions. Ok, First clue to the existence of the RV extension. The page references RV as being used for video only files and was used by the flagship encoder (RealProducer).
RealProducer was the tool for creating the streaming audio and video formats that could then be used for your website or streaming platform. The RealProducer software came in a Basic version, which was free, and the Plus or Pro version, which was not free and provided more options. The first version of RealProducer to make video files was version 4. I was able to find a copy of the encoder and installed it under a Windows 95 emulator. To my surprise it only saved to the RealMedia RM file format. This format is well known and identified with PRONOM as x-fmt/190 also documented at the LoC.
This was the same with RealProducer 5, 7, 8, 9, and 10 that I was able to try. All made no mention of the RV extension. I was starting to feel this format didn’t exist or that some decided to use the RV extension on their own. Searches on Google yielded a couple results, mostly from users who had found a few files on their older discs and wanted to migrate them to something newer. I was able to find one example, one user shared, but it had the same header as the RealMedia format. The clue was in the file.
RealProducer Basic 11 for Windows. The Wikipedia article did hint at this by saying “the latest version of RealProducer reverted to using .ra for audio only files and began using .rv for video files with or without audio.” Why would they use the RM extension for so long, then revert to a different extension with a later version? I found more in the User Manual for version 11.
• .rv – RealVideo RealProducer uses the .rv file extension if the input is video-only or video-with-audio. You can also select the .rm file extension for video content. Tip: Using the .rv file extension helps search engines identify the file as a RealVideo clip.
• .rm – RealAudio or RealVideo RealProducer chooses the .rm file extension if it cannot determine the content of the input clip. You can use .rm file extension for any RealAudio or RealVideo clip, except for variable bit-rate clips.
Ok, so a few things to learn from this. One is the RV extension was used as the default for version 11 as they wanted search engines to identify them as a RealVideo clip. Second thing we learned is there is no difference between the two placeholders in PRONOM, one being a RealVideo file and the other being a RealVideo Clip. We don’t need both.
Now, is there any difference between an RV and RM file?
They both look very similar to me. Aside from a few bytes, they are practically identical. Lets see what MediaInfo has to say.
mediainfo Producer11-01.rv General Complete name : Producer11-01.rv Format : RealMedia File size : 164 KiB Duration : 6 s 999 ms Overall bit rate : 225 kb/s Frame rate : 24.000 FPS Copyright : (C) 2005 FileExtension_Invalid : rm rmvb ra
Video ID : 0 Format : RealVideo 4 Codec ID : RV40 Codec ID/Info : Based on AVC (H.264), Real Player 9 Duration : 6 s 999 ms Bit rate : 181 kb/s Width : 640 pixels Height : 424 pixels Display aspect ratio : 3:2 Frame rate : 24.000 FPS Bits/(Pixel*Frame) : 0.028 Stream size : 155 KiB (94%)
Audio ID : 1 Format : Cooker Codec ID : cook Codec ID/Info : Based on G.722.1, Real Player 6 Duration : 7 s 429 ms Bit rate : 44.1 kb/s Channel(s) : 2 channels Sampling rate : 44.1 kHz Bit depth : 16 bits Stream size : 40.0 KiB (24%)
mediainfo Producer11-01.rm General Complete name : Producer11-01.rm Format : RealMedia File size : 151 KiB Duration : 6 s 999 ms Overall bit rate : 225 kb/s Frame rate : 24.000 FPS Copyright : (C) 2005
Video ID : 0 Format : RealVideo 4 Codec ID : RV40 Codec ID/Info : Based on AVC (H.264), Real Player 9 Duration : 6 s 999 ms Bit rate : 181 kb/s Width : 640 pixels Height : 424 pixels Display aspect ratio : 3:2 Frame rate : 24.000 FPS Bits/(Pixel*Frame) : 0.028 Stream size : 155 KiB
Audio ID : 1 Format : Cooker Codec ID : cook Codec ID/Info : Based on G.722.1, Real Player 6 Bit rate : 44.1 kb/s Channel(s) : 2 channels Sampling rate : 44.1 kHz Bit depth : 16 bits
Other than the RV file having a invalid file extension, they both identify as a RealMedia file and have identical properties. So it seems the RV file is really no different than the RM file. I think the best course of action for PRONOM is to deprecate these two RV PUID’s and just ad RV as an acceptable extension for the RealMedia format.
To add to the evidence, here is the output from ffprobe:
But wait, there are a couple formats we could add which are related to RealProducer. RealProducer used a few other formats to manage projects and other metadata for streaming. They include:
Don’t get excited, the RealPix Image format really isn’t an image, it is simply an XML file with all the details of an image or group of images. Pretty boring. It was however a big thing in the day, even got a full guide written up for the process. “All information in the file occurs between an opening <imfl> tag and a closing </imfl> tag. This is the only tag that uses an end tag.” This format was the topic of discussion as malicious code could be in the RP file and executed just by having someone load your webpage. IMFL is obviously an acronym, but none of the documents I could find tells me what it stands for, so I did what everyone does now, I asked ChatGPT.
The RealPix format by RealNetworks, which was used for interactive multimedia content, indeed utilized IMFL as its tagged format. IMFL stands for “Interleaved Media File Language.” This markup was particularly designed to handle multimedia presentations, allowing the synchronization of images, audio, and video in a slideshow-style format. It used XML-like syntax where elements like <imfl>, <head>, and <fadein/> defined media objects, transitions, and their timing. Key components included attributes for positioning, color, and animation effects, making RealPix a flexible format for creating multimedia sequences compatible with RealPlayer.
For technical details, the RealPix format closely resembles SMIL (Synchronized Multimedia Integration Language) and supports strict tag closure and case sensitivity. This means all tags and attribute names must be lowercase, and attributes must be in double quotes, as seen in SMIL and RealSystem G2 markup, RealNetworks’ broader multimedia framework.
When I asked for a source, it could not give me one. So not sure if it is the correct answer, but it seems to fit. Here are some samples of RP, RT and SMIL files.
For RealText with the RT extension, we find a similar tagged text. This format is used to provide text presentations to go along with Images, Audio, or Video. The tagged text then describes when and how the text is displayed. This is all done in a player window, therefore the root tag of these RT documents starts and ends with <window>. I guess these could be considered a subtitle format for streaming media.
The SMIL files is interesting, it is known standard, but in many cases, does not have an XML declaration, therefore not identified by current PRONOM. They are used to link everything together. I might suggest a variant of the SMIL format to not have the XML declaration to identify these formats correctly.
The .RPAD RealProducer Audience File, .RPJF RealProducer Job File, .RPSD RealProducer Server Destination are all XML files for managing some of the configuration found in the RealProducer software.
Those three formats should be easy enough, especially if we look for Namespace urls.
The RAM and RPM formats are simply text files with a URL. You can find some samples here and here.
An RM and RV file are the same format as the RMVB file but just with a variable bitrate. Later on a new format was used to improve the quality of video. This format has the extension RMHD, referring to RealMedia HD. Let’s take a look.
The format looks very similar, but has the magic header of .RMP instead of .RMF. MediaInfo and FFProbe are unaware of the format. The software mentions a RV11 codec which is confusing as the codecs went from RV10-RV60.
Phew, that was a lot considering the two formats I tried to research came up the same as an existing format. There are probably others I have missed. I did see a reference to an RMX format which seems to be an encrypted RM file. The header is the same so it will identify as a RealMedia file, but with the wrong extension. Let me know if you come across any. I have some samples of the formats mentioned here, plus a proposal of new signatures on my Github repository.
Last week I had the pleasure of attending the 20th annual iPres conference on Digital Preservation in Ghent, Belgium. I enjoyed hearing from many of my respected colleagues on many aspects of preservation including one of my favorite topics, floppy disks. There was tutorials, lightning talks, and even a workshop, presented by Leontien Talboom, Elizabeth Kata, Chris Knowles, and myself. We titled the workshop “A Guide to Imaging Obscure Floppy Disk Formats“. The workshop was conceived by a mutual interest in imaging Wang 5.25in word processor disks, but expanded to include imaging of Amstrad 3in disks, 240K Brother Typewriter Disks, and Macintosh 400/800k disks.
I brought my hand soldered FluxEngine board and others brought their Greaseweazle board to show off how imaging obscure and uncommon disks can be done on a budget.
During the conference we talked a bit about the different type of hardware that can be used and the difference between a disk image and flux image. There seems to be quite the exhaustive list of different types of file formats, some specific to a platform and others more generic. I recently did a blog post on the formats used by the Applesauce software, which have some unique features.
There are many disk image types which should be researched and added to PRONOM and other format description sites, but today lets take a look at a generic format used by many tools.
The HxC Floppy Emulator file format which the extension HFE is a popular format used with floppy drive emulators. There is a lot of complexity with what is included in many of these image formats, some are simply a raw sector representation of the binary data on a disk, others contain the complete flux readings from a floppy disk. The HFE format contains a little more than a raw image, including a header, a track lookup table, and the bitstreams for each track all with the purpose of emulating the physical media. The HFE format contains only a single pass over the data, where other formats may contain multiple reading of each track to get more complete data which can be helpful for damaged or purposely copy-protected disks. You can read more on Ashley’s blog, Library of Congress format description.
When using the HxC Floppy Emulator software, you can open and save to many different formats. The main format being their HFE native format. It comes in 5 versions.
Above is a hexdump of the main SDCard HxC Floppy Emulator file format. The format specification shows the 8 byte header “HXCPICFE”. This is a very unique pattern and should be all we need to make a robust signature for the format, but we do need to take into account the other HFE “versions” and see if they might clash or need to be identified separately.
The “Rev 2” version also has the same header. But if you look at the 9th byte you can see the value changed from 00 to 01, which according to the specification, this is the revision byte.
This last format seems to be a special HxC stream image.
It seems the best option is to make three signatures to identify the three main headers. Additional software can be used to further parse the disk image. If you would like to see some sample images, you can download a bunch here. You can also take a look at my GitHub repository to see additional samples and a proposed set of signatures.
The year was 2001 and I found myself in need of an audio player and recorder. I had been burning CD’s for a few years, making mixed CD’s was fun and convenient, but I needed more flexibility. After some research I decided on a device that was super popular outside the United States, but was gaining some loyal fans.
This MZ-G750MiniDisc device could record in a standard high quality mode through RCA, optical digital cable, and an optional microphone in mini-plug. This model also had the LP2 and LP4 modes which compressed higher, but could record up to 320 minutes on one MD disc.
Sony accomplished this by using a propriety compression codec called ATRAC, or Adaptive TRansform Acoustic Coding. This compression format was used with the MiniDisc and other Sony devices like the the flash memory Walkman’s sold later.
I recorded and stored a lot of music on the few disc’s I purchased over the next year, but as you may have surmised, the iPod came out later that year. I waited a bit but eventually purchased the updated 10GB model and the MiniDisc only was used to make a few recordings over the next little while.
As good as the MiniDisc is, the model I owned could record in a digital format, but lacked the connections to transfer the audio to a computer unless you used the optical cable and captured in real time to a computer with an optical input. This was by design, even when they put USB ports on later models, the software only allowed sending audio to the MiniDisc, but not back from the device.
A few years back I heard of some work the community has done to bring MiniDisc’s back from shadows. Now there is a thriving market and some models can cost a pretty penny. With that came some great tools and the ability to copy from the device back to the computer. The only problem, my device lacks a USB port. I kept my eye out for a “good” deal on a NetMD MiniDisc device. It took some time, but I am happy to report I am now the proud owner of a MZ-N420D.
With a new USB capable NetMD in hand, lets take a look at the different ATRAC formats!
The most common ATRAC formats are the ATRAC3 versions which generally have the extension OMA or OMG. But let’s start with ATRAC1, the format used on my earlier MiniDisc device when captured in Standard Mode. Using the amazing https://webmd.pro/ tool, I was able to connect my new device and “archive” my disc.
ffprobe -i Test1.aea [aea @ 0x7fc5e6c04fc0] Estimating duration from bitrate, this may be inaccurate Input #0, aea, from 'Test1.aea': Duration: 00:00:01.63, bitrate: 302 kb/s Stream #0:0: Audio: atrac1, 44100 Hz, stereo, fltp, 292 kb/s
ATRAC1 files can have the AEA extension, which ffmpeg can decode, but MediaInfo doesn’t appear to have added the support. According to the decoder the magic numbers for the ATRAC1 format are “Magic is ‘00 08 00 00‘ in little-endian”. This pattern matches my files, but the recent addition PRONOM fmt/1968 doesn’t match all the samples I have.
The magic numbers are too simple to be the only pattern used in a signature. The Track title follows the magic numbers but are not static. Then there are quite a bit of zero bytes, like a lot. All the samples I have seem to have some data around the 260 offset, then more zero bytes until around 2400 to 2800 byte offset range. I scanned all the samples I have through Tridscan, and it looks like the only bytes in common are the magic header, lots of zero’s, and a few strings.
The ffmpeg libavformat code does tell us at byte 264 there will be a 01 or 02 which indicates channels. 44.1 kHz is assumed and the bitrate is calculated from a constant by how many channels, so not much else to identify common patterns. More testing needed.
ATRAC3 is what allowed my original MiniDisc to record in LP2 and LP4, extending the recording time. This format was also how some DRM was added to the device and computer to allow for some checking-in and checking-out of files, but to control their use. This was done with Desktop software from Sony, originally in the form of the title SonicStage, later incorporating OpenMG to manage the DRM. I used SonicStage to encode some audio into OMG and OMA formats.
OpenMG format files
These are audio files which have been converted to ATRAC3 format and encrypted in OpenMG format, which is the copyright protection technology for audio contents specific to OpenMG (with the extension .omg).
ffprobe -i /01-Untitled.omg [oma @ 0x7fed2440e980] Format oma detected only with low score of 1, misdetection possible! [oma @ 0x7fed2440e980] Couldn't find the EA3 header ! /01-Untitled.omg: Invalid data found when processing input
The good news is there appears to be a standard header for the OMG format, but ffmpeg assumes they are OMA files. Turns out OMG was the original form of the format, but was replaced with OMA starting with SonicStage v2.1.
We learned from trying an OMG file in ffprobe that ffmpeg is looking for EA3 header, which is found in this OMA file. Both of these formats should have a nice header to work from for a signature. In fact there has already been a request and signature submitted for the OMA format. Mine are slightly different, but only takes a small tweak to work with all my samples. Also, it seems the extension AA3 was used for awhile before settling on OMA. OMA can have a few different types:
ffprobe -i 02-Untitled.oma [oma @ 0x7fbc7ef047c0] Estimating duration from bitrate, this may be inaccurate Input #0, oma, from '/Star Trek/02-Untitled.oma': Metadata: title : Untitled(2) album : Star Trek OMG_TRACK : 2 OMG_ALBMS : Star Trek OMG_ASGTM : 2366000 OMG_TIT2S : Untitled(2) TLEN : 27000 Duration: 00:00:27.21, start: 0.000000, bitrate: 193 kb/s Stream #0:0: Audio: atrac3p ([1][0][0][0] / 0x0001), 44100 Hz, stereo, fltp, 192 kb/s
I’ll leave the technical properties to be handled by tools more suited for parsing the format like ffmpeg. Maybe MediaInfo could have the formats added, but until then, it might be best to simply identify the main format. I am also aware of some later additions to the ATRAC family, such as ATRAC3plus, ATRAC Advanced Lossless, and ATRAC9 (WAV RIFF). There are other extensions like AT3 out there which use the ATRAC codec, like Sony’s Playstation or PSP. I will have to keep my eyes out for the even more elusive Hi-MD MiniDisc devices to find out more. For now, take a look at some samples and my proposal for signatures on my GitHub.
Most File Systems have unique ways for doing things, but also many things in common. On a Macintosh you might have some extended attributes, or that pesky hidden .DS_Store file no one really knows why it’s there. On Windows you may find a hidden thumbs.db file throwing off your file count. Hidden files are everywhere. Many have a real purpose, and that purpose may be insignificant or important in finding or giving context to other files.
While processing a collection from a USB drive the other day, I came across a few files I hadn’t seen before. They were hidden files nestled in with a few folders of PDF’s. They have a unique name, so I figured it would be easy to find some documentation on them on the web. Turns out, there is very little.
-rwx------@ 1 tyler staff 235 Aug 22 00:04 XNAME.CRS -rwx------@ 1 tyler staff 235 Aug 22 00:04 XNAME.LIB
The files were only a couple years old, so I figured there had to be some modern software which created them. A look inside the files with a hex editor didn’t provide much information.
I was about to give up since there wasn’t much data and since they were hidden files, I assumed they were probably just some cached files with little value. But wanting to learn more I did some more digging and at first thought they might have something to do with DropBox, as a user said they just showed up one day, but later found they probably were created by some Document Management software known as Worldox. I found a support page claiming these two files are part of a database.
XNAME.LIB Contains document numbers (DOS names), extended names, and file security information. XNAME.CRS Contains custom profile field and version control information.
There is a key term in the definition of XNAME.LIB, “extended names”. I was curious what that meant and found Worldox has been around awhile. The World Software Corporation has been around since 1988 and Worldox was released in 1993, but before that it specialized in an interesting DOS software package called “Extend-A-Name” or “Extend-A-File”. The name gives away its purpose, it literally extends the name of the limited 8 Characters you could use in DOS. I can remember trying to decide on a filename that would accurately describe my file so I knew what it was later on. 8 characters is not enough to explain the content of a file, especially if you have hundreds or thousands of file to manage.
Extend-a-file was software which bonded with another piece of software like WordPerfect and loaded itself in memory. Then when you went to create a file or locate a file within WordPerfect, Extend-a-File would take over and allow you to create a file with a traditional 8 Character name, but also a name much longer.
This extended name allowed you to describe the files content with much more detail. Making it also very easy to find previous documents.
Pretty slick, this software really would make a big difference to managing a large amount of files in the old DOS days. Ok, it adds extended names, but where is this information stored? That is where the XNAME files come into play.
This XNAME.LIB was generated by a running copy of XNPLUS circa 1990, bonded with a copy of DisplayWrite 4. It adds much more information within the “Library”.
So it seems this method of storing the extended filenames and other metadata started in the Extend-a-File software and has been brought along and used in modern versions of the Worldox Document Management software. Much of its purpose to extend an 8 character filename to around 60 character is no longer needed as most systems now allow for filenames with at least 256 characters. I imagine there is more the software can add to these files, but found that the samples I have really don’t have any information in them at all. The Worldox software seems to be marketed toward law firms and others who have a lot of documents to manage, but I have been unable to find a way to play with the software to see what can be embedded within the XNAME.LIB files.
There is also some discussion out there about wether to backup these two hidden files and what might happen if they are lost. Regardless, you may want to think twice before tossing them as I almost did. They could contain valuable information needed to give context.
I am not sure it is possible to have a good signature for identification of these files. The samples I have and others I found online, here, here, here, and here, just don’t have much data within them. In fact they are all exactly 235 bytes. The only consistent byte within them and the samples I generated from XNPLUS is “4C” at offset 150, but everything else seems arbitrary. Here is a sample I generated from XNPLUS if you want to take a closer look.
There seems to be a never ending growing list of disk image formats. Many have features which are specific to the media and format. If you have ever imaged an older Macintosh floppy you know they are special. If you add in copy-protection which many early Apple II floppies have, and you need special drives, hardware, and a special format to store the floppy data.
When imaging special media, especially with unique media, it is best practice to image the floppies at the magnetic flux level.
Floppy disks contain magnetic fluctuations which are measured and recorded using specialized equipment. A popular method is using a Kryoflux board, floppy drive, and software. The software communicates with a custom controller board connected to a floppy drive through USB. If you are interested in the different controller boards, a good list has been compiled here.
A Kryoflux, fluxengine, greaseweazle, all can image specialized disks like a Macintosh 800k floppy, but the best controller board for them is an Applesauce setup. They are specifically designed to for the task. With that task, comes a few specialty formats.
A file format which can store flux data is a bit different than a regular disk image format. The flux data contains all the low-level recordings which can then be interpreted into disk images much like the original floppy. In the case of an Applesauce flux image, it can contain all the small nuances of the original floppy, this includes recording any copy protection or other creative methods used by software vendors throughout the years. The format used for storing this flux data is the A2R format.
A2R is in its third iteration. Let’s take a look at the basics of the format.
hexdump -C Samplev2.a2r | head 00000000 41 32 52 32 ff 0a 0d 0a 49 4e 46 4f 24 00 00 00 |A2R2....INFO$...| 00000010 01 41 70 70 6c 65 73 61 75 63 65 20 76 31 2e 31 |.Applesauce v1.1| 00000020 2e 36 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |.6 | 00000030 20 02 01 01 53 54 52 4d 75 17 5d 01 00 01 e6 da | ...STRMu.].....| 00000040 00 00 83 a9 12 00 12 1e 11 13 1e 13 1e 13 11 1f |................| 00000050 21 1f 11 13 1c 14 1e 30 14 20 1e 14 1e 14 1c 14 |!......0. ......| 00000060 1c 13 11 20 21 1f 11 11 0f 13 1e 14 1c 14 2e 21 |... !..........!| 00000070 13 1e 13 1e 14 1e 11 11 20 21 1f 11 11 13 1e 1f |........ !......| 00000080 13 20 30 21 11 11 0f 13 1e 13 11 30 1f 21 20 13 |. 0!.......0.! .| 00000090 11 30 1f 14 1e 30 14 1e 11 11 11 1e 13 11 1e 14 |.0...0..........|
The A2R format uses a chunk system to store the various pieces to the format. Earlier versions used a STRM Chunk to store all the raw flux data. Version 3 changed to a RWCP Chunk to store all the raw flux data. Applesauce uses a 2-pass imaging process, doing a rapid imaging to determine where on the media surface track data exists and then a second pass that captures longer durations for processing and error correction.
Once the full raw flux data has been captured that data can be interpreted as a disk image. The Applesauce software is able to make a regular disk image, a Disk Copy 4.2 file, which are well known and identify in PRONOM as fmt/625, but can also create a couple of special disk image formats which allow for special nuances on an original disk.
The WOZ Disk Image format is an offshoot of the Applesauce project. Capturing highly accurate bit data is of no use if you don’t have a container to hold the data. The WOZ format was designed to be able to contain every possible Apple ][ disk structure and layout. It can be so accurate that even copy protected software can’t tell that it isn’t an original disk.
The WOZ format has become very popular in the Apple II community and is ideal for emulating all the old games and software titles popular in the early 1980’s. You may have guessed where the name comes from. The internet archive has a large collection of WOZ disks in their WOZ-a-Day collection. The file format of a WOZ disk image is also a chunk based format similar to the A2R format, it has two versions. Let’s take a look.
Unlike a common disk image, a WOZ image contains more than the bits on the disk, it contains a mapping of all the tracks and the associated data, this is how it can even contain copy-protection usually only possible with a physical disk. The ‘TMAP’ chunk contains a track map and the ‘TRKS’ chunk contains all the data.
All three formats created for imaging and emulating Apple and Macintosh software are well documented and open. They are also well suited for preservation as they can contain extensive metadata in the INFO chunk which gives provenance information on the source of the files. The Applesauce software even has a camera to photograph the disk itself for archiving. All of this makes these formats great for preservation and emulation. Take a look at my proposal for a signature on my Github.
Microsoft is never in short supply of file formats. They have made many changes over the years. Introduced lots of products, some lasting longer than others. The list is quite long.
One such software was called Office Binder. Introduced with Office 95, it was a companion application to combine a number of OLE objects together in one “Binder”. Meant to be the digital version of an Office Binder one often uses for presentations or proposals.
You could add sections and include Word documents, Images, Powerpoint, Excel spreadsheets, basically any OLE object. Of course a Binder file itself was an OLE compound object. They had the extension OBD, and templates used OBT. The PRONOM registry has PUID’s for the different Binder versions, but there are some issues.
filename : 'Binder95-s01.obd' filesize : 5120 modified : 2024-08-08T21:24:34-06:00 errors : matches : - ns : 'pronom' id : 'fmt/240' format : 'Microsoft Office Binder File for Windows' version : '97-2000' mime : class : basis : 'extension match obd; container name Binder with name only'
Turns out only one of the PRONOM PUID’s has an actual signature, the others are placeholders. So when I run Siegfried on an Office Binder 95 file, it comes back as fmt/240 which points to an Office Binder 97-2000 file. It’s a simple signature, looking for an internal file named “Binder”, which is inherent of all the Binder file types.
<ContainerSignature Id="5500" ContainerType="OLE2"> <Description>Microsoft Office Binder File for Windows 97-2000</Description> <Files> <File> <Path>Binder</Path> </File> </Files> </ContainerSignature>
Taking a look inside the Office 95 Binder file, we can see the “Binder” file.
It looks like version 97 and 2000 have an extra file. The “HdrFtr” file seems to reference a Header and Footer, which according to documentation was a feature added in Office 97.
What’s new in Office Binder 97
Office Binder makes it possible for you to group all your documents, workbooks, and presentations for a project in one place. To get started with Office Binder 97, add a new or existing document to your binder. Use the new Office 97 features while you work in a binder……. Print headers and footers for a binder
We can use the “HdrFtr” file within the container to differentiate between the 95 version and 97-2000 formats. Perhaps, a closer look at the DocumentSummaryInformation file in the future, might help with a more precise identification later. There doesn’t seem to be anything to distinguish an OBD file from a OBT template file, so those PUID’s may not be needed. The other format related to the Binder software has the OBZ extension. It is called a Wizard template file in some documentation, but I have been unable to find any type of “Wizard” functionality in the Office Binder Apps to generate a file. The OBZ format seems to have something to do with macros in Visual Basic. Luckily there are a few examples available on Office install disc‘s.
Sure enough, the OBZ file has a Visual Basic macro (VBA_Project). Unfortunately, it appears to be nested in an additional folder within the container, with a variable number number which is likely to change from file to file. That fact will make identification in PRONOM much more difficult, as the signatures are not designed for variable names. Possibly something we can investigate later.
Microsoft Binder was only released in Office 95, 97, and 2000, but was supported in Office XP and 2003 through an UNBIND.EXE application which would simply separate all the different objects back out to the individual files.
The Microsoft Office Binder is not included in Office 2003. However, if a Binder file created in a previous version of Office contains information you want to access, you can use the Unbind tool to pull out the information and save it in the formats of the appropriate programs. In order to do this procedure, the Unbind tool must be installed.
As always, you can look at some sample files and my proposal for updated signatures on my GitHub page.
Researching file formats isn’t for everyone. Others may find it boring or even odd. Trying to explain to others the nuances of a binary format versus a container format would bring many tears. Their reactions sometimes are similar to hearing someone explain their belief in aliens. Passionate, but a bit on the crazy side.
So with aliens and containers on my mind, let’s take a look at a format with the extension UFO. It is not an unidentified flying object or a UAP, it may as well be an unidentified file object, but in this case it is a “Ulead File for Objects” format. It is the exclusive file format for use with the PhotoImpact software from Ulead Systems, a Taiwanese developer known for many popular software programs. First released in 1996 with version 3, the PhotoImpact software was marketed as “a fully object-based tool, which pioneered a number of important innovations“.
The reason it was a considered a full object-based tool was the UFO format is based on the, at the time, popular OLE Compound File Storage object format developed by Microsoft. So by using some OLE tools we can take a closer look at some of these Unidentified File Objects……..
oleid Sample.ufo oleid 0.60.1 - http://decalage.info/oletools THIS IS WORK IN PROGRESS - Check updates regularly! Please report any issue at https://github.com/decalage2/oletools/issues
Filename: Sample.ufo --------------------+--------------------+----------+-------------------------- Indicator |Value |Risk |Description --------------------+--------------------+----------+-------------------------- File format |Generic OLE file / |info |Unrecognized OLE file. |Compound File | |Root CLSID: - None |(unknown format) | | --------------------+--------------------+----------+-------------------------- Container format |OLE |info |Container type --------------------+--------------------+----------+-------------------------- Encrypted |False |none |The file is not encrypted --------------------+--------------------+----------+-------------------------- VBA Macros |No |none |This file does not contain | | |VBA macros. --------------------+--------------------+----------+-------------------------- XLM Macros |No |none |This file does not contain | | |Excel 4/XLM macros. --------------------+--------------------+----------+-------------------------- External |0 |none |External relationships Relationships | | |such as remote templates, | | |remote OLE objects, etc --------------------+--------------------+----------+--------------------------
Well, it is a OLE file, but is unrecognized/unidentified by the oletools software. It also appears to be missing the root entry and CLSID you commonly find in OLE files. Since this is an OLE container we can also just use 7zip to peek inside as well.
In this sample file, we have a bunch of directories and objects, but none of what we expect to see in an OLE file, such as a “SummaryInformation” or “DocumentSummaryInformation” like we would see in a Word DOC file. By not having the standard contents of the container, it makes these files very specific to PhotoImpact software.
Here is another UFO file from the last version of the software PhotoImpact X3 when it was owned by Corel, but phased out in 2009. This is the basic file structure with no objects added to the file. We can be fairly confident these are the base files in most every other UFO file. It doesn’t have any of the “OS” folders which contain the objects, so I think the LtfHeader file might be our best bet for a signature. Let’s take a look at the Hex values for a few of them.
Making a signature using the first 8 bytes of the LtfHeader file appears to have worked for all the 3,400+ sample files I have collected. Problem is it also worked for another extension found in the some of the later versions of PhotoImpact.
When you have successfully finished your template, make sure to save it in the Ulead File For Photo Project format (*.UFP). This allows you to open and use your template in the Photo Projects dialog box. In the Template tab, click Open Project and browse for the created file.
They appear to be a template version for the format so we should be fine just adding the extension to the same signature.
Well, this Unidentified File Object is no longer unidentifiable. Was it sent by aliens? Possibly, but at least we know where these UFO’s came from, PhotoImpact. Take a look at the samples and proposed signature in my GitHub.
I have used and have researched a lot of audio editing software. Some are very simple and straightforward, others are feature rich and take some time to learn. While looking in a format, I came across some Audio software which nothing like I have used before. At first I was confused, I figured it would be simple to open a certain file format and play the audio. Not so fast.
Max is software which proudly says it is an, “infinitely flexible space to create your own interactive software”. Created by Cycling ’74 software, Max has been around for awhile, being developed in the mid 1980’s. It allows the user to make “patches” stringing around components and effects to accomplish an infinite amount of options and outcomes.
The software produces simple project files and patch files, but hey are just JSON data, at least in the latest version. But when working with audio files the software can save to a number of formats.
One of the options is a format called “SDIF”, which stands for “Sound Description Interchange Format“. SDIF was jointly developed by IRCAM and CNMAT, with proposals starting back in the mid-1990’s. Originally written as a Spectral Description, it was later changed to refer to a Sound Description.
The Specification states the general idea was to “store information related to signal processing and specifically of sound, in files, according to a common format to all data types. Thus, it is possible to store results or parameters of analyses, syntheses…” So not exactly the same as a simple WAVE file you can open and edit, this format was meant to store signal data for analysis.
Each SDIF file consists of a header and then an overall a succession of frames, not unlike chunks in the IFF/AIFF/RIFF formats, ordered in time. Each frame matrix declares a “Type” which can be a combination of many options. Lets take a look at a SDIF file:
This test file has the opening frame “SDIF“, to identify it as an SDIF, then a reference to the type “1TRC“. I would try and explain a Matrix 1TRC Sinusoidal Track, but I have no idea what it means. Something, something sine wave, etc. Someone much smarter than me can make use of this format. Here are a couple examples of SDIF with other frame types.
Unfortunately, the common tools I use to explore AV formats don’t seem to work on this format. MediaInfo, FFProbe, Exiftool, all give me unknown file warnings. So I had to compile the SDIF software in order to get some details.
querysdif angry_cat.part.sdif Header info of file angry_cat.part.sdif:
Data in file angry_cat.part.sdif (9504872 bytes): 1933 1TRC frames in stream 0 between time 0.000000 and 5.794875 containing 1933 1TRC matrices with 45 --400 rows, 4 -- 4 columns
An interesting thing is that a SDIF file can be in text form as well.
An interesting format for sure. But wait, there is more!
My initial interest in this format was when I was given access to a set of MUBU files. I was unclear on how there were created at first and it took me down a long path of learning about SDIF and the Max software from Cycling ’74 and IRCAM. MUBU turns out to be a toolbox for Max which adds more analysis features.
MUBU stands for MUlti-BUffer, which helps overcome some limitations. It is actually a container using the SDIF standard. Lets take a look.
hexdump -C test.mubu | head 00000000 53 44 49 46 00 00 00 08 00 00 00 03 00 00 00 01 |SDIF............| 00000010 31 4e 56 54 00 00 00 78 ff ef ff ff ff ff ff ff |1NVT...x........| 00000020 ff ff ff fd 00 00 00 01 31 4e 56 54 00 00 03 01 |........1NVT....| 00000030 00 00 00 53 00 00 00 01 4d 75 42 75 2e 43 6f 6e |...S....MuBu.Con| 00000040 74 61 69 6e 65 72 2e 4e 75 6d 54 72 61 63 6b 73 |tainer.NumTracks| 00000050 09 31 0a 4d 75 42 75 2e 43 6f 6e 74 61 69 6e 65 |.1.MuBu.Containe| 00000060 72 2e 56 65 72 73 69 6f 6e 09 31 2e 35 0a 4d 75 |r.Version.1.5.Mu| 00000070 42 75 2e 43 6f 6e 74 61 69 6e 65 72 2e 4e 75 6d |Bu.Container.Num| 00000080 42 75 66 66 65 72 73 09 31 0a 00 00 00 00 00 00 |Buffers.1.......| 00000090 31 4e 56 54 00 00 00 38 ff ef ff ff ff ff ff ff |1NVT...8........|
A MUBU file has the same SDIF frame header, but also include a “1NVT” frame, which is a Name Value Table. This is where the MUBU container is referenced. The MuBu file has its own structure:
If I query the MuBu file like I did the SDIF, I get the following:
querysdif test.mubu Header info of file test.mubu:
Data in file test.mubu (3741392 bytes): 77929 M000 frames in stream 0 between time 0.000000 and 1.623500 containing 77929 M000 matrices with 2 -- 2 rows, 1 -- 1 columns
The MuBu file contains one audio track and one buffer. This is a simple test file, but MuBu files can be quite large with multiple tracks.
Working with the Max software or OpenMusic is not something I found to be easy to understand. I am sure if I was more musically inclined and with a little practice I could make some of this work. For the time being, a signature to identify a SDIF and MUBU will have to do. Check out the GitHub for my proposed signature and a couple examples.
The 1990’s was an amazing time for multimedia. Compared to what is possible today, the graphics were more simple but there were many software titles leading the charge in Animation. Macromedia Director, along with Flash, dominated the interactive multimedia market for quite some time. Eventually being picked up by Adobe and discontinued in 2013. Quite a few multimedia disc’s out there were built using Director.
Competing with Director, another company had a strong product. Motion Works International was an early pioneer in the multimedia CD-ROM scene. Rumor has it, Motion Works was started by a 12 year old. Motion Works had been making software for use with the highly successful HyperCard software since 1988. In 1992 they released the successor to their ADDmotion software, a path based animation tool called PROmotion.
PROmotion was used with with many Multimedia titles, some in cooperation with the Corel Home series. In addition to commercial titles PROmotion was a great tool for the creation of animation clips and other marketing material. I came across some stand-alone marketing files for old scriptwriting software called ScriptWare. When I unarchived the HQX file and Installed the Demo, I was presented with a set of files with the .MW extension.
Looking at the files in the directory with their extended attributes I can see all the .MW files have no data fork (0 bytes), only a resource fork. This is common for any Application on the MacOS systems prior to MacOS X. At first the MW extension made me thing of MacWrite, but launching one of these MW files brought up an interactive menu. The type being APPL, which is Application.
What I thought would be a demo of the application Scriptware was actually interactive animations demonstrating the software. By dumping the resource fork of one of the MW files I found some information which helped me know what software created these interactive demos.
Makes sense, MW stood for “Motion Works”. ADDmotion was another software title developed by Motion Works, most will remember it as an add-on for Hypercard for adding animation to stacks. These MW files are created using PROmotion and exporting them as a stand-alone animation which includes the “AM Viewer” built in. A regular PROmotion file, however, did not include a viewer and requires the software in order to open and run.
The files do have a Type/Creator code of “ADDm”, but with no data fork, identification through standard means is not possible. They also do not have the “vers” string to help identify them within the Resource Fork. Since standard methods of identification are impossible, I hope in the future there will be more tools available to read the Type/Creator codes while on the Mac, or in a disk image, or within a container and return back the Software which created the file and the file type.
The products from Motion Works where significantly cheaper than animation tools such as Director, but were still pretty powerful for its day. I was surprised when I found the company didn’t last much longer than 1998 before disappearing. There are probably many stories like PROmotion, coming onto the scene with new and exciting features before thought impossible only to die out as other tools dominate the market.
If you are interested in looking at the files yourself, here is a link to some original files, and the same files encoded in MacBinary.