File extensions are the easiest way to quickly identify a file format, but they can be misleading. This is the reason in Digital Preservation format identification tools like DROID are important to look closer at the file structure to more accurately identify formats. The other complication is some extensions are used for more than one format. Extensions like .DOC or .ISO can be used with many formats.
The PRONOM registry which DROID uses will list extensions associated with each format signature, but for some, they only have an extension and no signature. It’s nice to have an official ID to go with a format but with no signature it only matches based on extension.
This caused a problem awhile back for me while working with some files with the extension CDX. Which according to PRONOM, there are 5 completely different formats which use the extension, and probably others.
My CDX was related to some indexing software called Cindex. At the time the only format with a signature was for the WARC summary file CDX. The other was for a CorelDraw Compressed format with no signature. Confusing right? When I would run format identification on my Cindex files, they would default to the CorelDraw Compressed format, identified by extension. It was easy enough to create a signature for the Cindex format as I had enough samples to know the patterns needed for correct identification. But I was curious about the CorelDraw format. Should be easy to find, right?
Wrong. Finding a sample of this format was very elusive. All I had to go by was the name given to the format by PRONOM and the extension. I scoured every Corel CD and image I could get my hands on. For months I looked and could never find a single CDX file. Each CorelDraw software I was able to run did not have any ability to save in the CDX format. I scoured clipart discs, other Corel software, like Designer, PrintHouse, Photo-Paint, nada, nothing. I started to wonder if the format even existed. That’s when I noticed in the filters included with CorelDraw a reference to the ability to import a CDX but not write to one.
This led to me finding a reference on the old Corel FTP site for knowledge base number 4550.
It mentioned something called ArtShow, where version 5 supported the file format CDX. ArtShow was a gallery of winning designs released on a CD-ROM and book each year. The first one being ArtShow 91, then ArtShow 3, 4, 5, 6, and finally 7 was the last. Each one released used a different proprietary compressed format for storing all the designs, these formats exist nowhere else. The question remains, why didn’t they use other popular Corel formats like CDR, CMX, or CCX which were used on many other clip art titles.
It took some time but I was finally able to find copies of a few of the Artshow CD-ROM discs, especially numbers 5 & 6. Which had the CDX format and the second generation CPX formats.
Each format had a easy to recognize header making a PRONOM signature easy to create. PRONOM already had the PUID for the two formats CDX & CPX, so sending in the signature added to the registry and hopefully will help distinguish between all the CDX formats!
Sony’s IC Recorders have been a popular small digital voice recorder for many consumers. The current models all use common recording formats like Linear PCM WAVE files or MP3, but it wasn’t always so. One of the first models ICD-R100 would record to the ICS audio format, which was Sony’s original sound formats used on the IC Recorders. I am still looking for samples of this format. If you do have a need to convert this format, Sony has free converter software.
The next generation of IC Recorders used a Memory Stick and therefore recorded audio to the MSV (Memory Stick Voice) format. There were actually two different types of MSV files, the first used the ADPCM codec and the next used the LPEC codec. Later IC Recorders would record to the DVF (Digital Voice Format) which also had a couple versions, one using the LPEC codec and the other the older TRC codec.
AFAIK, none of the codecs used in these file formats has been made public and these formats are not readable by tools such as MediaInfo. The only way to know details of a file and have the ability to play or convert is to use Sony software which has been discontinued and the replacement, Sound Organizer, can only recognize the LPEC codec versions of MSV and DVF. There is also a plugin for Windows Media Player available here, which is required even for Switch to work.
PRONOM currently has one signature for the LPEC versions of MSV and DVF, so lets look closer at the formats and see if we can determine what they are from the header.
The current software for managing audio files from IC Recorder is Sound Organizer. The software does open and convert some MSV/DVF files as long as they use the LPEC codec. Sound Organizer Compatible formats.
Also note, Sony made one ICD-CX series recorder which could also capture photos. It requires the Visual & Voice Player software. Audio is recorded in the DVF format.
Test Data Set
In order to explore the different formats I first needed to gather some samples. There are a few out there, but with the Digital Voice Editor 3 software, I was able to take a sample file and convert it to the many options available. You can see in the screenshot below, the different samples, their extension and the codec used. You can find my samples in GitHub here.
All MSV and DVF file have a similar pattern. The first 32 bytes have the text string “MS_VOICE SONY CORPORATION”. In between MS_VOICE and SONY, there is 4 bytes which vary slightly between the different formats. Here is a table of samples and the 4 bytes so we can see the differences.
Model
CODEC
EXTENSION
Hex Values
ICD-Px0
TRC
DVF
01020000
ICD-Px8
TRC
DVF
01020000
ICD-Px7
TRC
DVF
01020000
ICD-SXxx0
LPEC
MSV
01030000
ICD-SXx8
LPEC
MSV
01030000
ICD-SXx7
LPEC
MSV
01030000
ICD-SXx6
LPEC
DVF
01020000
ICD-SXx5
LPEC
DVF
01020000
ICD-SXx0
LPEC
DVF
01020000
ICD-MX
LPEC
MSV
01020000
ICD-BM
LPEC
MSV
01020000
ICD-ST
LPEC
DVF
01020000
ICD-MS5xx
LPEC
MSV
01010000
ICD-S
LPEC
MSV
01010000
ICD-BPx50
LPEC
DVF
01010000
ICD-BP100/x20
LPEC
DVF
01010000
ICD-MS1/MS2
ADPCM
MSV
01000000
ICD-R100/R200
Unknown
ICS
There is an obvious pattern to the hex values as they increment 0100, 0101, 0102, and 0103. But there is some overlap between extension and codec, so probably more of a version number than specific to the codec. Currently the PRONOM signature for this format fmt/472, has the pattern for the 0102 version, but none of the others. We could simply add a variable in the signature for the different values and update the PRONOM signature so more samples would be identified. This would work well if there was a secondary characterization process to get technical metadata such as the codec and quality, but I am unaware of any tool to gather this information from the format, so I wonder if we can find any hints in the file to identify the codec so we have multiple PRONOM signatures to choose from. Also, you can see from the screenshot above that some of the LPEC formats have specific model numbers in the codec column, which could mean they may not be exactly the same. Each IC Recorder model has different quality settings and it appears, some settings may not be compatible with other models.
Looking beyond the first 16 bytes there is a lot of hex values which are unknown. A close comparison of all the samples leads me to the 4 bytes at offset 60. They seem to be the same for files with the same settings. Below is a chart of those values.
Extension
CODEC
Quality
Offset 60
DVF
TRC
HQ
00300001
DVF
TRC
SP
00350001
DVF
TRC
LP
00370001
DVF
LPEC (ICD-BP-100/x20)
SP
00150001
DVF
LPEC (ICD-BP-100/x20)
LP
00190001
DVF
LPEC
SP
002A0001
DVF
LPEC
LP
002C0001
MSV
LPEC (ICD-BM/MX/SXx7/SXx8/SXxx0)
SP
004A0001
MSV
LPEC (ICD-BM/MX/SXx7/SXx8/SXxx0)
LP
004C0001
MSV/DVF
LPEC (ICD-SXx7/SXx8/SXxx0)
STHQ
00200002
MSV/DVF
LPEC (ICD-SXx7/SXx8/SXxx0)
ST
00240002
MSV
ADPCM
SP
00050001
MSV
ADPCM
LP
00090001
Just to be sure this value at offset 60 was indeed an indication of codec and quality I manually switch out the 4 bytes from a LPEC ST file for a TRC HQ file. Sure enough, the software now saw the file as a TRC HQ audio file, even though the original is a Stereo file.
There is a very good chance this is not all the options. I only have one physical recorder which only records in Mono. But this gives us a really good idea of how to tell the difference between files. Below are the patterns I am submitting to PRONOM.
This is one example of a file format which has a proprietary component which was never released from the vendor. When the vendor stopped supporting the software to open and read these formats, the risk increased for long-term preservation. It would be really nice when a vendor discontinues a technology, which was used by consumers, they would make the documentation for the format openly available. If you know more about the format, please reach out or if you have samples which don’t match the patterns mentioned here.