Image compression has been around for a while. It seems everyone took a crack at making better algorithms to improve quality and size. Some chose to invent new ways and others chose to use existing methods but with their own flair. Kodak tried this with their PhotoCD, but there were a couple of other photo processing options that popped up in the 90s. One was Seattle FilmWorks and another was Konica PC PictureShow, both of which used “proprietary” formats to deliver developed film on disk.
Seattle FilmWorks, later called PhotoWorks, used an image format with the extension SFW, which was based on BMP and JPG but with their own twist. The same goes for the format used by Konica’s PC PictureShow.
If you took your film in to be developed at one of Konica’s photo labs, you could have those images put on a diskette or, later, a CD-R. The disks came with software to view your photos called PC PictureShow. The images stored on disk were in another proprietary format with the extension KQP. The KQP format was actually licensed from another company called Pegasus Imaging Corporation, later known as Accusoft. They developed their own way to compress a JPEG file, which they called ePic. An SDK called PICTools was offered for many years, but seems not to be available anymore.
ePIC (Proprietary)
Supports PIC format compression, replacing the JPEG Huffman encoder with the proprietary ELS entropy encoder for 15% more compression.
Can be losslessly converted back to JPEG format using Op_RORE.
A search on the internet for Konica KQP shows quite a few people over the years wondering what to do with their old disks and how to convert the old format to JPG, only to find a lack of information and available tools to do so. One such person used Python to edit the file and make it renderable as a JPG. While the method worked well for their KQP files, it might not work for all of them. Let’s look closer and understand why.
At first glance the file appears to be a Bitmap (BMP), and it does have a Bitmap header claiming JPEG compression, but things change if we look a little further into the file.
identify -verbose Sample.PIC
identify: length and filesize do not match `Sample.PIC' @ error/bmp.c/ReadBMPImage/950.
identify: unrecognized compression `Sample.PIC' @ error/bmp.c/ReadBMPImage/1019.
We find a JPG marker. In fact, almost the whole JPG file is included, except the quantization tables for luminance and chrominance which are needed to properly display the image. This is the area the Pegasus company thought they could encode better to further compress the image. Their method was to use a new algorithm called ELS (Entropy Logarithmic-Scale). This new method was used by the PICTools software to make a Pegasus PIC file, while Konica used it for their KQP format. They are identical. By choosing the luminance and chrominance values during compression, you could make a highly compressed image, but one that required specific software to render.
Pegasus also made use of a special custom APP marker (PIC) within the JPEG structure of the PIC/KQP and also of any JPG compressed using their software. This marker, which takes up around 8 bytes, holds the luminance and chrominance values. Take the above sample, for instance: it compresses the image with a Luminance of 25 and a Chrominance of 30. These are integer values, and in hex they would be “19” and “1E” respectively.
So in theory one could strip out any part of the file before the JPG beginning of file magic bytes (FF D8 FF E0), locate the APP marker, use the values to generate the two quantization tables, insert them in the appropriate spot and save out a JPG file.
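The first two steps of that idea can be sketched in Python. This is only a rough sketch under stated assumptions: the function names are mine, the segment walk follows the standard JPEG marker layout (FF, marker byte, two-byte big-endian length), and it does not attempt to rebuild the quantization tables, since how the Luminance and Chrominance values map to table entries is not something I know.

```python
JPEG_MAGIC = bytes.fromhex("FFD8FFE0")  # the magic bytes named in the text


def strip_to_jpeg(data: bytes) -> bytes:
    """Drop everything before the first FF D8 FF E0 (e.g. the BMP wrapper)."""
    offset = data.find(JPEG_MAGIC)
    if offset == -1:
        raise ValueError("no JPEG start marker found")
    return data[offset:]


def app_segments(jpeg: bytes):
    """Yield (marker_byte, payload) for each APPn segment before the scan data."""
    pos = 2  # skip the FF D8 start-of-image marker
    while pos + 4 <= len(jpeg) and jpeg[pos] == 0xFF:
        marker = jpeg[pos + 1]
        if marker == 0xDA:  # start of scan, no more header segments
            break
        # The length field counts itself (2 bytes) plus the payload.
        length = int.from_bytes(jpeg[pos + 2:pos + 4], "big")
        if 0xE0 <= marker <= 0xEF:  # APP0 through APP15
            yield marker, jpeg[pos + 4:pos + 2 + length]
        pos += 2 + length
```

Printing the first few bytes of each payload returned by `app_segments(strip_to_jpeg(data))` should make the Pegasus segment easy to spot, assuming it is stored as an ordinary APPn segment.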
This may be the case for the first few versions of the ePic format, but later versions got more complicated. It seems a “PIC2” version replaced the earlier versions.
Instead of the Bitmap (BMP) header, a proprietary PIC2 header is used, still containing a JPG in the JFIF format along with the PIC APP marker, but encoded in a way that the simple method of adding a quantization table may not work. With the original format the JPG and the PIC/KQP were approximately the same size, whereas this new version significantly reduces the size of the PIC/KQP in comparison with the JPG.
The ELS compression technology used in the ePic format seems to be patented by Pegasus and Accusoft, but is not entirely hidden, as the libavcodec library includes an ELS decoder. Might be a fun project to use the code to decode the PIC/KQP formats fully.
In the meantime, a signature identifying the two versions should be added to PRONOM. Check out my proposal on my GitHub. If you need to convert your KQP or PIC files back to JPG here are a few links:
One of my favorite legacy formats to explore is any type of multimedia CD-ROM. The 1990s and early 2000s were filled with all sorts of multimedia for CD, Web, and Television. It is also one of the most difficult formats to try and preserve for the future. Many CD-ROMs are filled with executables and/or Macromedia Director media, with later ones having Flash content. The operating systems and security needs today make playback almost impossible. For this reason many have built emulation services to mimic the original operating system and software, allowing the many historic multimedia CD-ROMs to once again interact with the user in a way many current systems still struggle with.
Many CD-ROMs would come as hybrid discs, allowing them to be used on both Windows and Macintosh systems, sometimes providing two different experiences. Then there were CD-Extra or Enhanced CDs, which added a separate session to an Audio CD containing bonus content playable only on a computer.
For fun I took a look back at some of my older Audio CD titles. I came across a couple, one claiming to be a “CD-Extra” and another an “Enhanced CD“. The CD-Extra disc when queried with cd-info claimed to have 12 tracks, with the 12th being a data XA track.
Disc mode is listed as: CD-ROM Mixed
CD-ROM Track List (1 - 12)
  #: MSF       LSN    Type   Green? Copy? Channels Premphasis?
  1: 00:02:00  000000 audio  false  no    2        no
  2: 02:13:66  009891 audio  false  no    2        no
  3: 05:21:28  023953 audio  false  no    2        no
  4: 08:18:19  037219 audio  false  no    2        no
  5: 12:28:37  055987 audio  false  no    2        no
  6: 16:11:58  072733 audio  false  no    2        no
  7: 19:21:56  086981 audio  false  no    2        no
  8: 23:17:49  104674 audio  false  no    2        no
  9: 26:01:17  116942 audio  false  no    2        no
 10: 28:30:02  128102 audio  false  no    2        no
 11: 31:07:70  139945 audio  false  no    2        no
 12: 37:29:46  168571 XA     true   no
170: 51:35:07  231982 leadout (520 MB raw, 516 MB formatted)
CD Analysis Report
CD-Plus/Extra session #2 starts at track 12, LSN: 168571
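The MSF and LSN columns in this listing describe the same position in two ways. As a small worked example (the function name is mine), an MSF value is minutes:seconds:frames at 75 frames per second, and the LSN is that frame count minus the 150-frame (two second) pregap:

```python
FRAMES_PER_SECOND = 75
PREGAP_FRAMES = 150  # the first two seconds of the disc sit before LSN 0


def msf_to_lsn(msf: str) -> int:
    """Convert an 'MM:SS:FF' position into a logical sector number."""
    minutes, seconds, frames = (int(part) for part in msf.split(":"))
    return (minutes * 60 + seconds) * FRAMES_PER_SECOND + frames - PREGAP_FRAMES


print(msf_to_lsn("00:02:00"))  # 0, the start of track 1
print(msf_to_lsn("37:29:46"))  # 168571, the start of the XA data track
```

The second value matches the LSN that cd-info reports for the start of the CD-Plus/Extra session.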
Mounting the 12th track showed a mix of Macromedia Director (.DIR) files and quite a few Quicktime MOV movies. Playback was not possible on my current computer so I had to resort to using an emulator to experience this bonus content, full of band member photos and biographies.
The other disc I pulled out to explore was a bit different. Using cd-info the disc looked very similar:
Disc mode is listed as: CD-ROM Mixed
CD-ROM Track List (1 - 13)
  #: MSF       LSN    Type   Green? Copy? Channels Premphasis?
  1: 00:02:00  000000 audio  false  no    2        no
  2: 04:20:08  019358 audio  false  no    2        no
  3: 08:04:27  036177 audio  false  no    2        no
  4: 11:15:62  050537 audio  false  no    2        no
  5: 14:54:32  066932 audio  false  no    2        no
  6: 19:57:73  089698 audio  false  no    2        no
  7: 26:12:36  117786 audio  false  no    2        no
  8: 29:51:59  134234 audio  false  no    2        no
  9: 34:44:00  156150 audio  false  no    2        no
 10: 39:36:62  178112 audio  false  no    2        no
 11: 42:06:01  189301 audio  false  no    2        no
 12: 45:42:26  205526 audio  false  no    2        no
 13: 57:10:54  257154 XA     true   no
170: 72:56:67  328117 leadout (735 MB raw, 730 MB formatted)
CD Analysis Report
CD-Plus/Extra session #2 starts at track 13, LSN: 257154
The discs, even though they were labeled CD-Extra and Enhanced CD, had the same structure and format. The difference was in the type of multimedia used. On the second disc there was a simple application which launched Quicktime and loaded a single MOV movie. But this was not your regular Quicktime movie; it was a highly complex interactive Quicktime movie.
The Quicktime movie could only be launched from an older operating system using Quicktime 6, and on the Macintosh, only a PPC CPU. The movie would launch with an interactive menu, allowing navigation as you might find on a DVD or Flash website, but all within a single MOV file. When I ran MediaInfo on the MOV file I got back quite a few tracks:
Ten video tracks and 51 other tracks. Exploring with Quicktime, I could see the entire list of embedded content:
Quicktime movies, an audio track, dozens of Flash files, photos, animations, and sprites, with the possibility of more. These types of Quicktime files had specific requirements in order to run, with Quicktime 6 being the last version which could play back all the content correctly. Current versions of Quicktime give a warning about the lack of compatibility.
This Interactive Quicktime movie proudly claims “Made with LiveStage Pro”, which was an authoring environment for Quicktime made by Totally Hip Software Inc. It started in 1995, but seemed to disappear after 2004 with no new development, and by 2014 the website went offline.
If you would like to see a couple of simple examples created by Apple, see here.
LiveStage Pro was a very powerful authoring tool in its time. Another similar tool called Electrifier competed for the interactive Quicktime market. Adobe GoLive also competed, but offered fewer features. The final Quicktime movie exported from LiveStage Pro was the main component, but the software did save a project format with the extension “LSD”. Versions 2 through 4 of LiveStage Pro had a similar header.
All the samples from version 2 through 4 have the first four bytes as “LSAF“. It also seems the next four bytes may be version related. Version 1 however has a different header.
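A check for that header could look something like the sketch below. The function name is hypothetical, and because the meaning of the second group of four bytes is only a guess, they are returned as raw hex rather than interpreted.

```python
def livestage_header(data: bytes):
    """Return the four bytes after 'LSAF' as hex, or None if the magic is absent."""
    if len(data) < 8 or data[:4] != b"LSAF":
        return None  # not a version 2-4 project (version 1 uses a different header)
    return data[4:8].hex()
```

Comparing the returned value across projects saved by different LiveStage Pro versions would be one way to test whether those bytes really are version related.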
Identification of a LiveStage project should be simple enough, but identifying and rendering back a Quicktime movie made by this software takes some work. In fact there are many Enhanced CD and CD-Extra titles out there with quite a few system requirements. If we are not careful, many of these little gems might get more difficult to experience or be lost completely.
If you would like to explore the Quicktime Movie from the Enhanced CD mentioned here, send me a message. You can also take a look at my signature proposal and sample files on my GitHub for LiveStage.
I have used and researched a lot of audio editing software. Some are very simple and straightforward, others are feature rich and take some time to learn. While looking into a format, I came across some audio software which is nothing like anything I have used before. At first I was confused. I figured it would be simple to open a certain file format and play the audio. Not so fast.
Max is software which proudly says it is an “infinitely flexible space to create your own interactive software”. Created by Cycling ’74 software, Max has been around for a while, having been developed in the mid 1980s. It allows the user to make “patches”, stringing together components and effects to accomplish an endless number of options and outcomes.
The software produces simple project files and patch files, but they are just JSON data, at least in the latest version. But when working with audio files the software can save to a number of formats.
One of the options is a format called “SDIF”, which stands for “Sound Description Interchange Format“. SDIF was jointly developed by IRCAM and CNMAT, with proposals starting back in the mid-1990’s. Originally written as a Spectral Description, it was later changed to refer to a Sound Description.
The Specification states the general idea was to “store information related to signal processing and specifically of sound, in files, according to a common format to all data types. Thus, it is possible to store results or parameters of analyses, syntheses…” So not exactly the same as a simple WAVE file you can open and edit, this format was meant to store signal data for analysis.
Each SDIF file consists of a header and then a succession of frames, not unlike chunks in the IFF/AIFF/RIFF formats, ordered in time. Each frame matrix declares a “Type” which can be a combination of many options. Let’s take a look at an SDIF file:
This test file has the opening frame “SDIF“, to identify it as an SDIF, then a reference to the type “1TRC“. I would try and explain a Matrix 1TRC Sinusoidal Track, but I have no idea what it means. Something, something sine wave, etc. Someone much smarter than me can make use of this format. Here are a couple examples of SDIF with other frame types.
Unfortunately, the common tools I use to explore AV formats don’t seem to work on this format. MediaInfo, FFProbe, Exiftool, all give me unknown file warnings. So I had to compile the SDIF software in order to get some details.
querysdif angry_cat.part.sdif
Header info of file angry_cat.part.sdif:
Data in file angry_cat.part.sdif (9504872 bytes):
1933 1TRC frames in stream 0 between time 0.000000 and 5.794875 containing
   1933 1TRC matrices with 45 --400 rows, 4 -- 4 columns
An interesting thing is that an SDIF file can be in text form as well.
An interesting format for sure. But wait, there is more!
My initial interest in this format was when I was given access to a set of MUBU files. I was unclear on how they were created at first, and it took me down a long path of learning about SDIF and the Max software from Cycling ’74 and IRCAM. MUBU turns out to be a toolbox for Max which adds more analysis features.
MUBU stands for MUlti-BUffer, which helps overcome some limitations. It is actually a container using the SDIF standard. Let’s take a look.
hexdump -C test.mubu | head
00000000  53 44 49 46 00 00 00 08 00 00 00 03 00 00 00 01  |SDIF............|
00000010  31 4e 56 54 00 00 00 78 ff ef ff ff ff ff ff ff  |1NVT...x........|
00000020  ff ff ff fd 00 00 00 01 31 4e 56 54 00 00 03 01  |........1NVT....|
00000030  00 00 00 53 00 00 00 01 4d 75 42 75 2e 43 6f 6e  |...S....MuBu.Con|
00000040  74 61 69 6e 65 72 2e 4e 75 6d 54 72 61 63 6b 73  |tainer.NumTracks|
00000050  09 31 0a 4d 75 42 75 2e 43 6f 6e 74 61 69 6e 65  |.1.MuBu.Containe|
00000060  72 2e 56 65 72 73 69 6f 6e 09 31 2e 35 0a 4d 75  |r.Version.1.5.Mu|
00000070  42 75 2e 43 6f 6e 74 61 69 6e 65 72 2e 4e 75 6d  |Bu.Container.Num|
00000080  42 75 66 66 65 72 73 09 31 0a 00 00 00 00 00 00  |Buffers.1.......|
00000090  31 4e 56 54 00 00 00 38 ff ef ff ff ff ff ff ff  |1NVT...8........|
A MUBU file has the same SDIF frame header, but also includes a “1NVT” frame, which is a Name Value Table. This is where the MUBU container is referenced. The MuBu file has its own structure:
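To make the hexdump a little less abstract, here is a minimal Python sketch that pulls the name/value pairs out of that first 1NVT frame. The layout it assumes (a 16-byte file header, a 24-byte frame header of signature, size, time, stream ID and matrix count, then a 16-byte matrix header of signature, data type, rows and columns, all big-endian) is my reading of the dump above and of the SDIF structure, not something taken from the MuBu code, and the function name is my own.

```python
import struct


def parse_first_nvt(data: bytes) -> dict[str, str]:
    """Return the name/value pairs from the 1NVT frame that follows the SDIF header."""
    # File header: "SDIF", header size, spec version, types version.
    magic, header_size, _spec, _types = struct.unpack_from(">4s3i", data, 0)
    if magic != b"SDIF":
        raise ValueError("not an SDIF file")
    pos = 8 + header_size  # the size excludes the signature and size fields

    # Frame header: signature, frame size, time (float64), stream ID, matrix count.
    frame_sig, _size, _time, _stream, _count = struct.unpack_from(">4sidii", data, pos)
    if frame_sig != b"1NVT":
        raise ValueError("first frame is not a 1NVT frame")
    pos += 24

    # Matrix header: signature, data type, rows, columns.
    _sig, _dtype, rows, cols = struct.unpack_from(">4s3i", data, pos)
    pos += 16

    # Assumption: the table is one byte per element, stored as "name<TAB>value" lines.
    text = data[pos:pos + rows * cols].rstrip(b"\x00").decode("ascii")
    return dict(line.split("\t", 1) for line in text.splitlines() if line)
```

Run against the bytes shown in the dump, this should return the three MuBu.Container entries (NumTracks, Version and NumBuffers) as a dictionary.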
If I query the MuBu file like I did the SDIF, I get the following:
querysdif test.mubu
Header info of file test.mubu:
Data in file test.mubu (3741392 bytes):
77929 M000 frames in stream 0 between time 0.000000 and 1.623500 containing
   77929 M000 matrices with 2 -- 2 rows, 1 -- 1 columns
The MuBu file contains one audio track and one buffer. This is a simple test file, but MuBu files can be quite large with multiple tracks.
Working with the Max software or OpenMusic is not something I found to be easy to understand. I am sure if I were more musically inclined, and with a little practice, I could make some of this work. For the time being, a signature to identify SDIF and MUBU files will have to do. Check out the GitHub for my proposed signature and a couple of examples.
I was recently going through some of my old CD-R’s and came across this 11 year old fun memory.
I remember going to this 2003 Toad the Wet Sprocket concert in Salt Lake City with some friends. I had seen this band perform before, but this was the first time I was able to get a recording of the show. Normally having a recording of a concert by a well known band was a little shady, but some bands not only allow recording of their live concerts, they encourage it. There have been a few bands over the years who have this philosophy. One most have heard of is the Grateful Dead; because of all the tape trading, the band’s numerous concerts will live on forever.
The scene of recording concerts is still alive and well, and if you are into recording and sharing it is expected you share in a lossless audio format. The world of lossless audio is definitely in the minority of all those who listen to music on the daily. Most of us have been placated with the infinite playlists on services like Apple Music, Spotify, and Amazon Music. Most probably don’t care about owning music anymore, but for the few who consider themselves Audiophiles, having a lossless audio file is the only choice.
When it comes to formats, there are a few lossless formats to choose from, and they all come with some advantages as well as some downsides. WAV files contain the full PCM audio stream, and while internet bandwidth today can handle full uncompressed audio, it can still be beneficial to use some compression for archiving or sharing over the web.
The most common lossless format today is the Free Lossless Audio Codec or FLAC, but there are also quite a few who like the Apple Lossless Audio Codec. Both offer many advantages, especially with metadata, cuesheets, and can contain cover album art. But many years ago another lossless format was most often used with bootleg recordings and audio sharing.
Shorten was one of the first lossless formats, developed by Tony Robinson in 1993 for SoftSound. It could cut the size of a typical 16-bit WAV file in half. It achieved this by predicting each sample from the previous ones and storing only the small prediction error with Huffman coding, the same kind of entropy coding a JPEG uses, which gives the most frequent values the shortest codes. Today FLAC and ALAC have replaced this format and offer improved features and support. Many audio players have dropped support for Shorten, making it difficult to use this old format.
The Shorten format uses the .SHN extension. It is one of the formats listed on the Library of Congress Sustainability of Digital Formats with the ID fdd000199, although a couple of links don’t appear to work as it hasn’t been updated since 2011. Support was ended for this format and many of the links found on various websites are broken, usually referencing the etree wiki, much of which is archived on the Internet Archive.
Let’s take a look at what makes up a lossless compressed SHN file. A quick look at a sample header:
The first four bytes seem to be consistent among my samples. It makes me wonder if the ascii values have something to do with the author, Anthony (Tony) J. Robinson. In the source code for the shorten software, the file shorten.h defines the ascii “ajkg” as the magic header for the SHN format, and the same value is found in current ffmpeg code. The tools don’t have much to say about these files, though.
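A quick way to test a file for that magic value is sketched below. Treating the byte right after “ajkg” as the format version is my assumption, based on how I understand the ffmpeg decoder to read the header; the function name is hypothetical.

```python
SHORTEN_MAGIC = b"ajkg"


def shorten_version(data: bytes):
    """Return the byte following the 'ajkg' magic, or None if the magic is absent."""
    if len(data) < 5 or data[:4] != SHORTEN_MAGIC:
        return None
    return data[4]  # assumed to be the Shorten format version
```

If the assumption holds, a file like the test.shn sample used here would return 2, agreeing with the format version MediaInfo reports.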
mediainfo test.shn
General
Complete name : test.shn
Format : Shorten
Format version : 2
File size : 3.17 MiB
Audio
Format : Shorten
Compression mode : Lossless
ffprobe -i test.shn
Input #0, shn, from 'test.shn':
Duration: N/A, start: 0.000000, bitrate: N/A
Stream #0:0: Audio: shorten, 44100 Hz, 2 channels, s16p
Using the older SHNTOOL, we can get more information.
shntool info test.shn
-------------------------------------------------------------------------------
File name: test.shn
Handled by: shn format module
Length: 0:32.23
WAVE format: 0x0001 (Microsoft PCM)
Channels: 2
Bits/sample: 16
Samples/sec: 44100
Average bytes/sec: 176400
Rate (calculated): 176400
Block align: 4
Header size: 44 bytes
Data size: 5697720 bytes
Chunk size: 5697756 bytes
Total size (chunk size + 8): 5697764 bytes
Actual file size: 3325489
File is compressed: yes
Compression ratio: 0.5836
CD-quality properties:
CD quality: yes
Cut on sector boundary: no
Sector misalignment: 1176 bytes
Long enough to be burned: yes
WAVE properties:
Non-canonical header: no
Extra RIFF chunks: no
Possible problems:
File contains ID3v2 tag: no
Data chunk block-aligned: yes
Inconsistent header: no
File probably truncated: unknown
Junk appended to file: unknown
Odd data size has pad byte: n/a
Extra shn-specific info:
Seekable: yes
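Several of the numbers in that report can be reproduced from one another, which is a handy sanity check. The short sketch below redoes the arithmetic using only values from the shntool output, plus the 2352-byte size of an audio CD sector:

```python
data_size = 5_697_720         # "Data size"
header_size = 44              # "Header size"
actual_file_size = 3_325_489  # "Actual file size"
CD_SECTOR_BYTES = 2352        # 588 stereo 16-bit samples per CD sector

chunk_size = data_size + header_size - 8          # RIFF size excludes its own 8-byte header
total_size = chunk_size + 8                       # size of the equivalent WAV file
ratio = round(actual_file_size / total_size, 4)   # compressed size over WAV size
whole_sectors, misalignment = divmod(data_size, CD_SECTOR_BYTES)

print(chunk_size, total_size)  # 5697756 5697764
print(ratio)                   # 0.5836
print(misalignment)            # 1176
```

So the compression ratio is simply the SHN size divided by the size of the WAV it decodes to, and the sector misalignment is what is left over after the audio data is divided into whole CD sectors.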
Many Shorten Audio Files are found out there in archives and file sharing sites, so even though the format isn’t used to create new files, it will still be around for a while. My GitHub has my signature proposal and a couple of samples.
When it comes to design software there were many options over the years, many being released with a lot of hype and others disappearing not long after they released. There are few which lasted long enough to not be gobbled up by big names such as Adobe. One of those is Canvas by Deneba Systems.
First released in 1987, it is still available over at Canvas GFX. It’s amazing it was never bought by one of the big names (Adobe, Corel, Aldus, etc.). It remained under Deneba Systems until 2003, when it was bought by ACD Systems, but kept the name Deneba Canvas for a time. The later versions were not popular with everyone, and Mac support was dropped, but the software continued. A while back I was looking through a few of my old ZIP disks and found some software my father used in the mid 1980s. He had a copy of Canvas version 2 for Macintosh. At that time I was more interested in playing games on our family’s Macintosh 128k than using design software.
Over the years I have come across many Canvas documents. With each version released, changes were made to the file format used to store the drawings and artwork, as well as to the extensions used. Some are easily identifiable and others have some confusing structures. Let’s look into it.
Version               | Platform      | Extension | Description
Canvas 1-3 & artWORKS | Macintosh     | none      | no strong pattern
Canvas 3.5            | Mac & Windows | CVS       | Similar to v1-3
Canvas 5              | Mac & Windows | CV5       | CANVAS5 string
Canvas 6-8            | Mac & Windows | CNV       | CANVAS6 string
Canvas 9-X            | Mac & Windows | CVX       | Similar to 6-8
Canvas Draw           | Mac           | CVD       | Different than others
Canvas Image File     |               | CVI       | DAD5PROX
The first three versions of Canvas were Macintosh only and in those early days there was no extension, just a Type / Creator indicating to the Finder how to open them. Deneba Systems used the Creator codes DAD2, DAD5, through DADX.
The first versions are quite frustrating. I have gathered samples from Version 2, 3, 3.5 and artWORKS version 1. Even with numerous samples, there are no patterns I can discern from them. I even reached out to the current CanvasX technical support for answers. They wanted to be helpful, but their answers didn’t offer much help.
With “CVS” or ‘drw2’ for mac, the header contains ranges inside a structure, and other data like if it was compressed. When we see if it’s a valid file we check the ranges. There is no easy way to determine what hex values would be written because of flipping, Intel vs (PPC or 68K). Unfortunately, the research needed to identify the Hex value will require the original code for version 3.5 which we do not have access to easily. Canvas 3.5 code is 16 bit… this would also be an issue.
In the version 2 & 3 samples you can see some patterns, which I thought would allow for proper identification, but looking at more samples I found differences. One pattern I was hopeful might be consistent was the hex values “002000400060008000C00140018001C002400280”, but there are some which don’t match this pattern. If the file is truly compressed, it will be hard to know which values would be consistent among all files. I have over 8,000 samples and have a signature that only excludes around 20, so it will have to do for now.
When we start with Version 5 we get into some more identifiable headers, although there is some oddness with some samples. With an ascii string like “CANVAS5”, it should be easy, right? Not so fast. In version 5 you can compress the file structure, which removes the easily identifiable “CANVAS5” string. Some compressed files have a small string at the tail end, but others do not.
Canvas 6 uses a new extension, but has a similar structure to the file format, with compression as an option. Some of the compressed files on Windows have a reversed string, “5VNC”. Many Canvas 5 compressed files look identical to Canvas 6 compressed files, complicating identification.
While most have the “CANVAS6” string near the beginning, quite a few are missing the CNV5/5VNC string at the end. Instead, many have the string “%SI-0200” near the end, which I use in my signature suggestion. This structure remained the same from version 6 to 8.
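When sorting through thousands of samples it helps to tabulate which of these strings each file actually contains. The sketch below is one hypothetical way to do that; the 512-byte windows at the start and end of the file are arbitrary assumptions on my part, since the exact offsets vary between samples.

```python
def canvas_markers(data: bytes, window: int = 512) -> dict[str, bool]:
    """Report which of the known Canvas strings appear near the start or end."""
    head, tail = data[:window], data[-window:]
    return {
        "CANVAS5": b"CANVAS5" in head,
        "CANVAS6": b"CANVAS6" in head,
        "CNV5/5VNC": b"CNV5" in tail or b"5VNC" in tail,
        "%SI-0200": b"%SI-0200" in tail,
    }
```

Counting the combinations across a sample set gives a quick picture of which patterns are consistent enough to build a signature on, and which files fall through as exceptions.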
In version 9 and forward we have an extension change to CVX, but the format is similar, with the “CANVAS6” string at a slightly different offset. It is still used with the current version of Canvas X.
This collection of file formats is very hard to make sense of. There are some really great consistent patterns on many samples, with lots of exceptions. Super confusing. This software has had a long run, with the latter years staying pretty stagnant in terms of new development. It is worth defining and creating a signature for the consistent patterns; then we can dial in the variants over time.
The signatures I have built miss about 23 files in versions 1-3 out of the ~9000 samples I have and for Canvas 5, only some of the compressed files are currently not identified. But so far all my CNV and CVX files identify correctly, so probably good for now.
CanvasX dropped support for the Macintosh, but did release an entirely different product called Canvas X Draw, which does support the Macintosh. Here is what a CVD file looks like:
There is also the matter of a Canvas Image, which the User Guide calls proxy images. They are Raster images used in placements within Canvas Documents. Should be easy to identify.
Phew, if you held on for this whole post you must really like confusing file format structures. This format has been on my mind on and off for about 6 years. Hopefully these signatures will work for the vast majority of the Canvas files found in archives and personal systems. As always here is my GitHub with the signatures I am proposing and a few samples to get you confused.
There are probably many reasons why a software developer might want to create a proprietary format to store their files in. The software may require special features that don’t fit into an existing format. I would hope a developer would try to use existing formats, or even better open formats, but for many reasons, which probably include profits, they choose to re-invent the wheel often.
MAGIX is a German company which started making software in 1994. In 2001 they developed their first video editing software which was called Movie Edit Pro. The software seems to be well received and is still in use today.
Like most video editing software, project files are used to store all the edits and links to video files. These are usually small, text-based files, with many using XML as the project format. Not MAGIX; they decided to go with a different yet known format for their project files.
Yes, they used the RIFF container format for their projects. Seems an odd choice, especially for video production although it is well suited for it. AVI is another video format which uses the RIFF container. The MVP project file uses the ID SEKD with the format MVPH. Earlier versions of Movie Edit Pro used a different extension.
The MVD format used on an earlier version of Movie Edit Pro is also a RIFF, and with the ID of SEKD, but has a format of SVIP.
RIFFpad can break down the chunks we see in an MVP file. Each of the LIST chunks has its own subchunks as well. I assume this is how the editing software stores each video/audio track reference, etc. So I give it to MAGIX for at least using an understandable format to store their projects.
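If you don’t have RIFFpad handy, the same top-level breakdown can be approximated in a few lines of Python. This is a generic RIFF walker based on the standard chunk layout (four-byte ID, four-byte little-endian size, data padded to an even length); it knows nothing MAGIX specific, and the function name is my own.

```python
import struct


def riff_chunks(data: bytes):
    """Return (form_type, [(chunk_id, size), ...]) for the top-level chunks."""
    if data[:4] != b"RIFF":
        raise ValueError("not a RIFF file")
    riff_size = struct.unpack_from("<I", data, 4)[0]
    form_type = data[8:12].decode("ascii", "replace")
    end = min(len(data), 8 + riff_size)

    chunks, pos = [], 12
    while pos + 8 <= end:
        chunk_id = data[pos:pos + 4].decode("ascii", "replace")
        size = struct.unpack_from("<I", data, pos + 4)[0]
        chunks.append((chunk_id, size))
        pos += 8 + size + (size & 1)  # skip the data plus any pad byte
    return form_type, chunks
```

It only lists the outermost chunks; descending into each LIST chunk would mean repeating the same loop on that chunk’s data, starting after its four-byte list type.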
MAGIX has also used RIFF in many of its supporting formats. So far I have found mfx, afx, ifx, cfx, ctf, tfx, ufx, mmt, mmm, hdp, each having their own format:
Not sure of the best way to manage all of these in terms of identification, as I am not sure what the purpose of each format is. Maybe for now I’ll make a generic signature to catch them all as a MAGIX File.
Extension | ID   | FORMAT
AFX       | SEKD | SAFX
CFX       | SEKD | SCFX
CTF       | SEKD | SVIP
HDP       | SEKD | SHDP
IFX       | SEKD | SIFX
MFX       | SEKD | MAFX
MMM       | SEKD | SVIP
MMT       | SEKD | SVIP
MVD       | SEKD | SVIP
MVP       | SEKD | MVPH
MXM       | MXMD | mxmi
TFX       | SEKD | STFX
UFX       | SEKD | SVIP
But, when it comes to their proprietary MAGIX Video format, I think they may have pushed things a little too far. Meet the MXV format:
I am not sure what I am looking at, is it a RIFF? Is it a RIFF variant like RF64? MAGIX claims the format is:
This is the MAGIX video format for quicker processing with MAGIX products. It offers very low loss of quality, but it cannot be played via conventional DVD players.
A look around the internet doesn’t bring much up in reference to this format. Just my recent page on the format wiki. A search for MXRIFF64 brings up nothing. But a closer look at other strings within the MXV file reveals we are probably looking at some sort of MPEG format.
I was able to locate a project on GitHub which claims to be able to demux the MXV format. The software is written in Go and appears to indicate this format is chunk based, with most of the chunks figured out. So if you find yourself stuck with some MXV files and don’t want to use the latest from MAGIX, this might be the tool for you.
This demuxer also has an interesting file you can download. It is called a “GRAMMAR” file and can be loaded into hex viewers like Synalyze It!, which can then show the parts of a file you load. It’s a great way to explore a format!
None of these formats are found in PRONOM. Project files are not usually kept in archives, but it would be good to know about the RIFF files if they do turn up. The video format is for sure something the archival world should know about. MediaInfo is currently not aware of this format, but adding it seems like it might be an easy task.
As usual, you can see some samples and my proposal signatures on my GitHub.
I came across another CD-ROM the other day with some fun embroidery formats. It includes the HUS format I recently posted on, plus a few more.
Like I mentioned before, this is a format genre which is not normally seen in the archival world, but it is fun to take a peek into the world of embroidery formats. The HUS format from Husqvarna was a unique proprietary format, but looking at another in this set, we see a common container format.
filename : 'CH1604.ofm'
filesize : 25600
modified : 2002-04-29T05:58:26-06:00
errors :
matches :
- ns : 'pronom'
id : 'fmt/111'
format : 'OLE2 Compound Document Format'
version :
mime :
class : 'Text (Structured)'
basis : 'byte match at 0, 30'
First, what is an OFM file? It is the native format for Melco branded embroidery machines. Melco has been around since 1972, but I’m sure the format is much newer. The fact that it is in an OLE container would indicate it was created in the mid 1990s.
The EdsIV Object seems specific. Looking back at the web archive it looks like EDS IV was software available for the Melco products. In a user manual there are three formats associated with the software:
.CND – Condensed Format
.EXP – Expanded Format
.OFM – Project (Layout format)
The EdsIV Object file is unique and will work well for identification. There also seems to be some common patterns within the file that can further the correct identification.
Currently Melco distributes a different software for use with their embroidery machines. Their DesignShop software also works with the OFM format. Downloading a copy of version 11 and using the trial version I get access to a few OFM sample files. Let’s see if they are the same.
Well, that is very different from the earlier example. We can see right away this is a different type of file; in fact the first few bytes tell us this is another container format. The Resource Interchange File Format (RIFF) is used in many file formats, the most popular being WAVE, AVI, and CorelDRAW. It is a chunk based format and there are a few tools we can use to look closer.
Riffpad can open the file, but claims there is some extra data at the end. It does see four chunks and it gives us the code “OFM8”, which is what identifies this particular RIFF type.
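With two different containers hiding behind the same .ofm extension, a quick check of the first few bytes tells them apart. The sketch below uses the well known eight-byte OLE2 compound file signature; placing “OFM8” at offset 8, where a RIFF form type normally sits, is my assumption from what Riffpad reports, and the function name is hypothetical.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # compound file signature at offset 0


def ofm_container(data: bytes) -> str:
    """Classify an OFM file by its container: 'OLE2', 'RIFF/OFM8' or 'unknown'."""
    if data[:8] == OLE2_MAGIC:
        return "OLE2"
    # Assumption: OFM8 is the RIFF form type stored right after the size field.
    if data[:4] == b"RIFF" and data[8:12] == b"OFM8":
        return "RIFF/OFM8"
    return "unknown"
```

Running it over a folder of samples is a fast way to see which files came from the older EDS IV style of OFM and which from the newer RIFF flavour.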
I was also able to get some samples from version 10 of DesignShop and found they use the same OLE container, with the same “EdsIV Object” inside. There is a small paragraph in the EdsIV user manual that indicates there is some versioning within the OFM format.
If you open an EDS III .OFM file and save it, it will be converted into an EDS IV .OFM file, which is no longer readable in EDS III. Files saved in this version of EDS IV cannot be read by previous versions of EDS IV.
This version of EDS IV is capable of producing two types of OFM files. Files saved as “Melco Project File (.ofm)” can only be read with this version or higher versions of EDS IV. Files saved as “Melco Version 2.00 (.ofm)” can be read by any EDS IV user that has version 2.00.006 or higher software.
It never ceases to amaze me how many formats use the Compound Object Container format. It seems like more and more are being documented all the time. For now, I made a signature to identify the OLE and RIFF versions of OFM. I’ll keep my eye out for the older EDS III and other related formats. As always, you can find my signatures and a sample file on my GitHub.
I think when most of us have some data to sort or make sense of, we tend to gravitate toward a spreadsheet, using Excel or LibreOffice, or if you really like to party, OpenRefine. There are plenty of memes out there representing the frustration people have with the bugs, features and limitations of Excel specifically.
Optimist: The glass is ½ full. Pessimist: The glass is ½ empty. Excel: The glass is January 2nd.
There are more tools out there for making sense of data. One that some people have access to is Microsoft’s more advanced Power BI tool. Marketed as a data visualization tool, it is accessible to many with an Office 365 subscription. It offers more features than Excel and isn’t as limited in row maximums.
Power BI was recently the topic of a Code4Lib editorial issue. The writer of an article for their journal posted two Power BI datasets, which a reader later noticed contained private data. After some miscommunications and misunderstandings, an open letter was drafted and received some support. Code4Lib did release a statement, and lessons were learned.
One statement from the Code4Lib staff caught my eye. “The released files were in a proprietary file format, Microsoft Power BI, with which none of the editors have experience.”
We all use the tools we are most familiar with or that are available to us. No one can be an expert in all file formats. Some of us try, but things change so fast it is impossible. But we can do more in documenting formats and making them identifiable through the tools we use for digital preservation. The File Format Wiki and PRONOM have had no mention of Power BI, so let’s change that.
Microsoft Power BI was released in 2011 and is part of the Microsoft Power Platform. Power BI can gather data from many sources. The software can be accessed in the Office 365 cloud, but also through a desktop application. In the desktop application, all the data sources and connections are stored in a single file with the extension PBIX. But there are other related formats.
Just like many modern Microsoft formats it is a ZIP container with a mixture of XML and JSON. There is also a DataModel file along with Settings and Connections. A quick peek at some of the contents shows us:
So it looks like the ZIP structure follows the standard for OpenXML packages, as it contains a “[Content_Types].xml” file. Using this XML alone would clash with too many other formats. From what I could find, the “DataModel” file is what stores the data and is more unique to this format, even though the name is pretty generic. Using a string within the file would probably make identification more accurate. The “DataModel” file does have Unicode double-byte strings we can use. “STREAM_STORAGE_SIGNATURE” seems like a unique enough string, but it may not be unique to PBIX: the “DataModel” file appears to be in Microsoft’s “MS-XLDM” format, the “Spreadsheet Data Model File Format”.
There is a variation of the DataModel file, and I am not sure when the standard form is used versus this variation, which contains the string “This backup was created using XPress9 compression”. I am not sure if it is a matter of versioning or of how the file is saved, but both seem to function correctly.
After a bit of digging, it seems the MS-XLDM format can also be found within an XLSX file. I found an example with these datasets. Within an XLSX there can be a file named “xl/model/item.data”, and it has the same structure as the DataModel within a PBIX.
Because this file has a different filename and is in a different path, using “DataModel” should keep identification specific to a PBIX file.
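As a rough illustration of that reasoning, the sketch below sorts a ZIP package by which member names it contains. The function name and the returned labels are my own invention for this example, and it only looks at file names; it does not check the contents of the data model itself.

```python
import io
import zipfile


def classify_package(zip_bytes: bytes) -> str:
    """Guess the package type from member names alone (a naming heuristic, not a parser)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = set(zf.namelist())
    # Without [Content_Types].xml this is not an OpenXML-style package at all.
    if "[Content_Types].xml" not in names:
        return "not an OpenXML-style package"
    # A top-level "DataModel" member is what PBIX files carry.
    if "DataModel" in names:
        return "pbix"
    # The same kind of data model lives under a different path in an XLSX.
    if "xl/model/item.data" in names:
        return "xlsx with data model"
    return "other package"
```

Because the two paths never overlap, checking for the top-level “DataModel” name first keeps an XLSX with an embedded data model from being mistaken for a PBIX.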
Power BI reports also have a template option. This format uses the .PBIT extension and doesn’t contain any data, only a template to use with other data. The structure is roughly the same, but instead of the “DataModel” file it contains “DataModelSchema”, which appears to be a JSON file.
The DataModelSchema JSON has some plain text strings which could be used for identification. Later in the file there is a string, “defaultPowerBIDataSourceVersion”.
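A quick way to test that idea on a sample is to open the ZIP and search the “DataModelSchema” member for the string. This is only a sketch with a function name of my own choosing. I have not confirmed how the JSON is encoded, so as an assumption the search tries both UTF-8 and UTF-16 little-endian.

```python
import io
import zipfile

MARKER = "defaultPowerBIDataSourceVersion"


def looks_like_pbit(zip_bytes: bytes) -> bool:
    """True if the ZIP has a DataModelSchema member containing the marker string."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        if "DataModelSchema" not in zf.namelist():
            return False
        schema = zf.read("DataModelSchema")
    # Assumption: the JSON may be stored as UTF-8 or as UTF-16LE, so try both.
    return any(MARKER.encode(enc) in schema for enc in ("utf-8", "utf-16-le"))
```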
In the Classic Macintosh world back in the day, it was important to use compression tools to keep files small and to allow you to send Macintosh files through the internet. Floppy disks could only hold a small amount of data, so compression was a way to use the space effectively. I have already made posts on BINHEX and DiskDoubler, which were also used for similar purposes. The most popular compression software for Macintosh was StuffIt, which used the .SIT and .SEA extensions. Another often-used tool was Compact Pro.
Compact Pro, originally known as Compactor, was developed by Bill Goodman in the early 1990s and was quite popular. It was generally faster at compressing and decompressing files on the Macintosh. The last version was released by 1995, and by 2002 the software was officially discontinued.
Also, Macintosh files often contain a Resource Fork to go along with the data. A Compact Pro archive could contain both forks along with the creation and modification dates and the Finder Type/Creator codes. An archive could then be transferred through the internet or stored on a non-Macintosh file system without losing these key bits of information.
You can see from the image below, the compression of a PICT file retained the resource fork and finder data with an impressive 60% savings in size.
PICT File within a Compact Pro archive.
Compact Pro could also segment an archive into multiple parts. This was advantageous when copying a larger file onto a set of floppy disks, or when transferring smaller pieces through the internet to be combined later. Segments would be extracted by opening the final segment.
The other nifty feature of Compact Pro is that it could create a Self-Extracting Archive. Archiving as an SEA would compress the file into an archive, but wrap it in an application which could extract the archive without the full Compact Pro application. This was mainly useful on disks distributed with a Macintosh file system, as the application could only be run on Mac OS.
The file format is not recognized by PRONOM, and as you can see from the headers above, identification is not easy as there are no magic bytes. The Unarchiver’s lsar tool does identify them as Compact Pro.
lsar CP-s01.cpt
CP-s01.cpt: Compact Pro
CP.PICT
The only bytes which seem to be consistent are the first two, but “01 01” is not a signature which is unique to Compact Pro. From what I can tell, The Unarchiver uses a more complicated calculation involving the file size and the CRC for identification.
The self-extracting archive has the same basic structure. I have also noticed that on all the archive samples I have, the byte at offset 8 is always “80”. This could be significant.
Another thing to note, when looking at a segmented archive, the first two bytes are in sequence, 0101 for the first, 0102 for the second and so on.
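Those observations can be put together as a very weak heuristic. The sketch below is based only on the patterns seen in my samples, not on any specification, and the function name is made up for this example. It treats the first byte as a fixed 01, reads the second byte as the segment number, and separately reports whether the byte at offset 8 is 80. It is a hint at best, not a reliable signature.

```python
def compact_pro_hint(header: bytes):
    """Return (segment_number, offset8_is_80) if the header fits the observed pattern, else None.

    Assumptions taken from sample files only: byte 0 is 0x01, byte 1 is the
    segment number starting at 1, and byte 8 has been 0x80 in every sample.
    """
    if len(header) < 9 or header[0] != 0x01:
        return None
    segment = header[1]
    if segment < 1:
        return None
    return segment, header[8] == 0x80
```

Given how many unrelated files start with “01 01”, this would need to be combined with something stronger, such as the size and CRC checks The Unarchiver appears to use.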
Is there a perfect raster image format? TIFF has been around quite some time and is generally accepted as a preferred preservation format. There have been a few attempts to have a single file contain multiple resolutions for different uses: lower resolution for the web and higher resolution for print. Even the semi-popular JPEG2000 added multiple resolutions to improve on the JPEG format. Kodak came up with a few ideas to do this as well. Kodak’s PCD format, also known as PhotoCD or Image PAC, was one that was used for a while before it was abandoned. Another was FlashPix.
I briefly mentioned FlashPix in an earlier post about the Microsoft Picture It! format. They are extremely similar. Both have the same basic structure in a Compound Object format. Some of the FlashPix files generated by Picture It! even have the same identifiers in the CompObj header.
FlashPix was supposed to be the answer to all the problems with storing bitmap image data and how we view the web. Kodak partnered with some big names; Microsoft Corporation, Hewlett-Packard Company and Live Picture, Inc. were among them. Kodak marketed the format and even included it as a native file format in some of its new digital cameras. The format was made official in June of 1996, with a white paper explaining all the benefits and architecture. There was a lot of hype, some even calling it “Not your Grandma’s format”. Many graphics applications started to include support for the new format, including Adobe Photoshop. So what happened? Why didn’t the format catch on? Some say it was the size of storing multiple resolutions in one file; others believe it was the complicated Compound Object structure that led to its demise. Either way, the format had a lot of hype in the late 1990s, but by the year 2000 it had gone silent and all the websites went away.
FlashPix did have a big impact, and many software and hardware products were made compatible. There are a few stories left behind of those who scanned all their photos to the FlashPix format only to find a few years later it was unsupported on more modern computers. There were also a few early digital cameras which could capture directly to the format. Take my Kodak DC260 zoom camera, circa 1998. By changing the Capture Preferences, I can switch between JPG and FPX.
Using exiftool we can take a look at one of the images from the camera:
exiftool P0004795.FPX
ExifTool Version Number : 12.73
File Name : P0004795.FPX
Directory : GitHub/digicam_corpus/Kodak/DC260/DC260_01
File Size : 251 kB
File Modification Date/Time : 2024:01:06 12:54:20-07:00
File Access Date/Time : 2024:01:06 13:20:46-07:00
File Inode Change Date/Time : 2024:01:06 13:04:34-07:00
File Permissions : -rwxrwxrwx
File Type : FPX
File Type Extension : fpx
MIME Type : image/vnd.fpx
Code Page : Unicode UTF-16, little endian
Data Object ID : 13BC5A58-6B90-1B6B-12C9-0800201177F8
Data Object Status : Exists, Not Purgeable
Creating Transform : Source Image
Using Transforms :
Cached Image Height : 1024
Cached Image Width : 1536
Comp Obj User Type Len : 16
Comp Obj User Type : FlashPix_Object
Visible Outputs : 1
Maximum Image Index : 1
Maximum Transform Index : 0
Maximum Operation Index : 0
Thumbnail Clip : (Binary data 18480 bytes, use -b option to extract)
Revision Number : 1
Create Date : 2024:01:06 12:53:29
Modify Date : 2024:01:06 12:53:29
Software : KODAK DIGITAL SCIENCE DC260
Image Width : 1536
Image Height : 1024
Subimage Width : 1536
Subimage Height : 1024
Subimage Color : RGB
Subimage Numerical Format : 8-bit, Unsigned
Decimation Method : None (Full-sized Image)
JPEG Tables : (Binary data 558 bytes, use -b option to extract)
Number Of Resolutions : 1
Max JPEG Table Index : 1
Scene Type : Original Scene
Software Release : KODAK DIGITAL SCIENCE DC260
Make : Eastman Kodak Company
Camera Model Name : KODAK DIGITAL SCIENCE DC260
Serial Number : 7577
Exposure Time : 1/180
F Number : 4.7
Exposure Program : Program AE
Exposure Compensation : 0
Subject Distance : 0.520 m
Metering Mode : Center-weighted average
Light Source : Unknown
Focal Length : 24.0 mm
Max Aperture Value : 4.6
Flash : No Flash
Exposure Index : 90
Sharpness Approximation : 0
File Source : Digital Camera
Sensing Method : One-chip color area
Extension Create Date : 2024:01:06 12:53:29
Extension Modify Date : 2024:01:06 12:53:29
Creating Application : Picoss
Extension Name : ijuhsimasa
Extension Persistence : Always Valid
Extension Description : Data Object Store 000001
Storage-Stream Pathname : /Data Object Store 000001
Extension Class ID : 56616000-C154-11CE-8553-00AA00A1F95B
Used Extension Numbers : 1
Screen Nail : (Binary data 4304 bytes, use -b option to extract)
Subimage Tile Count : 384
Subimage Tile Width : 64
Subimage Tile Height : 64
Num Channels : 3
Audio Stream : (Binary data 30780 bytes, use -b option to extract)
Aperture : 4.7
Image Size : 1536x1024
Megapixels : 1.6
Shutter Speed : 1/180
Preview Image : (Binary data 4164 bytes, use -b option to extract)
Focal Length : 24.0 mm
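One detail in that output is easy to sanity check. FlashPix stores each resolution as a grid of tiles, and the reported tile count follows directly from the image and tile dimensions. A small sketch (the function name is mine) using ceiling division, so partially filled tiles at the edges would also be counted:

```python
def tile_count(width: int, height: int, tile: int = 64) -> int:
    """Number of tile x tile squares needed to cover a width x height image."""
    columns = -(-width // tile)  # ceiling division
    rows = -(-height // tile)
    return columns * rows


# 1536 / 64 = 24 columns and 1024 / 64 = 16 rows, so 24 * 16 = 384 tiles,
# which matches the "Subimage Tile Count" reported by exiftool.
print(tile_count(1536, 1024))
```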
The file is also identified by PRONOM, here using Siegfried:
sf P0004795.FPX
---
siegfried : 1.11.0
scandate : 2024-01-17T23:13:59-07:00
signature : default.sig
created : 2023-12-17T15:54:41+01:00
identifiers :
- name : 'pronom'
details : 'DROID_SignatureFile_V116.xml; container-signature-20231127.xml'
---
filename : 'P0004795.FPX'
filesize : 250880
modified : 2024-01-06T12:54:20-07:00
errors :
matches :
- ns : 'pronom'
id : 'x-fmt/56'
format : 'Kodak FlashPix Image'
version :
mime : 'image/vnd.fpx'
class : 'Image (Raster)'
basis : 'extension match fpx; container name CompObj with byte match at 53, 36 (signature 2/2)'
warning :
If you notice, PRONOM has two signatures for the FlashPix format; this image was identified with signature #2. The first signature looks for the string “FlashPix Object”, while the second looks for the CLSID, which is unique to each compound object format. FlashPix has the CLSID {56616700-c154-11ce-8553-00aa00a1f95b}. Looking at the many other samples I have, there is a lot of variation in the use of the string and CLSID.
The images from the Kodak camera use the string “FlashPix_Object”, so with the underscore they don’t match the first signature, but others I made using Picture It! used a couple of variations. Many don’t use the string at all. Others use a slightly different CLSID, in both uppercase and lowercase. We will have to suggest adjustments to the current signatures to identify them all.
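To show why case matters for a signature, here is a hedged sketch of a more tolerant CLSID search. It looks for the CLSID in two forms: as ASCII text compared case-insensitively, and in the mixed-endian 16-byte binary layout that Windows uses for GUIDs. The function name is hypothetical, and I have not verified which form any particular CompObj stream stores, so treat this as a way to explore samples rather than a finished signature.

```python
import uuid

FLASHPIX_CLSID = "56616700-c154-11ce-8553-00aa00a1f95b"


def find_clsid(stream: bytes, clsid: str = FLASHPIX_CLSID):
    """Return which encodings of clsid occur in stream: 'text' and/or 'binary'."""
    found = []
    # Text form: lowercase both sides so upper- and lowercase variants match.
    if clsid.lower().encode("ascii") in stream.lower():
        found.append("text")
    # Binary form: the first three GUID fields are stored little-endian.
    if uuid.UUID(clsid).bytes_le in stream:
        found.append("binary")
    return found
```

Passing a different `clsid` value makes it easy to test the slightly different CLSIDs seen in some of the Picture It! samples.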
Looking at the contents of the OLE container we can see some interesting things.
Path = P0004795.FPX
Type = Compound
Physical Size = 250880
Extension = compound
Cluster Size = 512
Sector Size = 64
Size Compressed Name
------------ ------------ ------------------------
188 192 [5]Data Object 000001
272 320 [1]CompObj
388 448 [5]Extension List
144 192 [5]Global Info
Data Object Store 000001
18704 18944 [5]SummaryInformation
816 832 Data Object Store 000001/[5]Image Contents
272 320 Data Object Store 000001/[1]CompObj
988 1024 Data Object Store 000001/[5]Extension List
1624 1664 Data Object Store 000001/[5]Image Info
4332 4608 Data Object Store 000001/[5]Screen Nail_bd0100609719a180
Data Object Store 000001/Resolution 0005
Data Object Store 000001/Audio_bd0100609719a180
1112 1152 Data Object Store 000001/[5]KDC_bd0100609719a180
72 128 Data Object Store 000001/[5]SummaryInformation
108 128 Data Object Store 000001/Audio_bd0100609719a180/[5]Audio Info
30808 31232 Data Object Store 000001/Audio_bd0100609719a180/Audio Stream 000000
6208 6656 Data Object Store 000001/Resolution 0005/Subimage 0000 Header
176378 176640 Data Object Store 000001/Resolution 0005/Subimage 0000 Data
------------ ------------ ------------------------
242414 244480 16 files, 3 folders
The main CompObj is where we find the identification information, but the Data Object Store 000001 directory is where all the image data is stored. In a multiple-resolution image we might see additional Resolution directories. You may also notice a mention of an Audio directory. Yes, this image was captured and then audio was recorded with it. Not a video, but an audio clip associated with the image. FlashPix can contain audio streams. This isn’t the first time we have seen this: HP cameras also have this function, which as it turns out is stored in a FlashPix Exif extension within a JPEG.
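The “Cluster Size = 512” and “Sector Size = 64” values in the listing above appear to correspond to the sector and mini-sector sizes recorded in the compound file header. As a minimal sketch, assuming the standard header layout (8-byte signature, then the two 16-bit shift values at offsets 30 and 32), they can be read like this. The function name is my own.

```python
import struct

OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def ole_sector_sizes(header: bytes):
    """Return (sector_size, mini_sector_size) from the start of an OLE2 compound file."""
    if len(header) < 34 or header[:8] != OLE_MAGIC:
        raise ValueError("not an OLE2 compound file")
    # Offsets 30 and 32 hold the sizes as powers of two (e.g. 9 -> 512, 6 -> 64).
    sector_shift, mini_shift = struct.unpack_from("<HH", header, 30)
    return 1 << sector_shift, 1 << mini_shift
```

For example, `ole_sector_sizes(open("P0004795.FPX", "rb").read(512))` would be expected to return the same pair of sizes the listing shows, if my reading of those two fields is right.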
The FlashPix native format may have disappeared, but the format lives on as an extension to Exif data, allowing you to embed audio and other media within a JPEG file. The code for FlashPix was given to ImageMagick and is maintained by them.