Student Writing Center

When it comes to difficult file formats, one of the more difficult groups is word processing text files. They are difficult for many reasons: one is the sheer number of them, another is their lack of identifiable headers. Just when you think you have seen them all, another pops up to add to the mix.

In a batch of other known word processing formats I came across a few files with no extension and with the following header:

The rest of the file was binary, so the only thing I had to go on was the strings “TLC” and “FF”. A few searches across the interwebs didn’t reveal much; it seems it wasn’t a well documented format. The names of the files and the fact they were with other word processing formats led me to assume they were also some sort of document format. The date stamps were still intact and I could see they were from the mid 1990’s. It took a few creative searches before I wondered if the “TLC” might have something to do with “The Learning Company”. If it did, I still had quite a bit of work ahead, as the software developer had produced quite a few titles over the years. You probably remember the “Reader Rabbit” series of educational games.

After a bit of time I narrowed it down to a few titles and started looking for samples of each. Software was hard to find as well. I tried opening the file in a few different programs until I finally came to one called “Student Writing Center”. That may sound familiar to some of you, but there were some variations on this name out there. Some of which are:

  • Student Writing Center
  • Student Writing & Publishing Center 
  • The Children’s Writing & Publishing Center
  • The Writing Center
  • Ultimate Writing & Creativity Center

There were probably others, considering the budget software company started in 1980 and made titles for a few computer platforms starting with the Apple II. The story behind the company is a fun read.

The Student Writing Center was a simple word processor aimed at students 10 years old and older. It was found in many schools right alongside Kid Pix, another very popular graphic program for kids. The software had a few different document types to help students get started writing their book reports or journal entries.

The Student Writing Center ran on both Macintosh and Windows allowing it to be one of the more popular writing tools for the younger crowd.

Each document type had a unique interface and save menu, and on Windows each saved with its own extension: .RP, .NL, .JN, .LT, and .SG. Each also had a slightly different header.

Reports:        1A544C43 01464600 0000
Newsletters:    1A544C43 00464600 0300
Journals:       1A544C43 00464600 0100
Letters:        1A544C43 00464600 0400
Signs:          1A544C43 00464600 0200

The signatures submitted to PRONOM take into account endianness for Windows and Macintosh, with the last two byte locations being swapped. Also, every document had the hex values “46461A” (the string “FF” followed by 0x1A) at the end of the file.
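A rough sketch of how the header table above could be checked outside of PRONOM, written in Python. The function name identify_swc and the lookup table are hypothetical, only the byte values come from the table. It assumes one of the last two header bytes is always zero, so either byte order is accepted, and it ignores the fifth byte because it differs between Reports and the other types.

```python
# Type codes taken from the last two header bytes listed in the table.
SWC_TYPES = {0: "Report", 1: "Journal", 2: "Sign", 3: "Newsletter", 4: "Letter"}


def identify_swc(data: bytes):
    """Return the Student Writing Center document type, or None if no match."""
    if len(data) < 10:
        return None
    # 0x1A followed by "TLC", one byte that varies, then "FF" and 0x00.
    if data[0:4] != b"\x1aTLC" or data[5:8] != b"FF\x00":
        return None
    first, second = data[8], data[9]
    # Assumption: one of the two bytes is always zero, and the order
    # depends on whether the file was written on Windows or Macintosh.
    if second == 0:
        code = first
    elif first == 0:
        code = second
    else:
        return None
    return SWC_TYPES.get(code)


if __name__ == "__main__":
    header = bytes.fromhex("1A544C43004646000300")
    print(identify_swc(header))
```

This only looks at the start of the file; a stricter check could also confirm the trailing bytes mentioned above.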

But wait! Just when you think you had it figured out…….

The two files may look similar, but they are different formats and are not compatible with each other. The little brother to the Student Writing Center was called “Ultimate Writing & Creativity Center” and was made for younger kids, ages 6-10. It had more of a cartoon interface and a cute little fountain pen teacher to walk you through the writing process.

When you saved your file in UWCC, you could choose between formats and I guess move your documents up to the more advanced program once you turned 10! If you would like to experience or re-live the opening sequence, enjoy.

I’m not done yet………

To complicate things even more The Learning Company also released another word processor called “The Writing Center“. This gets confused with Student Writing Center frequently.

But unlike the two others, this format is very different.

We’ll have to save this format for another day.

There seems to be a never ending list of word processor formats, with no end in sight. But if you used a school computer back in the early 1990’s and still have your floppy disk from back then, hopefully now you can open that report you wrote on Abraham Lincoln.


A few years ago someone contacted me with a desperate plea. She had a disk containing years of journal entries and letters to loved ones that she could no longer access. She had used a Macintosh in the late 1980’s and early 1990’s to create all these files, and wanted to convert them all to PDF so she could make a book. She said she had tried everything and contacted a lot of people, and her son had told her it was a lost cause. Others at my institution knew I had a background in older Macintosh formats, and so she contacted me. I made no promises, but offered to try.

The files she provided were indeed early Macintosh files. One obvious trait was the lack of an extension. One might think the lack of an extension was poor planning on Apple’s part, but they chose a different method for the operating system to know the relationship between files and applications. They did this through the use of a Type/Creator code. If you were a software developer for the Macintosh you could register a four character “Creator” code, then for all the different files you used with your software you could register a “Type” code. This told the Macintosh operating system exactly which software created the file and the type so it could be opened properly. This differs from today, where an extension defaults to one application even if that isn’t the software which created the file.

ResEdit view of Hypercard Stack Info

In some ways this was a superior identification method, as there were many software titles which could all create the same file format, but this way the correct software would open the file and render it correctly.

Looking at the files provided to me, there were a few which at first seemed damaged somehow: they were extremely small compared to the other files, about half the size. When I opened them in a hex editor this is what I saw.

Usually document formats during this time would keep the text in plain ASCII, but these files were different: they had binary data. The only plain text string in the file was in the header, “WDBNMSWD”. I had seen these codes before, a Microsoft Word Document! But they weren’t….. What are they?

The head of the file has the hex values “ABCD0054”, so I started searching the internet for some help. There were others having the same problem I was having. I finally came across a tool called the “Unarchiver”. After running the command line version of the software, “unar”, suddenly I had a file twice the size that could be opened by Microsoft Word!

unar Letter 
Letter: DiskDoubler
"./Letter" already exists.
Successfully extracted to "./Letter-1".

Remember back in the 1990’s when storage was expensive? Instead of dropping another $20 for a 100MB ZIP Disk, you could use Symantec’s DiskDoubler. The software would be installed on your Macintosh and then a window would come up showing you all the files on your drive. With one click you could compress a single file or a directory of files saving you tons of space. When you needed the file, just double click and the software would uncompress on the fly and then open the correct application to edit the file.

With a few clicks I was able to uncompress all the affected files and provide a PDF of all the letters and journals my new friend had tried so desperately for years to open. She was thrilled to say the least.

But why stop there? PRONOM needs to know about this format!

Once I had DiskDoubler installed I could make a few more samples, which is where I found there were a few different compression methods used by the software. They are labeled AD 1 & 2 and DD 1, 2 & 3. Making samples of each of the different types, I was able to confirm the first 4 bytes of every file were the hex values “ABCD0054”. I was able to submit the format to PRONOM and it was added and given the PUID fmt/1399.
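For a quick check outside of a PRONOM-based tool, the four-byte test can be sketched in a few lines of Python. The function names here are hypothetical, and the sketch only covers the “ABCD0054” prefix described above; it makes no attempt to tell the AD and DD compression methods apart.

```python
# The four bytes found at the start of every DiskDoubler sample.
DISKDOUBLER_MAGIC = bytes.fromhex("ABCD0054")


def is_diskdoubler(data: bytes) -> bool:
    """True when the data starts with the DiskDoubler magic bytes."""
    return data.startswith(DISKDOUBLER_MAGIC)


def is_diskdoubler_file(path: str) -> bool:
    """Read only the first four bytes of a file and test them."""
    with open(path, "rb") as handle:
        return is_diskdoubler(handle.read(4))
```

Calling is_diskdoubler_file on each file in a folder would be one way to find the compressed files before handing them to unar.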

One of the other features of DiskDoubler was the ability to create a Self Extracting Archive (SEA). An SEA file could contain a compressed file but also contained the code to uncompress itself. This was mostly seen with the StuffIt software, but there were many other compression tools which could write to this format. The StuffIt formats have been added to PRONOM, including identification of an SEA created by StuffIt, but the SEA created by DiskDoubler is different and needs to be added.

Shockwave Audio

Ok, confession time.

There are only a couple of moments in my tech history which had a profound effect on me, enough to sear the memory of the moment into my brain. When I was in college around 1997 I had a decent CD collection and I had learned how to copy those AIFF files off the disc and use them on my trusty PowerCenter Pro. These files were huge, at the time. I knew a regular size song would take up around 50MB on my hard drive. This was a lot of space back in 1997, but I could then mix them with other songs, something I did sometimes for friends I had on the dance team. I didn’t have a CD burner at the time so I would transfer them to cassette tape. I know, but remember this was the 1990’s when everything was changing and expensive.

One night I was exploring the world wide web and I happened across someone sharing a few songs. I assumed they were just clips as they were only 5MB in size, a tenth the size they should be. I downloaded the song, which of course still took a few minutes back in those days. When I played the song, I was dumbfounded, it was the whole song. I was completely confused. How could they take a 4+ minute song and compress it down to under 5MB? This was amazing.

I started grabbing every song I could find. Before long I had quite the collection. And before you judge me for downloading music from the web, this was a couple years before the advertisement we all remember reminding us that we wouldn’t steal a car so why would we steal music.

The files I found on the internet were MP3 files, the same we are familiar with today. Back then creating MP3 files wasn’t easy. MP3 was actually a licensed product, so you had to get a little creative in order to make them. On my Macintosh PowerCenter Pro, there were even fewer options. I was already familiar with the sound editing application from Macromedia called SoundEdit 16; it was the tool I used to do all my editing. I found there was a plugin you could add which allowed export to a format called Shockwave Audio. This was meant for use in Macromedia’s Director application to add sound to the growing Flash animation industry. Once I got the plugin installed I couldn’t stop making files, and I made them as fast as I could. For a whole album this could take over an hour on my hardware, but it was worth it. Before long I had a large collection of popular music ready to play at a moment’s notice. My player of choice was MacAMP, a sibling of the popular WinAMP. I even borrowed some equipment from a friend who DJ’d on the weekends and DJ’d a college dance. I lugged my whole PowerCenter Pro tower and 17in Trinitron monitor over to the school. It was so much fun, and folks didn’t understand when they asked to see my CD collection.

Enough about transgressions from my youth, let’s talk about the Shockwave Audio format.

To create a SWA file you would first need SoundEdit 16 Version 2. Then the plugins to enable export. This would only run on PowerPC computers running Macintosh OS or Classic in Mac OS X. For this post I pulled out my trusty PowerBook G4 Titanium running MacOS 9 and MacOS X 10.2. Installed SoundEdit 16 and the plugins in the Xtras folder and we are good to go.

Before you export you need to set the bitrate you prefer for the final file, with options from 8 kbit/s up to 160 kbit/s. The higher the bitrate, the longer the export took and the larger the file.
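The relationship between bitrate and file size is simple arithmetic, sketched here in Python. The function names are hypothetical, a kilobit is assumed to be 1000 bits, and the estimate ignores any file header, so treat the numbers as approximate.

```python
def compressed_size_bytes(bitrate_kbps: float, seconds: float) -> float:
    """Constant-bitrate stream size: bits per second times duration, in bytes."""
    return bitrate_kbps * 1000 / 8 * seconds


def cd_audio_size_bytes(seconds: float) -> float:
    """Uncompressed CD audio: 44,100 samples/s, 2 bytes per sample, 2 channels."""
    return 44100 * 2 * 2 * seconds


if __name__ == "__main__":
    four_minutes = 4 * 60
    print(f"CD audio: {cd_audio_size_bytes(four_minutes) / 1e6:.1f} MB")
    for kbps in (8, 64, 128, 160):
        size = compressed_size_bytes(kbps, four_minutes)
        print(f"{kbps:>3} kbit/s: {size / 1e6:.2f} MB")
```

For a four minute song this gives roughly 42 MB uncompressed and under 4 MB at 128 kbit/s, which lines up with the sizes described in the story above.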

SoundEdit 16 had a native audio format and also frequently used the SoundDesigner II format to save the uncompressed files. On a Macintosh you had to be careful as these formats did not travel well to other systems on account of the resource forks associated with the data.

Because these SWA files were meant to be used in websites and other non-Mac systems, they did not have a resource fork, but had the Creator/Type codes, SwaT/SHCK. An extension wasn’t necessary for use on your Macintosh, but it was best to use .swa.

Here is what the data looks like for a SWA file.

Even though the SWA format uses MPEG compression, this is not a typical header you might see in an MP3. There were no ID3 tags at the time, so there is not much in terms of metadata.

Complete name                            : tone2.swa
Format                                   : MPEG Audio
File size                                : 80.7 KiB
Duration                                 : 5 s 166 ms
Overall bit rate mode                    : Constant
Overall bit rate                         : 128 kb/s
FileExtension_Invalid                    : m1a mpa mpa1 mp1 m2a mpa2 mp2 mp3

Format                                   : MPEG Audio
Format version                           : Version 1
Format profile                           : Layer 3
Format settings                          : Joint stereo / MS Stereo
Duration                                 : 5 s 172 ms
Bit rate mode                            : Constant
Bit rate                                 : 128 kb/s
Channel(s)                               : 2 channels
Sampling rate                            : 44.1 kHz
Frame rate                               : 38.281 FPS (1152 SPF)
Compression mode                         : Lossy
Stream size                              : 80.7 KiB (100%)
ffprobe -i tone2.swa 
[mp3 @ 0x155704a60] Format mp3 detected only with low score of 25, misdetection possible!
[mp3 @ 0x155704a60] Skipping 324 bytes of junk at 0.
[mp3 @ 0x155704a60] Estimating duration from bitrate, this may be inaccurate
Input #0, mp3, from 'tone2.swa':
Duration: 00:00:05.15, start: 0.000000, bitrate: 128 kb/s
Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 128 kb/s

There are a few consistencies among all my files. They all begin with the hex values “00000140000000030000” for the first 10 bytes and all of them seem to have the string “MACRZ” at offset 36. I haven’t been able to find an open specification for this file format, so we will have to go with what we can find in the samples. According to the ffprobe output above, there are 324 bytes of header before the first MP3 frame starts.
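Those two observations are enough for a small test, sketched below in Python. The names are hypothetical and the values are only what was seen in these samples, not anything taken from a specification, so other SWA files may not match.

```python
# Values observed in the sample files, not taken from any specification.
SWA_PREFIX = bytes.fromhex("00000140000000030000")  # first 10 bytes
SWA_MARKER = b"MACRZ"                               # string seen at offset 36
SWA_MARKER_OFFSET = 36


def looks_like_swa(data: bytes) -> bool:
    """True when both the 10-byte prefix and the MACRZ marker are present."""
    marker_end = SWA_MARKER_OFFSET + len(SWA_MARKER)
    return (data.startswith(SWA_PREFIX)
            and data[SWA_MARKER_OFFSET:marker_end] == SWA_MARKER)
```

Requiring both values keeps the check from firing on an ordinary MP3, which starts directly with a frame rather than this header.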

MPEG signatures are difficult: there are no file headers, just a sequence of frames. This is why there are often so many identification conflicts with the MP3 format. These SWA files do indeed identify as MP3 files, but with an extension mismatch.

filename : 'tone2.swa'
filesize : 82661
modified : 1970-01-01T00:00:00-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/134'
    format  : 'MPEG 1/2 Audio Layer 3'
    version : 
    mime    : 'audio/mpeg'
    class   : 'Audio'
    basis   : 'byte match at 0, 4088 (signature 5/9)'
    warning : 'extension mismatch'

If we wanted to distinguish an SWA from an MP3 we would need to create a new signature and give it priority over the MP3 signature. There is enough of a header that this would be possible, and fairly easy, but since they are, in reality, just MP3 files, does it matter? Trying to play a SWA on a modern computer is only possible if you change the extension to MP3.

If you want to take a look at some samples you can grab a couple I made on my GitHub page or check out some commercially made files for an awesome Star Trek Starship Creator game.


One of the first PRONOM signatures I submitted was for a format I felt responsible for, considering where I worked. This is the GEDCOM format, which is an acronym for GEnealogical Data COMmunication. At the time I submitted the signature the format hadn’t been updated in years.

Very recently it has seen renewed interest from the genealogical community. In 2021 the format was updated with a Version 7 specification with the purpose of simplifying and clarifying the format. In addition, a new format was released to handle storing multimedia files in a container called GED-ZIP.

My first attempt at a signature was based on the specification generally, but with the new version released, I thought it might be good to revisit this format and see if we need to make any adjustments. There needs to be a new signature for the GED-ZIP format as well.

The original signature, fmt/851, created for PRONOM is:


It has an offset of 0-3 to account for any Unicode BOM, but starts with “0 HEAD”; this is the required start to a GEDCOM file. Next can come the source, the software which created the GEDCOM, using the tag “SOUR”, which can also include a version of the software and the name and address of the developer. This can take a bit of space so we include 0-1024 bytes for this information. The next tag is the subrecord of HEAD, “GEDC”, then its subrecord, “VERS”. Most GEDCOM validators will look for HEAD.GEDC.VERS for the version of GEDCOM the file claims to conform to. The hex values, (0D0A|0D|0A), are the line ending, accounting for the different systems that could write the GEDCOM.

A minimal GEDCOM version 5.5 would contain the following.

2 VERS 5.5

The end of the file is marked by the tag “TRLR” in reference to a Trailer. I didn’t include this in my initial signature, but probably should have.

GEDCOM files have been around a long time. The first draft was released in 1984, but the GEDCOM structure we see now really didn’t come along until version 3 in 1987, when the format was standardized and made public. The HEAD.GEDC.VERS wasn’t standardized until version 4. You can see the history here.

So moving forward we should probably have a new PUID for Version 3, Version 4, Version 5 and the new Version 7 and leave the existing signature as is.

Version 3 only requires the tags HEAD, SOUR, DEST and the ending TRLR.

BOF 302048454144(0D0A|0D|0A)3120534F5552{0-128}312044455354
EOF 302054524C52

Version 4 requires the HEAD.GEDC.VERS sequence.

BOF 302048454144{0-1024}47454443(0D0A|0D|0A)3220564552532034
EOF 302054524C52

Version 5 is similar.

BOF 302048454144{0-1024}47454443(0D0A|0D|0A)3220564552532035
EOF 302054524C52

Version 7 is also similar.

BOF 302048454144{0-1024}47454443(0D0A|0D|0A)3220564552532037
EOF 302054524C52
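To try the four proposed signatures against sample files without DROID or Siegfried, they can be approximated with regular expressions. This Python sketch is only a rough translation: the function name is hypothetical, only a UTF-8 byte order mark is allowed for, and trailing whitespace after “0 TRLR” is tolerated, which the EOF sequences above may not do.

```python
import re

EOL = rb"(?:\r\n|\r|\n)"          # same alternatives as (0D0A|0D|0A)
UTF8_BOM = rb"(?:\xef\xbb\xbf)?"  # assumption: only a UTF-8 BOM is handled

# "0 HEAD", up to 1024 bytes, "GEDC", a line ending, then "2 VERS" and a digit.
VERSIONED = re.compile(
    UTF8_BOM + rb"0 HEAD.{0,1024}?GEDC" + EOL + rb"2 VERS ([457])", re.DOTALL)
# Version 3: "0 HEAD", a line ending, "1 SOUR", up to 128 bytes, "1 DEST".
VERSION_3 = re.compile(
    UTF8_BOM + rb"0 HEAD" + EOL + rb"1 SOUR.{0,128}?1 DEST", re.DOTALL)
TRAILER = b"0 TRLR"


def gedcom_version(data: bytes):
    """Return 3, 4, 5 or 7 for a matching GEDCOM, otherwise None."""
    # Leniency not in the signatures: ignore whitespace after the trailer.
    if not data.rstrip().endswith(TRAILER):
        return None
    match = VERSIONED.match(data)
    if match:
        return int(match.group(1))
    if VERSION_3.match(data):
        return 3
    return None
```

Like the signatures, this only reads the major version digit, so 5.5 and 5.5.1 both come back as 5.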

For the new GED-ZIP format we need to create a container signature, as the format is a ZIP file but with a GEDCOM inside. The GED-ZIP specification states:

A GEDCOM ZIP file should:
• include exactly one GEDCOM file with the name “gedcom.ged”
• include all the multimedia objects referenced by that GEDCOM file
• not include unreferenced multimedia objects

Our Container signature would look like this:

<ContainerSignature Id="1000" ContainerType="ZIP">
	 <InternalSignature ID="300">
	  <ByteSequence Reference="BOFoffset">
	    <SubSequence Position="1" SubSeqMinOffset="0" SubSeqMaxOffset="3">
	      <Sequence>30 20 48 45 41 44</Sequence>
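The idea behind the container signature, a ZIP holding a “gedcom.ged” entry that starts with “0 HEAD” within the first few bytes, can be tried out with Python’s standard zipfile module. This is only a sketch with a hypothetical function name; it does not check the multimedia rules from the specification.

```python
import io
import zipfile


def looks_like_gedzip(data: bytes) -> bool:
    """Check for a ZIP holding a gedcom.ged entry that starts with "0 HEAD"."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            if "gedcom.ged" not in archive.namelist():
                return False
            with archive.open("gedcom.ged") as member:
                # 3 bytes of possible offset plus the 6 bytes of "0 HEAD".
                start = member.read(9)
    except zipfile.BadZipFile:
        return False
    # Allow an offset of 0-3 bytes, as in the signature fragment above.
    return 0 <= start.find(b"0 HEAD") <= 3
```

Reading only the first nine bytes of the entry keeps the check cheap even when the archive carries a lot of multimedia.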

I recently learned of a variation on the GEDCOM format which can cause a lot of confusion. The software Family Tree Maker could export to the GEDCOM format, but had a checkbox which, unchecked, allowed you to not abbreviate the tags. The tags in the GEDCOM format are expected just the way they are, which makes me wonder why they would do something so confusing. You can read more about this format here.

I was recently made aware a few of these rogue “GEDCOM” files were out there, in the wild, causing confusion during identification. My first thought was to loosen the signature a little to fit these variations, but then I discovered they are not GEDCOM files. In fact later versions of FTM forgot they did this and would error when you tried to import them back into the software. I think it would be wise to identify these FTM GEDCOM variants, just so one is aware of the difference and can then decide how to handle them properly.

The format was named “FTW TEXT”, so we can use that to name the new signature. Instead of “0 HEAD”, “0 HEADER” is used; instead of “1 SOUR”, “1 SOURCE” is used; and instead of “0 TRLR” at the end, “0 TRAILER” is used.

BOF 3020484541444552(0D0A|0D|0A)3120534F55524345
EOF 3020545241494C4552
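When writing signatures like these by hand it is easy to mistype a hex pair, so decoding the sequences back to text is a useful sanity check. A small Python sketch, with hypothetical labels for the three fixed parts of the FTW TEXT signature:

```python
# The fixed parts of the FTW TEXT signature, keyed by a descriptive label.
SIGNATURE_HEX = {
    "FTW TEXT header": "3020484541444552",
    "FTW TEXT source": "3120534F55524345",
    "FTW TEXT trailer": "3020545241494C4552",
}


def decode_hex(hex_string: str) -> str:
    """Turn a run of hex pairs into the ASCII text it represents."""
    return bytes.fromhex(hex_string).decode("ascii")


if __name__ == "__main__":
    for label, hex_string in SIGNATURE_HEX.items():
        print(f"{label}: {decode_hex(hex_string)!r}")
```

The same helper works for the GEDCOM sequences earlier in the post.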

It was fun to look back at this format and try and improve on it a bit. I learned more than I did when I initially wrote the signature and hopefully documented it well enough. The FTM variant was an interesting twist I was not expecting, which I am sure will show up again in the future. Take a look at the signatures and samples I updated and let me know what you think.

RIS Citation

Up until recently I was working in a Corporate archive preserving all sorts of content. The corporation throughout the years used many different software packages to produce all sorts of data. When I moved to an academic library I saw much of the same content, but there were some new file formats which I needed to document and manage. Many of those come from scholarly journals, theses, dissertations, and data sets for projects.

One format which I came across often, but which seems to be missing from the standard lists of known file formats, was the RefMan citation format. This format is a simple text based format which serves to standardize citations from scholarly sources. Created by Research Information Systems, the format uses the RIS extension and was used by ProCite and Reference Manager (RefMan). ISI ResearchSoft managed the format for a bit in the 1990’s, and this is where you can find most of the specifications.

Now that I am a little more familiar with the format I see it everywhere! Find any scholarly journal and there will usually be a “cite” feature to download the citation in a few formats, RIS being one of the most common.

Example: Theory and Craft of Digital Preservation

It can be called by a few names, mostly based on the systems which support it. You might see Ris (Zotero), or EasyBib, Mendeley, ProCite, Reference Manager, and others. But they all follow the same format.

The format is a simple plain text format, with codes which indicate the different field types and tags. The basic structure would look like this:

AU  - Owens, Trevor 
LA  - eng
PB  - Johns Hopkins University Press Baltimore, Maryland
CY  - Baltimore, Maryland
SN  - 9781421426976; 1421426978
PY  - 2018
TI  - The theory and craft of digital preservation
LK  -
ER  - 

The first tag must always be “TY” and the last tag “ER”. TY stands for Type of reference and ER stands for End of reference.

There are actually two versions of the format, this original specification and a later one which added some header information. You can download the full documentation here.

Provider: The name of the information provider (required)
Database: The name of the database (optional)
Tagformat: Name of the tag format used to identify fields (optional)
Content: media type for the body of the file (required)

Creation of a PRONOM signature for this text format is pretty straightforward. Looking for the TY and ER strings should be enough to ensure the format doesn’t clash with other text based formats. Text formats are notoriously difficult to identify, but when they have expected patterns it makes it a little easier. I had to add a little buffer at the beginning of the signature to allow for the newer header information, but more samples will be needed to see if this is enough to identify the format in all situations. Take a look and see if it works for you!
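The same TY and ER logic can be tried on sample files with a short Python sketch. This is not the PRONOM signature itself: the function name is hypothetical, the 1024 byte window for the optional header is an assumed value, and the tag layout (two letters, two spaces, a hyphen, a space) is taken from the example above.

```python
import re

# A "TY" line with a value, and an "ER" line with nothing after the hyphen.
TY_LINE = re.compile(rb"^TY  - \S", re.MULTILINE)
ER_LINE = re.compile(rb"^ER  -[ \t]*\r?$", re.MULTILINE)


def looks_like_ris(data: bytes, header_window: int = 1024) -> bool:
    """True when a TY tag appears near the start and an ER tag follows it."""
    # header_window is an assumed buffer for the newer Provider/Content header.
    first = TY_LINE.search(data[:header_window])
    if first is None:
        return False
    return ER_LINE.search(data, first.end()) is not None
```

A file holding several references will have many TY and ER pairs; this sketch only confirms the first pair.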

Beef & Babe’s

The 1990’s was an exciting time for Desktop Publishing. I got my first taste of design in the early 90’s with Aldus PageMaker. QuarkXPress was king in the commercial publishing world. For the most part designers and commercial printers used Macintosh computers, which QuarkXPress catered to. For those who could not afford the high prices, or used a PC, there were a few options. Microsoft Publisher, TimeWorks Publish It!, and Express Publisher were a few. There were many debates during that time on which software was the best.

I have submitted signatures to PRONOM for many of these:

Software and sample files for Express Publisher were proving elusive. Express Publisher was developed by Power Up, who had been developing a DOS version since the late 1980’s. At one point Power Up decided to sue QuarkXPress for the use of the name XPress. In 1991 Power Up sold all their assets to Spinnaker around the time they released the first Windows version of Express Publisher.

When I first took a look at some samples from the Windows version 1.0 of Express Publisher, the magic header looked familiar.

If it looks familiar to you, that is because it is similar to the famous, well nerd famous, JAVA Class file format.

The story goes that James Gosling needed a magic number for his new class format and was in a place they called Cafe Dead when he realized CAFE was a hex value. He soon used CAFEBABE and CAFED00D for his new formats. JAVA was released by Gosling in 1995 for SUN Microsystems.

File Format magic numbers are often used when designing a file to be used with software. Often times it is meant to be a sequence of hex values or a string indicating the file supported by certain software, this is more accurate than the simple extension at the end of most files. They are not required to be there, in fact there are a few formats which are difficult to identify as they don’t use this type of magic number in the header. To learn more read Ross Spencer’s post on magic numbers for digital preservation.

At first I thought it was some sort of homage to the JAVA Class format until I realized the Express Publisher file format was released 3 years earlier. Just a coincidence? I am sure whoever developed this format probably has an interesting story behind it.

These formats are not in PRONOM so let’s take a look at what is needed.

The document format for Express Publisher version 1 for Windows 3 uses the EWD extension, as well as EWT for templates. Magic numbers work best for a signature when they are at least 4 bytes long, this gives enough to have little chance of conflicting with another file format. So our PRONOM signature byte sequence would look like this:

  <ByteSequence Reference="BOFoffset">
    <SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="0" SubSeqMinOffset="0">
      <Shift Byte="CA">4</Shift>
      <Shift Byte="FE">3</Shift>
      <Shift Byte="BE">2</Shift>
      <Shift Byte="EF">1</Shift>
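The Shift values in that fragment appear to follow a simple pattern: each byte’s distance from the end of the sequence. That is an observation from the four lines above rather than a description of how the signature tooling actually builds them, but it is easy to reproduce in a Python sketch with a hypothetical helper:

```python
def shift_table(sequence: bytes) -> dict:
    """Map each byte value to its distance from the end of the sequence."""
    # If a byte repeats, the later (smaller) distance replaces the earlier one.
    return {byte: len(sequence) - index for index, byte in enumerate(sequence)}


if __name__ == "__main__":
    table = shift_table(bytes.fromhex("CAFEBEEF"))
    for byte, shift in table.items():
        print(f'<Shift Byte="{byte:02X}">{shift}</Shift>')
```

For “CAFEBEEF” this gives 4, 3, 2 and 1, the same values shown in the fragment.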

In looking at the earlier DOS versions of Express Publisher they used the extension EPD for a document and EPT for templates. I only have a few samples of version 2 and version 3, but they have different headers.

Versions 2 & 3 have consistent bytes starting at offset 4, version 2 using the string PAGES, and version 3 the string EP300. I will have to dig a little more to see if I can find some samples of version 1 to see how they compare and then should be able to submit a PRONOM signature for them.
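Putting the Windows magic number and the two DOS strings together, a first pass at telling the versions apart might look like the Python sketch below. The function name and the returned labels are hypothetical, and the DOS checks are based on only a few samples.

```python
WINDOWS_MAGIC = bytes.fromhex("CAFEBEEF")


def identify_express_publisher(data: bytes):
    """Return a rough label for an Express Publisher file, or None."""
    if data[0:4] == WINDOWS_MAGIC:
        # Other formats may share this magic number, so this is not conclusive.
        return "Express Publisher for Windows 1.0 (or another CAFEBEEF format)"
    tag = data[4:9]  # five bytes starting at offset 4
    if tag == b"PAGES":
        return "Express Publisher for DOS 2"
    if tag == b"EP300":
        return "Express Publisher for DOS 3"
    return None
```

The first four bytes of the DOS files are skipped here since only the bytes from offset 4 were consistent across the samples.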

For the time being, adding “CAFEBEEF” to PRONOM will be a good addition. I wonder if there are any other “CAFE” formats out there, if you know of any, let me know!

UPDATE – There is another format, AnFX Movie, which uses the magic header “CAFEBEEF”. More research is needed to distinguish the two formats.

Image PAC Files

I wouldn’t be surprised if you have never heard of an Image PAC file. You may know it by the more common name Kodak Photo CD Image. Kodak’s PhotoCD format actually refers to the system and Disc format used to store images for compatibility with other hardware. The Kodak PhotoCD format was pretty advanced for its time; its original purpose was to store scanned 35mm film to a disc which was playable on computers and other hardware. In fact, because it was meant to store 35mm rolls as they were scanned, it was the first use of the linked Multi-session CD format made standard by the Orange Book specification. The format was widely adopted at first, but eventually lost favor and was abandoned by 2004.

The Kodak PhotoCD format was also used on many commercial CD-ROM products. One example was the Corel Professional CD series. Below is a photo of a case of 200 CD’s I recently acquired. Each has around a hundred PCD images and viewing software on disc. Most discs can be viewed here. Or you can view their “Sampler” CD-ROM.

The actual PCD image file format was referred to as an Image PAC File. The format was unique in that it had multiple resolutions built into a single file. It also stored the raster data in a format called Photo YCC color encoding metric, developed by Kodak. This requires conversion to RGB for many uses. Adobe Photoshop for many years had a built-in import filter for the format, which included ICC profiles for properly converting the source to a destination colorspace, but support was dropped in CS3.

Photoshop Kodak PCD import

The Image PAC PCD format was a proprietary format which Kodak protected aggressively, even to the point of threatening legal action against those who attempted to reverse engineer the format. This frustrated developers and was probably part of the reason the format was abandoned. Of course this didn’t deter some curious developers, and the format was partially reverse engineered; that work is available in the NetPBM library, formerly known as PBMPlus. The tool hpcdtoppm was developed to convert PCD to PPM.

The trick in preserving older obsolete formats is to find a way to first identify them, gather significant properties, then migrate to a modern format if appropriate with minimal loss of data. Luckily most PCD files have the ASCII string “PCD_IPI” starting around offset 2048. This is basically how the PRONOM registry identifies the format and has assigned it fmt/211. Exiftool also supports the format in identifying some of the significant properties.

ExifTool Version Number         : 12.62
File Name                       : 136009.PCD
Directory                       : /Users/thorsted/Desktop/blog/Kodak/PCD
File Size                       : 3.6 MB
File Modification Date/Time     : 2023:06:23 10:48:55-06:00
File Access Date/Time           : 2023:06:26 23:43:50-06:00
File Inode Change Date/Time     : 2023:06:27 11:18:38-06:00
File Permissions                : -rwx------
File Type                       : PCD
File Type Extension             : pcd
MIME Type                       : image/x-photo-cd
Specification Version           : 0.6
Authoring Software Release      : 3.0
Image Magnification Descriptor  : 1.0
Create Date                     : 1993:09:20 07:35:34-06:00
Image Medium                    : Color reversal
Product Type                    : 116/01 SPD 0064  #00
Scanner Vendor ID               : KODAK
Scanner Product ID              : FilmScanner 2000
Scanner Firmware Version        : 2.21
Scanner Firmware Date           : 
Scanner Serial Number           : 0296
Scanner Pixel Size              : 0b.30 micrometers
Image Workstation Make          : Eastman Kodak
Character Set                   : 95 characters ISO 646
Photo Finisher Name             : HADWEN GRAPHICS
Scene Balance Algorithm Revision: 3.1
Scene Balance Algorithm Command : Neutral SBA On, Color SBA On
Scene Balance Algorithm Film ID : Unknown (131)
Copyright Status                : Restrictions apply
Copyright File Name             : RIGHTS.USE
Orientation                     : Horizontal (normal)
Image Width                     : 3072
Image Height                    : 2048
Compression Class               : Class 1 - 35mm film; Pictoral hard copy
Image Size                      : 3072x2048
Megapixels                      : 6.3

Exiftool is able to gather many of the important properties, including an original creation date and the pixel dimensions. It would be nice if it was able to list each of the resolution options, as some later Pro versions of PCD had a 64 Base level for resolutions of 4096 x 6144.
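The “PCD_IPI” identification described earlier is also simple to reproduce for a quick scan of a disc. In this Python sketch the function name and the slack parameter are hypothetical; slack is there only because the string was described as starting around offset 2048 rather than exactly at it.

```python
PCD_MARKER = b"PCD_IPI"
PCD_MARKER_OFFSET = 2048


def looks_like_pcd(data: bytes, slack: int = 0) -> bool:
    """True when PCD_IPI is found at offset 2048, or within slack bytes after it."""
    end = PCD_MARKER_OFFSET + len(PCD_MARKER) + slack
    return PCD_MARKER in data[PCD_MARKER_OFFSET:end]
```

With the default slack of 0 the marker has to sit exactly at offset 2048.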

Migration to a more modern open format is a common preservation strategy. The National Archives and Records Administration has the format NF00224 listed as needing to migrate to JPG, while others prefer migration to TIFF. Others have learned valuable lessons attempting to find the right method for migration. There is a right way and a wrong way as the Center for Digital Archaeology learned. The easiest method is to use the popular ImageMagick command-line tool.

thorsted$ identify 136009.PCD 
136009.PCD PCD 768x512 768x512+0+0 8-bit YCC 3.44727MiB 0.020u 0:00.006
thorsted$ convert 136009.PCD[5] -colorspace sRGB +compress 136009.tif
thorsted$ identify 136009.tif
136009.tif TIFF 3072x2048 3072x2048+0+0 8-bit sRGB 18.0004MiB 0.000u 0:00.000

ImageMagick, along with most other tools like IrfanView and XnView, only sees the base resolution of 768 x 512. Adding “[5]” after the filename forces the conversion to use the fifth level, the 16 Base resolution, which is the highest resolution in most PCD files; the Pro versions may go higher. The other issue is the colorspace conversion. It is known there could be a loss of highlights. This webpage illustrates different tools and the issues with highlights. You can see the difference if I use -colorspace RGB instead of sRGB.
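As a rough aid for picking the index, here is a small Python sketch that maps the bracketed scene number to pixel dimensions. Only the 768 x 512 base and the 3072 x 2048 result for “[5]” are confirmed by the output above. The rest assumes each level simply doubles the one before it and that scene 3 is the base level. The function name is made up for illustration.

```python
# Base resolution reported by `identify` in the output above.
BASE_WIDTH, BASE_HEIGHT = 768, 512
BASE_SCENE = 3  # assumption: ImageMagick scene [3] is the Base level


def pcd_scene_size(scene):
    """Return (width, height) for a PCD scene number from 1 to 6."""
    if not 1 <= scene <= 6:
        raise ValueError("PCD scene number must be between 1 and 6")
    shift = scene - BASE_SCENE
    if shift >= 0:
        # Each level above Base doubles both dimensions.
        return BASE_WIDTH << shift, BASE_HEIGHT << shift
    # Each level below Base halves both dimensions.
    return BASE_WIDTH >> -shift, BASE_HEIGHT >> -shift


for scene in range(1, 7):
    print(scene, pcd_scene_size(scene))
```

Scene 6 would only exist in the Pro files, so asking for it on a regular PCD will most likely not give you anything useful.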

ImageMagick conversion using RGB vs sRGB colorspace setting.

Other tools such as the open source pcdtojpeg and paid pcdMagic both work well, but the only tool I have tested so far which keeps the original metadata is pcdMagic.

ExifTool Version Number         : 12.62
File Name                       : 136009_1.tif
Directory                       : .
File Size                       : 38 MB
File Modification Date/Time     : 2023:06:27 12:06:26-06:00
File Access Date/Time           : 2023:06:27 12:06:29-06:00
File Inode Change Date/Time     : 2023:06:27 12:06:27-06:00
File Permissions                : -rw-r--r--
File Type                       : TIFF
File Type Extension             : tif
MIME Type                       : image/tiff
Exif Byte Order                 : Little-endian (Intel, II)
Subfile Type                    : Full-resolution image
Image Width                     : 3072
Image Height                    : 2048
Bits Per Sample                 : 16 16 16
Compression                     : Uncompressed
Photometric Interpretation      : RGB
Image Description               : color reversal: Unknown film. SBA settings neutral SBA on, color SBA on
Make                            : KODAK
Camera Model Name               : FilmScanner 2000
Strip Offsets                   : 1622
Samples Per Pixel               : 3
Rows Per Strip                  : 2048
Strip Byte Counts               : 37748736
Planar Configuration            : Chunky
Software                        : pcdMagic V1.4.19
Modify Date                     : 2023:06:27 12:06:26
Copyright                       : Copyright restrictions apply - see copyright file on original CD-ROM for details
Exif Version                    : 0231
Date/Time Original              : 1993:09:20 07:35:34
Create Date                     : 1993:09:20 07:35:34
Offset Time                     : -06:00
User Comment                    : color reversal: Unknown film. SBA settings neutral SBA on, color SBA on
Color Space                     : Uncalibrated
File Source                     : Film Scanner
Profile CMM Type                : Unknown (KCMS)
Profile Version                 : 2.1.0
Profile Class                   : Display Device Profile
Color Space Data                : RGB
Profile Connection Space        : XYZ
Profile Date Time               : 1998:12:01 18:58:21
Profile File Signature          : acsp
Primary Platform                : Microsoft Corporation
CMM Flags                       : Not Embedded, Independent
Device Manufacturer             : Kodak
Device Model                    : ROMM
Device Attributes               : Reflective, Glossy, Positive, Color
Rendering Intent                : Perceptual
Connection Space Illuminant     : 0.9642 1 0.82487
Profile Creator                 : Kodak
Profile ID                      : 0
Profile Copyright               : Copyright (c) Eastman Kodak Company, 1999, all rights reserved.
Profile Description             : ProPhoto RGB
Media White Point               : 0.9642 1 0.82489
Red Tone Reproduction Curve     : (Binary data 14 bytes, use -b option to extract)
Green Tone Reproduction Curve   : (Binary data 14 bytes, use -b option to extract)
Blue Tone Reproduction Curve    : (Binary data 14 bytes, use -b option to extract)
Red Matrix Column               : 0.79767 0.28804 0
Green Matrix Column             : 0.13519 0.71188 0
Blue Matrix Column              : 0.03134 9e-05 0.82491
Device Mfg Desc                 : KODAK
Device Model Desc               : Reference Output Medium Metric(ROMM)
Make And Model                  : (Binary data 40 bytes, use -b option to extract)
Image Size                      : 3072x2048
Megapixels                      : 6.3
Modify Date                     : 2023:06:27 12:06:26-06:00

There is a way to convert the PCD to TIF using ImageMagick, then using Exiftool to map some of the metadata over to the new TIFF file. It would look something like this:

exiftool -addtagsfromfile 136009.PCD '-EXIF:DateTimeOriginal<PhotoCD:CreateDate' '-EXIF:CreateDate<PhotoCD:CreateDate' '-ExifIFD:SerialNumber<PhotoCD:ScannerSerialNumber' '-ExifIFD:ExifImageWidth<PhotoCD:ImageWidth' '-ExifIFD:ExifImageHeight<PhotoCD:ImageHeight' '-IFD0:Make<PhotoCD:ScannerVendorID' '-IFD0:Model<PhotoCD:ScannerProductID' '-IFD0:Orientation<PhotoCD:Orientation' '-IFD0:Copyright<PhotoCD:CopyrightStatus' 136009.tif
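For more than a handful of files, the same command could be wrapped in a short script. This is only a sketch: it assumes exiftool is on the PATH, that every TIFF sits next to its PCD with the same base name, and that the extensions are exactly .PCD and .tif. The function names are mine, and the tag mappings are copied from the command above.

```python
import subprocess
from pathlib import Path

# Tag mappings copied from the exiftool command above (destination<source).
TAG_MAP = [
    "-EXIF:DateTimeOriginal<PhotoCD:CreateDate",
    "-EXIF:CreateDate<PhotoCD:CreateDate",
    "-ExifIFD:SerialNumber<PhotoCD:ScannerSerialNumber",
    "-ExifIFD:ExifImageWidth<PhotoCD:ImageWidth",
    "-ExifIFD:ExifImageHeight<PhotoCD:ImageHeight",
    "-IFD0:Make<PhotoCD:ScannerVendorID",
    "-IFD0:Model<PhotoCD:ScannerProductID",
    "-IFD0:Orientation<PhotoCD:Orientation",
    "-IFD0:Copyright<PhotoCD:CopyrightStatus",
]


def build_exiftool_command(pcd_path, tiff_path):
    """Return the exiftool argument list for one PCD/TIFF pair."""
    # Passing a list to subprocess avoids the shell, so no quoting is needed.
    return ["exiftool", "-addtagsfromfile", str(pcd_path), *TAG_MAP, str(tiff_path)]


def copy_metadata_for_folder(folder):
    """Run exiftool for every .PCD in folder that has a matching .tif."""
    for pcd in sorted(Path(folder).glob("*.PCD")):
        tiff = pcd.with_suffix(".tif")
        if tiff.exists():
            subprocess.run(build_exiftool_command(pcd, tiff), check=True)
```

Calling copy_metadata_for_folder(".") after the ImageMagick conversion would then update each TIFF in place.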

JPG Structure

If you haven’t been over to see the posters made by Ange Albertini, head over now. Below is his poster on the JPG image file format. This is the basic JFIF file format, which stands for JPEG File Interchange Format. There are also raw JPEG streams and Exif, Exchangeable Image File Format.

The basic format is pretty straightforward. There is a start of image marker FFD8, some format information, the compressed raster data, and then an end of image marker FFD9. Identification of a JPEG file should be pretty straightforward too. Knowing the start and end marker values, and then the type of JPEG based on the Application data, identification can be very specific. That is, until some software engineers start playing fast and loose with the format specifications.

A while back I received a JPG file which didn’t identify using the latest PRONOM signature. It has happened before: some new phones came out and started using a newer version of the Exif specification, so I submitted an update to PRONOM for JPGs using Exif 2.3 and greater. I may also need to submit another signature soon for the newly released Exif 3.0 specification! But this JPG I received wasn’t a new version; it should have been identified with the current PRONOM signature. It started with FFD8, but when I went to look at the end of the file for the end of image marker FFD9, it wasn’t where I expected it to be.

This JPG file had an additional 9632 bytes after the FFD9 end of image marker. But why? The image rendered just fine in multiple JPG viewers. The only warning from Exiftool was for “Unrecognized MakerNotes”, which is not too uncommon. So I went to the JPG Exif specification.

EOI, Recording this marker is mandatory. It shall be recorded in this position.

But reading a little further we see…..

Moreover, Exif/DCF readers should be implemented to operate without interruption even if certain kinds of data have been recorded after EOI of the primary image defined in the Exif standard. Specifically, unknown data after EOI of the primary image should be skipped. (see section 4.7.1)

So the extra data is allowed by specification. Any readers should ignore or skip any data after the EOI (End of Image). Well that makes identification more difficult. All the PRONOM signatures are based on having the EOI marker at the “End”. Some have allowance for padding, but not enough for the worst offenders……

The image referenced above was created on a Huawei MHA-L29 cameraphone. But since finding this image, I have also found many Samsung phones do the same thing. Here is one from a Samsung SM-G975U1. Much less padding but enough to throw off identification.

Apple iPhones are not exempt from this “feature” either. When using the MacOS ImageCapture tool with the HEIC format, a bug can add an excessive amount of empty data at the end of the converted JPG file.

So, when it comes to identification, if your JPG files don’t seem to identify correctly, look closer at the end of the file, it may have some “extra” data.
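A quick way to check is to count the bytes after the last FFD9 in the file. The Python sketch below does just that. It is a heuristic, not a JPEG parser: it uses the last FFD9 it finds, so if the trailing data happens to contain FFD9 bytes of its own, the count will be too low. Plain zero padding is counted correctly. The function names are mine.

```python
SOI = b"\xff\xd8"  # start of image marker
EOI = b"\xff\xd9"  # end of image marker


def bytes_after_eoi(data):
    """Return the number of bytes after the last EOI marker.

    Returns None if the data does not start with SOI or has no EOI at all.
    """
    if not data.startswith(SOI):
        return None
    last_eoi = data.rfind(EOI)
    if last_eoi == -1:
        return None
    return len(data) - (last_eoi + len(EOI))


def report(path):
    """Print a one-line summary of trailing data for the file at path."""
    with open(path, "rb") as handle:
        extra = bytes_after_eoi(handle.read())
    if extra is None:
        print(f"{path}: does not look like a complete JPEG")
    elif extra:
        print(f"{path}: {extra} bytes after the last EOI marker")
    else:
        print(f"{path}: ends with the EOI marker")
```

If the number is non-zero, open the file in a hex editor and look at what follows the marker before deciding whether it is padding or something the phone put there on purpose.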

What’s the 411?

I am dating myself by using the phrase “What’s the 411?” Back in my day (before the Googles), if you wanted quick information you could pick up the “land line”, a corded phone in your home which could only make phone calls, and dial 4-1-1. You would be connected to an operator who could help you locate businesses, tell you the time, answer simple questions, and was infinitely smarter than Alexa.

Around the same time I was using 4-1-1 to answer all my questions, digital cameras were just coming on the scene. One of those was the Sony Mavica line of digital cameras. They were unique in that they used a floppy disk as the storage media. They had a small LCD screen for capture and playback of the captured images. In order to quickly preview the images captured on disk, the camera generates a hidden thumbnail file for each image, with the extension .411. When I first saw this file while copying a floppy from one of my Mavica cameras, it reminded me of the old information line. I first assumed it was a metadata file, as the first few Mavica cameras did not use EXIF in their files, but each one is simply a 64×48 pixel raster image. Of course Sony did not document this file format and probably hoped no one would notice, as the files are hidden on the FAT12 formatted floppy disk.

Video showing index of floppy disk.

One could argue the value of documenting and possibly identifying thumbnail formats, as many in digital preservation have chosen not to keep the Thumbs.db file or other hidden files not meant to be preserved or accessible to the user. I have found that documenting any format found through technical appraisals provides value to everyone. Some may ultimately decide not to keep such formats in their repository, but knowing what they are is vital to the process. Come listen and chat with me about this topic at iPres 2023!

Usually the first part of documenting a format is looking for specifications online or documented somewhere. Since Sony did not publicly release any specifications for this format, we have to rely on others’ reverse engineering or do it ourselves. There have been a few attempts to document a conversion of the 411 format to a common raster format: C code for conversion to BMP, converters to NetPBM formats like PPM, and the Java “Javica” software, which makes use of the 411 files. My first step was to see if I could find some common patterns in the many samples I have from my Mavica collection. Running Marco Pontello’s TrIDScan across my 54 samples came up with no common patterns. This was expected, as all the reverse engineering efforts point out the format is probably based on the CCIR.601 specification, which is a digital video encoding standard based on frames.

With no common patterns among all the samples, creating a PRONOM signature is not possible. In the future, file identification may be based more on dynamic pattern matching instead of the static patterns we look for now. Until then, this may need to be submitted as an extension-only entry. Two things to note: the files created by the camera are all named starting with “MVC”, which could also be used for identification, and every .411 file is exactly 4608 bytes. The extension .411 is also pretty unique, so I doubt it will clash with any other format for the moment.
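Without a byte signature, those three hints (extension, name prefix, file size) can still be combined into a simple check. The Python sketch below is only that, a heuristic built from what I see in my own samples. The function name is mine, and matching the prefix without regard to case is an assumption.

```python
from pathlib import Path

THUMB_WIDTH, THUMB_HEIGHT = 64, 48   # thumbnail dimensions noted above
EXPECTED_SIZE = 4608                 # every sample is exactly this many bytes


def looks_like_mavica_411(path):
    """Heuristic check: .411 extension, MVC name prefix, 4608 bytes."""
    path = Path(path)
    return (
        path.suffix == ".411"
        and path.name.upper().startswith("MVC")  # assumption: ignore case
        and path.stat().st_size == EXPECTED_SIZE
    )


# 4608 bytes spread over 64 x 48 pixels works out to 12 bits per pixel.
bits_per_pixel = EXPECTED_SIZE * 8 / (THUMB_WIDTH * THUMB_HEIGHT)
print(bits_per_pixel)  # 12.0
```

The 12 bits per pixel is at least consistent with the reverse engineering efforts, which treat the data as subsampled video-style color rather than plain 24-bit RGB.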

Corel ArtShow

File extensions are the easiest way to quickly identify a file format, but they can be misleading. This is why format identification tools like DROID are important in digital preservation: they look closer at the file structure to identify formats more accurately. The other complication is that some extensions are used for more than one format. Extensions like .DOC or .ISO can be used with many formats.

The PRONOM registry which DROID uses will list extensions associated with each format signature, but for some, they only have an extension and no signature. It’s nice to have an official ID to go with a format but with no signature it only matches based on extension.

This caused a problem for me a while back while working with some files with the extension CDX. According to PRONOM, there are 5 completely different formats which use the extension, and there are probably others.

My CDX was related to some indexing software called Cindex. At the time the only format with a signature was for the WARC summary file CDX. The other was for a CorelDraw Compressed format with no signature. Confusing right? When I would run format identification on my Cindex files, they would default to the CorelDraw Compressed format, identified by extension. It was easy enough to create a signature for the Cindex format as I had enough samples to know the patterns needed for correct identification. But I was curious about the CorelDraw format. Should be easy to find, right?

Wrong. A sample of this format proved very elusive. All I had to go by was the name given to the format by PRONOM and the extension. I scoured every Corel CD and image I could get my hands on. For months I looked and could never find a single CDX file. None of the CorelDraw versions I was able to run could save in the CDX format. I scoured clipart discs and other Corel software like Designer, PrintHouse, and Photo-Paint: nada, nothing. I started to wonder if the format even existed. That’s when I noticed, in the filters included with CorelDraw, a reference to the ability to import a CDX but not write to one.

Description=CorelDRAW Compressed (CDX)
FilterFullName=CorelDRAW Import Filter
Version=Version 6.00
Company=Corel Corporation
Copyright=Copyright © 1988-1995 Corel Corporation

This led to me finding a reference on the old Corel FTP site for knowledge base number 4550.

It mentioned something called ArtShow, where version 5 supported the file format CDX. ArtShow was a gallery of winning designs released on a CD-ROM and in a book each year. The first was ArtShow 91, followed by ArtShow 3, 4, 5, 6, and finally 7. Each release used a different proprietary compressed format for storing the designs, and these formats exist nowhere else. The question remains: why didn’t they use other popular Corel formats like CDR, CMX, or CCX, which were used on many other clip art titles?

It took some time, but I was finally able to find copies of a few of the ArtShow CD-ROM discs, in particular numbers 5 and 6, which had the CDX format and the second generation CPX format.

Each format had an easy to recognize header, making a PRONOM signature easy to create. PRONOM already had PUIDs for the two formats, CDX and CPX, so I sent in the signatures to be added to the registry, which hopefully will help distinguish between all the CDX formats!