RIS Citation

Up until recently I was working in a corporate archive, preserving all sorts of content. Over the years the corporation used many different software packages to produce all sorts of data. When I moved to an academic library I saw much of the same content, but there were some new file formats which I needed to document and manage. Many of those come from scholarly journals, theses, dissertations, and data sets for projects.

One format which I came across often, but which seems to be missing from the standard lists of known file formats, is the RefMan citation format. This is a simple text-based format which serves to standardize citations from scholarly sources. Created by Research Information Systems, the format uses the RIS extension and was used by ProCite and Reference Manager (RefMan). ISI ResearchSoft managed the format for a bit in the 1990s, and this is where you can find most of the specifications.

Now that I am a little more familiar with the format I see it everywhere! Find any scholarly journal and there will usually be a “cite” feature to download the citation in a few formats, RIS being one of the most common.

Example: Theory and Craft of Digital Preservation

It can be called by a few names, mostly based on the systems which support it. You might see Ris (Zotero), or EasyBib, Mendeley, ProCite, Reference Manager, and others. But they all follow the same format.

The format is a simple plain text format, with codes which indicate the different field types and tags. The basic structure looks like this:

TY  - BOOK
AU  - Owens, Trevor 
LA  - eng
PB  - Johns Hopkins University Press Baltimore, Maryland
CY  - Baltimore, Maryland
SN  - 9781421426976; 1421426978
PY  - 2018
TI  - The theory and craft of digital preservation
LK  - https://worldcat.org/title/1030899528
ER  - 

The first tag must always be “TY” and the last tag “ER”. TY stands for Type of reference and ER stands for End of reference.
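To make the tag structure concrete, here is a minimal Python sketch of a reader for records like the one above. It assumes every field sits on its own line as a two-character tag, two spaces, a hyphen, and a value; the function name `parse_ris` and the decision to skip any line that does not match are my own choices for illustration, not something taken from the specification.

```python
import re

# Two-character tag, two spaces, a hyphen, then an optional space and value.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  -(?: (.*))?$")

def parse_ris(text):
    """Split RIS text into records; each record is a list of (tag, value) pairs."""
    records, current = [], None
    for raw in text.splitlines():
        match = TAG_LINE.match(raw.rstrip())
        if not match:
            continue  # header lines, blank lines and wrapped text are skipped here
        tag, value = match.group(1), (match.group(2) or "").strip()
        if tag == "TY":
            current = [(tag, value)]   # TY opens a new reference
        elif current is not None:
            if tag == "ER":
                records.append(current)  # ER closes the reference
                current = None
            else:
                current.append((tag, value))
    return records
```

Calling `parse_ris` on the example above should give a single record whose first pair is `("TY", "BOOK")`.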

There are actually two versions of the format: the original specification and a later one which added some header information. You can download the full documentation here.

Provider: The name of the information provider (required)
Database: The name of the database (optional)

Tagformat: Name of the tag format used to identify fields (optional)
Content: media type for the body of the file (required)

Creating a PRONOM signature for this text format is pretty straightforward. Looking for the TY and ER strings should be enough to ensure the format doesn’t clash with other text-based formats. Text formats are notoriously difficult to identify, but when they have expected patterns it is a little easier. I had to add a little buffer at the beginning of the signature to allow for the newer header information, but more samples will be needed to see if this is enough to identify the format in all situations. Take a look and see if it works for you!
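The same idea can be sketched outside of PRONOM as a rough Python check: look for a TY line near the start, allowing some room for the optional header, and then for an ER line somewhere after it. The 512-byte allowance and the name `looks_like_ris` are assumptions for illustration only; they are not the values used in the actual signature.

```python
import re

TY_LINE = re.compile(rb"^TY  - \S", re.MULTILINE)
ER_LINE = re.compile(rb"^ER  -[ \t]*\r?$", re.MULTILINE)

def looks_like_ris(data, header_allowance=512):
    """Return True if a TY line starts within the allowance and an ER line follows it."""
    # Only the first bytes are searched for TY, so header lines such as
    # "Provider:" may precede it, but arbitrary long text may not.
    first_ty = TY_LINE.search(data[:header_allowance])
    if first_ty is None:
        return False
    return ER_LINE.search(data, first_ty.end()) is not None
```

A check like this is only a heuristic. Like the signature, it needs to be tried against more samples before trusting it.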

MP4 & 360

Recently I have been exploring the MP4 format, more specifically the ISO Base Media File Format. It appears to be quite a versatile format, based on the general Box/Atom structure. I don’t mean to go much into the format here, as there are so many formats which use this structure, from QuickTime MOV and JPEG 2000 to the more recent Canon RAW CR3. I have also been digging into the DASH MP4 format, but we’ll save that for a later time.

One of the more interesting uses of MP4 lately is 360 or spherical video. These videos are becoming more and more popular with content creators and are also used for mapping, as in Google Street View.

A while back I picked up an Insta360 Nano S camera. It attached directly to my iPhone. With a camera on each side it could capture images and video which could later be processed to produce some interesting results.

Of course it needs to be processed first so it doesn’t look like you are peering out of your peephole. Insta360 provides software for you to process the video into a regular video or some fun creative spherical video that makes you look like you are walking on a small globe.

The formats produced by the Insta360 Nano S are plain old JPG and MP4, but they use the extensions .INSP and .INSV respectively, neither of which is documented in PRONOM yet. But because of the nature of 360 cameras there is a little more under the hood. If you would like to look at some samples you can find some here.

The INSP file begins like any other EXIF JPEG file, but ends with a little additional info.

The 360 cameras have some additional information from the different gyros and accelerometers, as well as GPS information. The INSP file stores much of this information after the end of the JPG data. You can also see a string of alphanumeric characters at the end, which is consistent across most of the files I have seen. One Python parser of the additional data calls it the magic number. “8db42d694ccc418790edff439fe026bf” would make a good pattern for a signature.
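As a rough illustration of how that pattern could be used, the Python sketch below checks whether a file starts like a JPG and carries the magic string in its last bytes. It assumes the string is stored as plain ASCII text, as it appears in a hex viewer; the 64-byte tail window and the function name `has_insta360_trailer` are arbitrary choices of mine.

```python
INSTA360_MAGIC = b"8db42d694ccc418790edff439fe026bf"

def has_insta360_trailer(data, tail_window=64):
    """Return True if the data starts with the JPEG SOI marker and the
    magic string appears within the final tail_window bytes."""
    return data.startswith(b"\xff\xd8") and INSTA360_MAGIC in data[-tail_window:]
```

Reading a sample with `open(path, "rb").read()` and passing the bytes in is enough to try it.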

The INSV files are similar, except they use the MP4 base media format.

MediaInfo indeed sees the file as an MPEG-4 with an AVC codec, but with an invalid extension.

Complete name                            : VID_20210222_170428_005.insv
Format                                   : MPEG-4
Format profile                           : JVT
Codec ID                                 : avc1 (avc1/isom)
File size                                : 41.1 MiB
Duration                                 : 7 s 608 ms
Overall bit rate mode                    : Variable
Overall bit rate                         : 45.4 Mb/s
Encoded date                             : UTC 2021-02-22 17:04:18
Tagged date                              : UTC 2021-02-22 17:04:18
IsTruncated                              : Yes
FileExtension_Invalid                    : braw mov mp4 m4v m4a m4b m4p m4r 3ga 3gpa 3gpp 3gp 3gpp2 3g2 k3g jpm jpx mqv ismv isma ismt f4a f4b f4v

In addition to a video and audio track, there is a text track.

Text
ID                                       : 3
Format                                   : Timed Text
Codec ID                                 : text
Duration                                 : 7 s 600 ms
Bit rate mode                            : Constant
Bit rate                                 : 240 b/s
Frame rate                               : 10.000 FPS
Stream size                              : 228 Bytes (0%)
Title                                    : Ambarella EXT
Language                                 : English
Forced                                   : No
Encoded date                             : UTC 2021-02-22 17:04:18
Tagged date                              : UTC 2021-02-22 17:04:18

With a little Exiftool magic, thank you Phil, we can see some of the extra data within the video file.

Serial Number                   : ISS2418ND7XH4H
Model                           : Insta360 Nano S
Firmware                        : v1.17.12.3_build1
Parameters                      : 2 947.866 946.388 964.646 0.000 0.000 90.000 942.993 2891.656 952.520 -0.682 -1.501 89.186 3840 1920 1040
Preview Image                   : (Binary data 578944 bytes, use -b option to extract)
Time Code                       : 62.155
Accelerometer                   : 0.0717358812689781 0.837667405605316 -0.541449248790741
Angular Velocity                : -0.00380666344426572 -0.0143540045246482 0.0170918852090836

Thanks to tools like Exiftool and MediaInfo we can take a peek into some of these formats. New ways of using existing formats, and entirely new formats, keep popping up, making it hard to know exactly what you have. Initially I just assumed the Insta360 formats didn’t need anything extra as they just used well-known formats with their own extensions, but I needed to look a little closer. Many other cameras are now putting additional data at the end of a standard JPG. It will be interesting to see what new ideas camera developers come up with in the coming years.

GoPro has a 360 camera as well and looking at a sample .360 file, I can see it also uses an MP4 base media format, but uses two video tracks to store video from the two cameras. Might need to dig into that format soon as well.

Hemera Photo-Object

Many years ago I dabbled in a little graphic design. Working for a commercial printer in the pre-press area, I was very familiar with all things graphics, but never had a great talent for design, especially drawing. I often needed a random piece of clip art for a design I was working on, so I purchased Hemera’s The Big Box of Art, probably from my local CompUSA, if that dates me.

Hemera Big Box of Art

The cool thing about clip art from Hemera is that it was not in your usual JPG or TIFF format; it was in a special Photo-Object format. This format included the raster image, but also included a mask or alpha channel for the main object. They marketed this format as an alternative to the sometimes larger formats of the day. GIF files didn’t have the color depth and PNG was still new, so Hemera was probably hoping this format would be the next great thing to happen to clip art.

A Hemera Photo-Object has the extension HPI. Let’s take a closer look at a file and see what is under the hood. I pulled this file from Disc 1 on Archive.org.

The HPI file has a unique header which should make identification really easy. But what do we see starting at offset 32? A JFIF! Just after a 32-byte header the file has a standard JPG file hidden inside. Now a standard JPG file does not have the ability to support an alpha channel, so there must be something else within to mask this file. Let’s look for the end of image (EOI) marker of the JPG format.

Well, well, well. It appears the JPG file is then followed by a standard PNG! Sneaky. The entire HPI file is a 32 byte HPI header, a JPG, followed by a PNG. One could easily carve out each of the formats and save as separate files if needed. There is a script you can use to do this for you, written by Ed Halley. The original Hemera software won’t run on modern systems.
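For anyone curious what such carving might look like, here is a minimal Python sketch. It is not Ed Halley’s script. It simply assumes the layout described above (a 32-byte header, a JPG, then a PNG) and splits at the first PNG signature found after the header. The function name `split_hpi` is hypothetical, and the sketch does not try to interpret the HPI header itself.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
HPI_HEADER_SIZE = 32  # header length observed in the sample above

def split_hpi(data):
    """Return (jpeg_bytes, png_bytes) carved from the bytes of an HPI file."""
    if data[HPI_HEADER_SIZE:HPI_HEADER_SIZE + 2] != b"\xff\xd8":
        raise ValueError("no JPEG start-of-image marker at offset 32")
    # Assumption: the first PNG signature after the header marks the mask image.
    png_start = data.find(PNG_SIGNATURE, HPI_HEADER_SIZE)
    if png_start == -1:
        raise ValueError("no PNG signature found after the JPEG")
    return data[HPI_HEADER_SIZE:png_start], data[png_start:]
```

The two byte strings can then be written out with `.jpg` and `.png` extensions and opened in any viewer.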

Hemera had a good run for about 10 years before selling off their assets in 2004 to another stock image company. At one point Hemera even purchased the rights to all of Corel’s Premium photo library which I covered in my article about the Kodak PhotoCD format.

Image PAC Files

I wouldn’t be surprised if you have never heard of an Image PAC file. You may know it by the more common name, Kodak Photo CD Image. Kodak’s PhotoCD format actually refers to the system and disc format used to store images for compatibility with other hardware. The Kodak PhotoCD format was pretty advanced for its time; its original purpose was to store scanned 35mm film on a disc which was playable on computers and other hardware. In fact, because it was meant to store 35mm rolls as they were scanned, it was the first use of the linked multi-session CD format made standard by the Orange Book specification. The format was widely adopted at first, but eventually lost favor and was abandoned by 2004.

The Kodak PhotoCD format was also used on many commercial CD-ROM products. One example was the Corel Professional CD series. Below is a photo of a case of 200 CDs I recently acquired. Each has around a hundred PCD images and viewing software on disc. Most discs can be viewed here. Or you can view their “Sampler” CD-ROM.

The actual PCD image file format was referred to as an Image PAC File. The format was unique in that it has multiple resolutions built into a single file. It also stored the raster data in a format called the Photo YCC color encoding metric, developed by Kodak. This requires conversion to RGB for many uses. For many years Adobe Photoshop had a built-in import filter for the format, which included ICC profiles for properly converting the source to a destination colorspace, but support was dropped in CS3.

Photoshop Kodak PCD import

The Image PAC PCD format was a proprietary format which Kodak protected aggressively, even to the point of threatening legal action against those who attempted to reverse engineer it. This frustrated developers and was probably part of the reason the format was abandoned. Of course this didn’t deter some curious developers: the format was partially reverse engineered, and the result is available in the NetPBM library, formerly known as PBMPlus. The tool hpcdtoppm was developed to convert PCD to PPM.

The trick in preserving older obsolete formats is to find a way to first identify them, gather significant properties, then migrate to a modern format if appropriate with minimal loss of data. Luckily most PCD files have the ASCII string “PCD_IPI” starting around offset 2048. This is basically how the PRONOM registry identifies the format and has assigned it fmt/211. Exiftool also supports the format in identifying some of the significant properties.

ExifTool Version Number         : 12.62
File Name                       : 136009.PCD
Directory                       : /Users/thorsted/Desktop/blog/Kodak/PCD
File Size                       : 3.6 MB
File Modification Date/Time     : 2023:06:23 10:48:55-06:00
File Access Date/Time           : 2023:06:26 23:43:50-06:00
File Inode Change Date/Time     : 2023:06:27 11:18:38-06:00
File Permissions                : -rwx------
File Type                       : PCD
File Type Extension             : pcd
MIME Type                       : image/x-photo-cd
Specification Version           : 0.6
Authoring Software Release      : 3.0
Image Magnification Descriptor  : 1.0
Create Date                     : 1993:09:20 07:35:34-06:00
Image Medium                    : Color reversal
Product Type                    : 116/01 SPD 0064  #00
Scanner Vendor ID               : KODAK
Scanner Product ID              : FilmScanner 2000
Scanner Firmware Version        : 2.21
Scanner Firmware Date           : 
Scanner Serial Number           : 0296
Scanner Pixel Size              : 0b.30 micrometers
Image Workstation Make          : Eastman Kodak
Character Set                   : 95 characters ISO 646
Photo Finisher Name             : HADWEN GRAPHICS
Scene Balance Algorithm Revision: 3.1
Scene Balance Algorithm Command : Neutral SBA On, Color SBA On
Scene Balance Algorithm Film ID : Unknown (131)
Copyright Status                : Restrictions apply
Copyright File Name             : RIGHTS.USE
Orientation                     : Horizontal (normal)
Image Width                     : 3072
Image Height                    : 2048
Compression Class               : Class 1 - 35mm film; Pictoral hard copy
Image Size                      : 3072x2048
Megapixels                      : 6.3

Exiftool is able to gather many of the important properties, including the original creation date and the pixel dimensions. It would be nice if it were able to list each of the resolution options, as some later Pro versions of PCD had a 64 Base resolution of 4096 x 6144.
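Coming back to identification for a moment, the “PCD_IPI” check mentioned earlier can be sketched in a few lines of Python. This version only looks at exactly offset 2048, which is a simplification of “around offset 2048”, so treat it as a rough first pass rather than a replacement for the PRONOM signature. The function name `looks_like_pcd` is my own.

```python
PCD_MARKER = b"PCD_IPI"
PCD_MARKER_OFFSET = 2048  # simplification: the text says "around" this offset

def looks_like_pcd(path):
    """Return True if the file has the PCD_IPI string at offset 2048."""
    with open(path, "rb") as handle:
        handle.seek(PCD_MARKER_OFFSET)
        return handle.read(len(PCD_MARKER)) == PCD_MARKER
```

Only a handful of bytes are read, so the check is cheap to run across a whole disc image of PCD files.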

Migration to a more modern open format is a common preservation strategy. The National Archives and Records Administration has the format NF00224 listed as needing to migrate to JPG, while others prefer migration to TIFF. Others have learned valuable lessons attempting to find the right method for migration. There is a right way and a wrong way as the Center for Digital Archaeology learned. The easiest method is to use the popular ImageMagick command-line tool.

thorsted$ identify 136009.PCD 
136009.PCD PCD 768x512 768x512+0+0 8-bit YCC 3.44727MiB 0.020u 0:00.006
thorsted$ convert 136009.PCD[5] -colorspace sRGB +compress 136009.tif
thorsted$ identify 136009.tif
136009.tif TIFF 3072x2048 3072x2048+0+0 8-bit sRGB 18.0004MiB 0.000u 0:00.000

ImageMagick, along with most other tools like IrfanView and XnView, only sees the base resolution of 768 x 512, but adding “[5]” after the filename forces the conversion to use the fifth, 16 Base resolution, which is the highest resolution in most PCD files; the Pro versions may have higher. The other issue is the colorspace conversion. It is known there could be a loss of highlights. This webpage illustrates different tools and the issues with highlights. You can see the difference if I use -colorspace RGB instead of sRGB.

ImageMagick conversion using RGB vs sRGB colorspace setting.
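As a side note on the “[5]” index used in the conversion above, the relationship between the index and the pixel dimensions can be worked out with a little arithmetic. The sketch below assumes that index 3 is the 768 x 512 Base image and that each step up or down doubles or halves both dimensions. Only the Base and the “[5]” sizes are confirmed by the identify output above; the other rows are my assumption based on the Photo CD resolution ladder.

```python
BASE_WIDTH, BASE_HEIGHT = 768, 512  # the Base resolution reported by identify
BASE_INDEX = 3                      # assumed position of Base in the [n] syntax

def pcd_resolution(index):
    """Return (width, height) for a PCD index, assuming each step doubles the size."""
    if not 1 <= index <= 6:
        raise ValueError("index is expected to be between 1 and 6")
    scale = 2 ** (index - BASE_INDEX)
    return int(BASE_WIDTH * scale), int(BASE_HEIGHT * scale)

for n in range(1, 7):
    print(n, pcd_resolution(n))
```

Under these assumptions index 5 works out to 3072 x 2048, matching the TIFF produced above, and index 6 gives the 64 Base size found only in the Pro files.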

Other tools such as the open source pcdtojpeg and paid pcdMagic both work well, but the only tool I have tested so far which keeps the original metadata is pcdMagic.

ExifTool Version Number         : 12.62
File Name                       : 136009_1.tif
Directory                       : .
File Size                       : 38 MB
File Modification Date/Time     : 2023:06:27 12:06:26-06:00
File Access Date/Time           : 2023:06:27 12:06:29-06:00
File Inode Change Date/Time     : 2023:06:27 12:06:27-06:00
File Permissions                : -rw-r--r--
File Type                       : TIFF
File Type Extension             : tif
MIME Type                       : image/tiff
Exif Byte Order                 : Little-endian (Intel, II)
Subfile Type                    : Full-resolution image
Image Width                     : 3072
Image Height                    : 2048
Bits Per Sample                 : 16 16 16
Compression                     : Uncompressed
Photometric Interpretation      : RGB
Image Description               : color reversal: Unknown film. SBA settings neutral SBA on, color SBA on
Make                            : KODAK
Camera Model Name               : FilmScanner 2000
Strip Offsets                   : 1622
Samples Per Pixel               : 3
Rows Per Strip                  : 2048
Strip Byte Counts               : 37748736
Planar Configuration            : Chunky
Software                        : pcdMagic V1.4.19
Modify Date                     : 2023:06:27 12:06:26
Copyright                       : Copyright restrictions apply - see copyright file on original CD-ROM for details
Exif Version                    : 0231
Date/Time Original              : 1993:09:20 07:35:34
Create Date                     : 1993:09:20 07:35:34
Offset Time                     : -06:00
User Comment                    : color reversal: Unknown film. SBA settings neutral SBA on, color SBA on
Color Space                     : Uncalibrated
File Source                     : Film Scanner
Profile CMM Type                : Unknown (KCMS)
Profile Version                 : 2.1.0
Profile Class                   : Display Device Profile
Color Space Data                : RGB
Profile Connection Space        : XYZ
Profile Date Time               : 1998:12:01 18:58:21
Profile File Signature          : acsp
Primary Platform                : Microsoft Corporation
CMM Flags                       : Not Embedded, Independent
Device Manufacturer             : Kodak
Device Model                    : ROMM
Device Attributes               : Reflective, Glossy, Positive, Color
Rendering Intent                : Perceptual
Connection Space Illuminant     : 0.9642 1 0.82487
Profile Creator                 : Kodak
Profile ID                      : 0
Profile Copyright               : Copyright (c) Eastman Kodak Company, 1999, all rights reserved.
Profile Description             : ProPhoto RGB
Media White Point               : 0.9642 1 0.82489
Red Tone Reproduction Curve     : (Binary data 14 bytes, use -b option to extract)
Green Tone Reproduction Curve   : (Binary data 14 bytes, use -b option to extract)
Blue Tone Reproduction Curve    : (Binary data 14 bytes, use -b option to extract)
Red Matrix Column               : 0.79767 0.28804 0
Green Matrix Column             : 0.13519 0.71188 0
Blue Matrix Column              : 0.03134 9e-05 0.82491
Device Mfg Desc                 : KODAK
Device Model Desc               : Reference Output Medium Metric(ROMM)
Make And Model                  : (Binary data 40 bytes, use -b option to extract)
Image Size                      : 3072x2048
Megapixels                      : 6.3
Modify Date                     : 2023:06:27 12:06:26-06:00

There is a way to convert the PCD to TIFF using ImageMagick and then use Exiftool to map some of the metadata over to the new TIFF file. It would look something like this:

exiftool -addtagsfromfile 136009.PCD '-EXIF:DateTimeOriginal<PhotoCD:CreateDate' '-EXIF:CreateDate<PhotoCD:CreateDate' '-ExifIFD:SerialNumber<PhotoCD:ScannerSerialNumber' '-ExifIFD:ExifImageWidth<PhotoCD:ImageWidth' '-ExifIFD:ExifImageHeight<PhotoCD:ImageHeight' '-IFD0:Make<PhotoCD:ScannerVendorID' '-IFD0:Model<PhotoCD:ScannerProductID' '-IFD0:Orientation<PhotoCD:Orientation' '-IFD0:Copyright<PhotoCD:CopyrightStatus' 136009.tif

JPG Structure

If you haven’t been over to see the posters made by Ange Albertini, head over now. Below is his poster on the JPG image file format. This is the basic JFIF file format, which stands for JPEG File Interchange Format. There are also raw JPEG streams and Exif, the Exchangeable Image File Format.

The basic format is pretty straightforward. There is a start of image marker FFD8, some format information, then the compressed raster data, then an end of image marker FFD9. Identification of a JPEG file should be pretty straightforward too. Knowing the start and end marker values, and then the type of JPEG based on the application data, identification can be very specific. That is, until some software engineers start playing fast and loose with the format specifications.

A while back I received a JPG file which didn’t identify using the latest PRONOM signature. It’s happened before: some new phones came out and started using a newer version of the Exif specification, so I submitted an update to PRONOM for JPGs using Exif 2.3 and greater. I may also need to submit another signature soon for the newly released Exif 3.0 specification! But this JPG I received wasn’t a new version; it should have been identified with the current PRONOM signature. It started with FFD8, but when I went to look at the end of the file for the end of image marker FFD9, it wasn’t where I expected it to be.

This JPG file had an additional 9632 bytes after the FFD9 end of image marker. But why? The image rendered just fine in multiple JPG viewers. The only warning from Exiftool was for “Unrecognized MakerNotes”, which is not too uncommon. So I went to the JPG Exif specification.

EOI, Recording this marker is mandatory. It shall be recorded in this position.

But reading a little further we see…..

Moreover, Exif/DCF readers should be implemented to operate without interruption even if certain kinds of data have been recorded after EOI of the primary image defined in the Exif standard. Specifically, unknown data after EOI of the primary image should be skipped. (see section 4.7.1)

So the extra data is allowed by specification. Any readers should ignore or skip any data after the EOI (End of Image). Well that makes identification more difficult. All the PRONOM signatures are based on having the EOI marker at the “End”. Some have allowance for padding, but not enough for the worst offenders……

The image referenced above was created on a Huawei MHA-L29 cameraphone. But since finding this image, I have also found many Samsung phones do the same thing. Here is one from a Samsung SM-G975U1. Much less padding but enough to throw off identification.

Apple iPhones are not exempt from this “feature” either. When using the macOS Image Capture tool with the HEIC format, a bug can add an excessive amount of empty data at the end of the converted JPG file.

So, when it comes to identification, if your JPG files don’t seem to identify correctly, look closer at the end of the file, it may have some “extra” data.
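If you want to check your own files, a quick way to measure the “extra” data is to find the last FFD9 in the file and count what follows it. The Python sketch below does just that. The name `bytes_after_eoi` is my own, and the approach is deliberately naive: if the trailing data itself happens to contain the bytes FFD9 (an embedded second JPG, for example), the count will be too low.

```python
def bytes_after_eoi(data):
    """Return the number of bytes after the last FFD9 marker, or None if there is none."""
    if not data.startswith(b"\xff\xd8"):
        raise ValueError("data does not start with the JPEG SOI marker FFD8")
    last_eoi = data.rfind(b"\xff\xd9")
    if last_eoi == -1:
        return None
    # The marker itself is two bytes long; everything after it is trailing data.
    return len(data) - (last_eoi + 2)
```

A result of 0 means the file ends where a signature expects it to; anything larger is data a strict end-of-file signature would trip over.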

What’s the 411?

I am dating myself by using the phrase “What’s the 411?” Back in my day (before the Googles), if you wanted quick information you could pick up the “land line”, a corded phone in your home which could only make phone calls, and dial 4-1-1, and you would be connected to an operator who could help you locate businesses, tell you the time, answer simple questions, and was infinitely smarter than Alexa.

Around the same time I was using 4-1-1 to answer all my questions, digital cameras were just coming on the scene. One of those was the Sony Mavica line of digital cameras. They were unique as they used a floppy disk as the storage media. They had a small LCD screen for capture and playback of the captured images. In order to quickly preview the images captured on disk, the camera generates a hidden thumbnail file for each image; this file has the extension .411. When I first saw this file while copying a floppy from one of my Mavica cameras, it reminded me of the old information line. I first assumed it was a metadata file, as the first few Mavica cameras did not use EXIF in their files, but the files are simply raster images of 64×48 pixels. Of course Sony did not document this file format and probably hoped no one would notice, as the files are hidden on the FAT12 formatted floppy disk.

Video showing index of floppy disk.

One could argue about the value of documenting and possibly identifying thumbnail formats, as many in digital preservation have chosen not to keep the Thumbs.db file or other hidden files not meant to be preserved or accessible to the user. I have found that documenting any format found through technical appraisals provides value to everyone. Some may ultimately decide not to keep such formats in their repository, but knowing what they are is vital to the process. Come listen and chat with me about this topic at iPres 2023!

Usually the first part of documenting a format is looking for specifications online or documented somewhere. Since Sony did not publicly release any specifications for this format, we have to rely on others’ reverse engineering or do so ourselves. There have been a few attempts to document a conversion of the 411 format to a common raster format, such as this C code for conversion to BMP, converters to NetPBM formats like PPM, or the Java “Javica” software which makes use of the 411 files. My first step was to see if we could find some common patterns in the many samples I have from my Mavica collection. Running Marco Pontello’s TrIDScan across my 54 samples came up with no common patterns. This was expected, as all the reverse engineering efforts point out the format is probably based on the CCIR.601 specification, which is MPEG based on frames.

With no common patterns among all the samples, creating a PRONOM signature is not possible. In the future, file identification may be based more on dynamic pattern matching instead of the current static patterns we look for now. Until then, this may need to be submitted as an extension-only entry. Two things to note: the files created by the camera are all named starting with “MVC”, which could also be used for identification, and every .411 file is exactly 4608 bytes. The extension .411 is also pretty unique, so I doubt it will clash with any other format for the moment.
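Those three clues (the MVC prefix, the .411 extension and the fixed size) are enough for a rough check in Python. The sketch below is only a heuristic, and the function name `looks_like_mavica_411` is my own. As an aside on the size, 64 × 48 pixels at an average of 12 bits per pixel works out to exactly 4608 bytes, which would be consistent with a subsampled luma/chroma layout, though that reading is my inference rather than anything Sony documented.

```python
import os

MAVICA_THUMB_SIZE = 64 * 48 * 12 // 8  # 4608 bytes, assuming 12 bits per pixel

def looks_like_mavica_411(path):
    """Return True if the name and size match what the Mavica thumbnails look like."""
    name = os.path.basename(path).upper()
    return (
        name.startswith("MVC")
        and name.endswith(".411")
        and os.path.getsize(path) == MAVICA_THUMB_SIZE
    )
```

Because nothing inside the file is inspected, a match here is a hint and not proof of the format.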

Corel ArtShow

File extensions are the easiest way to quickly identify a file format, but they can be misleading. This is why, in digital preservation, format identification tools like DROID are important: they look closer at the file structure to identify formats more accurately. The other complication is that some extensions are used for more than one format. Extensions like .DOC or .ISO can be used with many formats.

The PRONOM registry which DROID uses will list extensions associated with each format signature, but for some, they only have an extension and no signature. It’s nice to have an official ID to go with a format but with no signature it only matches based on extension.

This caused a problem for me a while back when I was working with some files with the extension CDX. According to PRONOM, there are 5 completely different formats which use the extension, and there are probably others.

My CDX was related to some indexing software called Cindex. At the time the only format with a signature was the WARC summary file CDX. The other was a CorelDraw Compressed format with no signature. Confusing, right? When I would run format identification on my Cindex files, they would default to the CorelDraw Compressed format, identified by extension. It was easy enough to create a signature for the Cindex format as I had enough samples to know the patterns needed for correct identification. But I was curious about the CorelDraw format. Should be easy to find, right?

Wrong. A sample of this format proved very elusive. All I had to go by was the name given to the format by PRONOM and the extension. I scoured every Corel CD and image I could get my hands on. For months I looked and could never find a single CDX file. None of the CorelDraw versions I was able to run had any ability to save in the CDX format. I scoured clip art discs and other Corel software, like Designer, PrintHouse, and Photo-Paint. Nada, nothing. I started to wonder if the format even existed. That’s when I noticed, in the filters included with CorelDraw, a reference to the ability to import a CDX but not write to one.

[CDX]
Signature=CORELFILTER - A
FilterEntry=1
Description=CorelDRAW Compressed (CDX)
FilterFullName=CorelDRAW Import Filter
Version=Version 6.00
Company=Corel Corporation
Copyright=Copyright © 1988-1995 Corel Corporation
Extensions=*.CDX
CorelID=0x704
FilterCapability1=0x9000
FilterCapability2=0x0
NoOfCompressions=0

This led to me finding a reference on the old Corel FTP site for knowledge base number 4550.

It mentioned something called ArtShow, where version 5 supported the file format CDX. ArtShow was a gallery of winning designs released on a CD-ROM and in a book each year. The first one was ArtShow 91, followed by ArtShow 3, 4, 5, 6, and finally 7. Each release used a different proprietary compressed format for storing all the designs, and these formats exist nowhere else. The question remains: why didn’t they use other popular Corel formats like CDR, CMX, or CCX, which were used on many other clip art titles?

It took some time, but I was finally able to find copies of a few of the ArtShow CD-ROM discs, specifically numbers 5 & 6, which had the CDX format and the second generation CPX format.

Each format had an easy-to-recognize header, making a PRONOM signature easy to create. PRONOM already had PUIDs for the two formats, CDX & CPX, so I sent in the signatures to be added to the registry, which will hopefully help distinguish between all the CDX formats!

Embedded WAVE, thanks HP 👋

Digital Preservation is all about identifying risks. This is done through a process which includes identification, validation, and metadata extraction. The more you know about the digital data you need to preserve over time, the more you can do to minimize those risks with the goal of making the data accessible over time.

Many formats are pretty straightforward: they are identifiable through a header and then have some binary bits or plain text that is readable by certain software. Others are more complicated. A common practice for more complex needs is to use a container. Word processing programs started out with plain text with maybe some formatting codes mixed in, then many moved to the Microsoft OLE container so you could have additional content embedded in a single file. Today file formats such as DOCX use a ZIP container, which houses all the text, images, formatting and anything else the format supports. Knowing what the format is and knowing what it may contain is important to preservation.

IM000959.JPG

I collect older digital cameras, specifically cameras with unique file formats, raw and otherwise. When I picked up a HP (Hewlett-Packard) point and shoot camera awhile back, I was initially unimpressed as it would only capture in a JPEG format and only 3 quality settings. While looking at a copy of the manual, I saw the camera was capable of capturing audio clips or voice memos for each photo taken. This can be handy when taking many photos and need a reminder about the context. This was not unique to HP, as many cameras could do this, normally a JPG was captured and the Audio would have the same name connecting the two. But when I recorded some audio on my little HP, placed the SD card in my computer, I couldn’t find the additional audio file. I also not the only one to ask about this.

There are many types of JPG files: raw streams, JPEG File Interchange Format (JFIF), and Exchangeable Image File Format (EXIF). Normally these formats have raster image data sprinkled with metadata. I have seen JPEG files embedded into other formats and containers, such as MP3, PDF, etc., but JPEGs are not container formats. Or so I thought…..

View of HP Photosmart 433 folder in HP Photo & Imaging Gallery

Let’s take a look at an image I took with my HP Photosmart 433. We’ll start with identification:

siegfried   : 1.10.1
scandate    : 2023-05-25T12:27:04-06:00
signature   : default.sig
created     : 2023-05-22T08:43:02-06:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V112.xml; container-signature-20230510.xml'
---
filename : 'GitHub/digicam_corpus/HP/Photosmart 433/IM000959.JPG'
filesize : 178922
modified : 2023-05-25T11:23:32-06:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/391'
    format  : 'Exchangeable Image File Format (Compressed)'
    version : '2.2'
    mime    : 'image/jpeg'
    class   : 'Image (Raster)'
    basis   : 'extension match jpg; byte match at [[0 16] [366 12] [178907 2]] (signature 2/2)'
    warning : 

IM000959.JPG was identified as x-fmt/391, which is a compressed Exchangeable Image File Format, version 2.2. Pretty straightforward. Next, let’s look at validation:

Jhove (Rel. 1.28.0, 2023-05-18)
 Date: 2023-05-25 12:35:16 MDT
 RepresentationInformation: GitHub/digicam_corpus/HP/Photosmart 433/IM000959.JPG
  ReportingModule: JPEG-hul, Rel. 1.5.4 (2023-03-16)
  LastModified: 2023-05-25 11:23:32 MDT
  Size: 178922
  Format: JPEG
  Status: Well-Formed and valid
  SignatureMatches:
   JPEG-hul
  ErrorMessage: Tag 41492 out of sequence
   ID: TIFF-HUL-2
   Offset: 606
  MIMEtype: image/jpeg
  JPEGMetadata: 
   CompressionType: Huffman coding, Baseline DCT
   Images: 
    Number: 1
    Image: 
     NisoImageMetadata: 
      FormatName: image/jpeg
      ByteOrder: big_endian
      CompressionScheme: JPEG
      ImageWidth: 640
      ImageHeight: 480
      ColorSpace: YCbCr
      DateTimeCreated: 2021-11-16T09:04:04
      ScannerManufacturer: Hewlett-Packard
      ScannerModelName: hp PhotoSmart 43x series
      DigitalCameraManufacturer: Hewlett-Packard
      DigitalCameraModelName: hp PhotoSmart 43x series
      FNumber: 4
      ................................
     Exif: 
      ExifVersion: 0220
      FlashpixVersion: 0100
      ColorSpace: sRGB
      ComponentsConfiguration: 1, 2, 3, 0
      CompressedBitsPerPixel: 1.568
      PixelXDimension: 640
      PixelYDimension: 480
      MakerNote: 0, 97, 48, 101, 114, 32, 78, 111, 116, 101, 115, 0, 0, 0, 0, 0
      DateTimeOriginal: 2021:11:16 09:04:04
      DateTimeDigitized: 2021:11:16 09:04:04
   ApplicationSegments: APP1, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2

I removed a few lines to show the important parts, but we get similar information about the format: a JPEG with EXIF version 2.2. We also learn that HP improperly ordered their tags and put Tag 41492 out of sequence, but we can ignore that for now. Looking closely at the output does not give us any indication of audio formats, but there is a clue in the mention of a Flashpix version and the additional Application Segments.

Since this is an image with EXIF data, let’s also take a look at the output of Exiftool.

ExifTool Version Number         : 12.62
File Name                       : IM000959.JPG
Directory                       : .
File Size                       : 179 kB
File Modification Date/Time     : 2023:05:25 11:23:32-06:00
File Access Date/Time           : 2023:05:25 11:24:42-06:00
File Inode Change Date/Time     : 2023:05:25 11:24:39-06:00
File Permissions                : -rwxr-xr-x
File Type                       : JPEG
File Type Extension             : jpg
MIME Type                       : image/jpeg
Exif Byte Order                 : Little-endian (Intel, II)
Image Description               : IM000959.JPG
Make                            : Hewlett-Packard
Camera Model Name               : hp PhotoSmart 43x series
Orientation                     : Horizontal (normal)
X Resolution                    : 72
Y Resolution                    : 72
Resolution Unit                 : inches
Software                        : 1.400
Modify Date                     : 2021:11:16 09:04:04
Y Cb Cr Positioning             : Co-sited
Copyright                       : Copyright 2002-2003
Exposure Time                   : 1/29
F Number                        : 4.0
ISO                             : 100
Exif Version                    : 0220
Date/Time Original              : 2021:11:16 09:04:04
Create Date                     : 2021:11:16 09:04:04
Components Configuration        : Y, Cb, Cr, -
Compressed Bits Per Pixel       : 1.567552083
Shutter Speed Value             : 1/30
Aperture Value                  : 4.0
Exposure Compensation           : 0
Max Aperture Value              : 4.0
Subject Distance                : 1 m
Metering Mode                   : Average
Light Source                    : Unknown
Flash                           : Auto, Did not fire
Focal Length                    : 5.7 mm
Warning                         : [minor] Unrecognized MakerNotes
Flashpix Version                : 0100
Color Space                     : sRGB
Exif Image Width                : 640
Exif Image Height               : 480
Interoperability Index          : R98 - DCF basic file (sRGB)
Interoperability Version        : 0100
Digital Zoom Ratio              : 1
Subject Location                : 0
Compression                     : JPEG (old-style)
Thumbnail Offset                : 2046
Thumbnail Length                : 7112
Code Page                       : Unicode UTF-16, little endian
Used Extension Numbers          : 1, 31
Extension Name                  : Audio
Extension Class ID              : 10000100-6FC0-11D0-BD01-00609719A180
Extension Persistence           : Always Valid
Audio Stream                    : (Binary data 117820 bytes, use -b option to extract)
Image Width                     : 640
Image Height                    : 480
Encoding Process                : Baseline DCT, Huffman coding
Bits Per Sample                 : 8
Color Components                : 3
Y Cb Cr Sub Sampling            : YCbCr4:2:2 (2 1)
Aperture                        : 4.0
Image Size                      : 640x480
Megapixels                      : 0.307
Shutter Speed                   : 1/29
Thumbnail Image                 : (Binary data 7112 bytes, use -b option to extract)
Focal Length                    : 5.7 mm
Light Value                     : 8.9

Ohh, what do we have here? Exiftool mentions an audio stream. An audio stream inside the JPEG? How is this possible? The Flashpix format was originally developed by Kodak in collaboration with HP, and Flashpix extensions were later added to the EXIF specification. Below is a screenshot from the Exif Version 2.2 spec.

JHOVE and Exiftool both mentioned Flashpix, and JHOVE listed additional APP2 segments. Let’s take a look at the raw file in a hex editor.

Ahhh….. In one of the APP2 segments we can see something familiar. A RIFF WAVE header! Let’s see if we can extract the WAVE file.
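The same check can be done without a hex editor. The following is a minimal Python sketch, not a full JPEG parser: it walks the marker segments up to the start of the scan data and reports which APP2 segments contain a RIFF/WAVE header. The function names are hypothetical, and it assumes every marker before the scan carries a length field, which holds for ordinary camera files. Since the audio appears to be spread over several APP2 segments, only the segment holding the start of the stream would be expected to match.

```python
import struct


def list_segments(data: bytes):
    """Yield (marker, payload) for each segment before the image scan data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker, skip it
            pos += 1
            continue
        if marker == 0xDA:  # SOS: compressed scan data follows, stop here
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        yield marker, data[pos + 4:pos + 2 + length]
        pos += 2 + length


def find_riff_in_app2(data: bytes):
    """Return the indexes of APP2 segments whose payload holds a RIFF/WAVE header."""
    hits = []
    app2_index = 0
    for marker, payload in list_segments(data):
        if marker == 0xE2:  # APP2
            i = payload.find(b"RIFF")
            if i != -1 and payload[i + 8:i + 12] == b"WAVE":
                hits.append(app2_index)
            app2_index += 1
    return hits
```

Running find_riff_in_app2 over the bytes of a file like IM000959.JPG would show which APP2 segment carries the start of the audio, which is a quick way to flag such files in bulk.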

exiftool -b -AudioStream IM000959.JPG > IM000959.WAV

mediainfo IM000959.WAV
General
Complete name                            : IM000959.WAV
Format                                   : Wave
Format settings                          : WaveFormatEx
File size                                : 115 KiB
Duration                                 : 10 s 681 ms
Overall bit rate mode                    : Constant
Overall bit rate                         : 88.2 kb/s

Audio
Format                                   : ADPCM
Codec ID                                 : 11
Codec ID/Hint                            : Intel
Duration                                 : 10 s 681 ms
Bit rate mode                            : Constant
Bit rate                                 : 88.2 kb/s
Channel(s)                               : 1 channel
Sampling rate                            : 22.05 kHz
Bit depth                                : 4 bits
Stream size                              : 115 KiB (100%)

MediaInfo can give us details on the embedded WAVE file, which is pretty terrible quality: a mono, 4-bit ADPCM audio stream sampled at 22.05 kHz.
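The codec details MediaInfo reports come from the “fmt ” chunk at the start of the WAVE file. Here is a minimal Python sketch that reads that chunk directly. The function name and dictionary keys are my own, and I am assuming MediaInfo’s Codec ID of 11 is the hexadecimal format tag 0x0011, which is registered as DVI/IMA ADPCM (hence the “Intel” hint). The nominal bit rate is simply sample rate × channels × bits per sample, so 22050 × 1 × 4 = 88200 bits per second, matching the 88.2 kb/s above.

```python
import struct

# Only the two format tags relevant here; many more are registered.
FORMAT_TAGS = {0x0001: "PCM", 0x0011: "IMA/DVI ADPCM"}


def read_wave_format(data: bytes) -> dict:
    """Parse the 'fmt ' chunk of a RIFF/WAVE byte string."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (size,) = struct.unpack("<I", data[pos + 4:pos + 8])
        if chunk_id == b"fmt ":
            # First 16 bytes of 'fmt ' are the common WAVEFORMAT fields.
            tag, channels, rate, _byte_rate, _align, bits = struct.unpack(
                "<HHIIHH", data[pos + 8:pos + 24])
            return {
                "format_tag": tag,
                "codec": FORMAT_TAGS.get(tag, "unknown"),
                "channels": channels,
                "sample_rate": rate,
                "bits_per_sample": bits,
                "nominal_bit_rate": rate * channels * bits,
            }
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
    raise ValueError("no 'fmt ' chunk found")
```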

Embedded audio inside a raster image is not common. Most software which can render a JPEG image will most likely ignore the embedded WAVE and not even give a warning that it exists. IM000959.JPG opens fine in Adobe Photoshop, but saving to a new format or making any edits will delete the WAVE file. ImageMagick will also remove the WAVE on any edit, with no warning.

To ensure the embedded audio stream is preserved, we first need to know it is there. This is where tools like Exiftool can be used to extract metadata from the file, so the image can be flagged as having an audio stream and handled differently than any other JPEG file. More work is needed: Exiftool reports the Audio Stream and can extract it with the -b option, but it does not currently report any details about the audio itself.

Greenstreet

During the 1980’s and 90’s, there was an explosion of software created for the PC and Macintosh. When it came to graphic design, Aldus, Adobe, Quark, Serif, and a few others were clearly the best. That didn’t stop other software developers from trying their hand at publishing design software. If you were on a budget, there were plenty of options to choose from. One of them, Timeworks Publisher, was very popular. It was released in 1987 for IBM PC and Atari, with later releases for Apple II and Macintosh. The name was later changed to Pressworks. It was published by an interesting software company out of the UK called GST Software, also known under the GSP name. They really enjoyed licensing their software.

Desktop Publishing software

Timeworks Publisher may have been the first, but it was definitely not the last. Pressworks was very popular, so the software was licensed to and rebranded by many companies. In 2001 GST merged with eGames Europe to form a new company, Greenstreet Software, which continued to support the software. Some of the rebranded titles are:

  • FUJI Publisher
  • Global Software Publishing (in Europe) Pressworks, Power Publisher
  • GST Pressworks
  • 1st Press
  • IMSI TurboPublisher
  • Media Graphics Publishers Paradise Page Express
  • MicroVision Vision Publisher 4
  • NEBS PageMagic
  • PersonalSoft Publications (Français)
  • Pushbutton Publish
  • Softkey Publisher DOS
  • Sybex Page (Deutsch)
  • Timeworks Publisher, Publish-it, Publisher Lite, Publish-it Lite
  • VCI Pro Publisher
  • Wizardworks CompuWorks Publisher
  • Instant Home Publisher
  • Greenstreet Publisher
  • Canon Publishing Suite

All of the software listed above could open and save the same file format, with the extension .DTP, with full compatibility; the TPL extension was used for templates. Originally the DTP file format was a single proprietary binary format which had an ASCII header of “DTPI”, and all samples seemed to end with the ASCII “EODF”. Later the software was enhanced to be OLE compatible and the binary format was wrapped inside an OLE container. This made it work well for moving objects in and out of the software into other OLE-compatible software like Word, but it is confusing to format identification software, as the header is the same as a Word file. I have added the two versions of the DTP format to PRONOM to help identify them better. They are fmt/1415 and fmt/1416.
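For a quick check outside of DROID, the original (non-OLE) DTP layout described above can be tested with a few lines of Python. This is only a sketch: the function name is hypothetical, and I am assuming “DTPI” sits at the very start of the file and “EODF” falls somewhere in the last 16 bytes, which may not match every sample.

```python
def looks_like_raw_dtp(data: bytes) -> bool:
    """True if data starts with 'DTPI' and has 'EODF' near the end.

    The 16-byte tail window is an assumption to tolerate trailing padding.
    """
    return data.startswith(b"DTPI") and b"EODF" in data[-16:]
```

An OLE-wrapped DTP file fails this test by design, because its first bytes are the OLE container header instead of “DTPI”.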

Drawing Software

In addition to the popular Desktop Publishing software, there was a companion Drawing software licensed as well. It also had many titles:

  • BHV COLOURDRAW!
  • FUJI Designer
  • Global Software Publishing (in Europe) Designworks, Power Publisher
  • GST (in North America) PressworksDraw
  • 1st Design
  • IMSI TurboDraw
  • Media Graphics Publishers Paradise Design Studio
  • MicroVision Vision Draw
  • NEBS DesignMagic
  • PersonalSoft Création Graphique
  • Pushbutton Design
  • VCI Pro Design
  • Wizardworks CompuWorks Designer, CompuWorks Draw
  • Canon Publishing Suite

The Draw/Design software all used the same file format as well, with the extension .ART, also with full compatibility between all the titles. The TEM extension was used for templates. This is not to be confused with the AOL Image format, the Asymetrix Compel Image format, or a number of other formats using the ART extension. This format also began as a single proprietary binary format, with the ASCII header “GST:ART” starting at offset 16. And just like the DTP format, it was later wrapped in an OLE container to be more compatible. In fact, the DTP format may have embedded Art objects! This format is not in PRONOM, so let’s take a closer look.

You can see from the 1stdgn.art file here, the ascii “GST:ART” string starting at byte 16. This is consistent with all the samples I have. The first 16 bytes seem to vary in each sample and probably have to do with the size of the file and dimensions of the artwork. GST:ART is unique enough and should work well for a signature.
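Expressed as code, the signature is just a fixed string at a fixed offset. A minimal Python sketch follows, with names of my own choosing. It only covers the original binary format, not the later OLE-wrapped files.

```python
ART_MAGIC = b"GST:ART"   # ASCII string seen in every sample
ART_OFFSET = 16          # the first 16 bytes vary from file to file


def is_gst_art(data: bytes) -> bool:
    """True if the bytes carry 'GST:ART' at offset 16."""
    return data[ART_OFFSET:ART_OFFSET + len(ART_MAGIC)] == ART_MAGIC
```

In a signature, the same string is written as the hex byte sequence 4753543A415254 with an offset of 16 from the beginning of the file.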

The ART file from a later version of Draw is in the OLE file format. This container format was designed by Microsoft as a universal container to increase compatibility among software. You can see from the hex view above the file looks very similar to the DOC format used by Word. There were many software titles which used this container format, many documented here. One of the easiest ways to look inside an OLE container is to use 7-Zip. A quick listing of the file shows it is a Type = compound and includes three files. The SummaryInformation file is common among many OLE formats and can contain some metadata, but the Contents file is what we are looking for. Examining the Contents file we find it looks identical to the earlier version of the ART format. The same “GST:ART” string starting at byte 16.

A note about the Preview.dib file. It appears to be a Device-Independent Bitmap, similar to a Bitmap file, probably for a thumbnail preview.

Writing a signature for an OLE container format is a bit trickier. It requires a separate signature file to go along with the regular signature XML. Basically, DROID is set up to “trigger” once it discovers either a “ZIP” file or an “OLE” format. If it detects one of those formats, it then looks into the container signature XML for additional patterns. If it finds a match, it identifies the format; if not, it reports back a generic “ZIP” or “OLE” format.

As it turns out there were two different types of OLE file types, one used “Contents” for the internal file and another which used “CONTENTS”. Since the signature is case sensitive, the container signature requires two signatures both mapped to the same PUID.
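To sort samples into the two variants without opening each one in 7-Zip, a rough heuristic is to search the raw bytes for the stream name. OLE directory entries store their names as UTF-16LE, so the two spellings produce different byte patterns. The sketch below is only that, a heuristic with a hypothetical function name: it does not parse the container, and a match could in principle come from another name that happens to contain the same text.

```python
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 / Compound File Binary


def contents_stream_variant(data: bytes):
    """Return 'Contents', 'CONTENTS', or None for an OLE file's raw bytes.

    Heuristic only: looks for the UTF-16LE stream name anywhere in the file.
    """
    if not data.startswith(OLE_MAGIC):
        return None
    for name in ("Contents", "CONTENTS"):
        if name.encode("utf-16-le") in data:
            return name
    return None
```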

These two formats were used with quite a few software titles. Hopefully these signatures cover most of them! You can find a couple samples and my signatures on my Github.

LiveCode stack

One of the earliest hypermedia systems, which predated the World Wide Web, was HyperCard on the Macintosh. Within minutes you could have a small application to do just about anything: calendars, address books, interactive books, games, etc. The Internet Archive has collected many HyperCard stacks and emulates them directly in the browser.

Riding on the success of HyperCard was another hypermedia tool called MetaCard, which later became Runtime Revolution. Today it is known as LiveCode, a cross-platform application development system. LiveCode is often used to quickly create applications which can run on many platforms, including iOS. It is popular with students and in higher education. The LiveCode source was opened for a time following a successful Kickstarter campaign, but was closed again in 2021 as the company struggled to keep paying customers.

Each major LiveCode version produced its own stack file format. Currently none of the formats can be identified using preservation tools. Luckily, because the code was open source for a time, we have details which help us identify the formats. Let’s take a look:

#define kMCStackFileMetaCardVersionString "#!/bin/sh\n# MetaCard 2.4 stack\n
#define kMCStackFileVersionString_2_7 "REVO2700"
#define kMCStackFileVersionString_5_5 "REVO5500"
#define kMCStackFileVersionString_7_0 "REVO7000"
#define kMCStackFileVersionString_8_0 "REVO8000"
#define kMCStackFileVersionString_8_1 "REVO8100"

I took LiveCode up on their 10-day trial and was able to install software version 9.6.9 to save some samples. The software has a “Save as” option which allows you to save your code to older versions, although one must be careful, as saving to older versions may cause some data loss.

The samples I was able to save had matching headers just like in the source code. The REVO string starts right at the beginning of the file making identification easy. Take a look at my GitHub page for samples and signature. Also check out the File Format Wiki Page for more information and more samples!
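Because the version strings sit at the very start of the file, the check is easy to sketch in Python. The table below is taken from the source code constants shown earlier, but the function name and return values are my own, and the MetaCard entry uses only the visible beginning of that header string.

```python
# Header strings from the LiveCode source, mapped to the stack format version.
STACK_HEADERS = {
    b"REVO2700": "2.7",
    b"REVO5500": "5.5",
    b"REVO7000": "7.0",
    b"REVO8000": "8.0",
    b"REVO8100": "8.1",
}
# Only the leading portion of the MetaCard header is used here.
METACARD_PREFIX = b"#!/bin/sh\n# MetaCard 2.4 stack\n"


def stack_format_version(data: bytes):
    """Return the stack format version string, or None if unrecognized."""
    head = data[:8]
    if head in STACK_HEADERS:
        return STACK_HEADERS[head]
    if data.startswith(METACARD_PREFIX):
        return "MetaCard 2.4"
    return None
```

Reading just the first few dozen bytes of each file is enough, so a script like this can sort a large folder of stacks quickly.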