Embedded WAVE, thanks HP 👋

Digital Preservation is all about identifying risks. This is done through a process which includes identification, validation, and metadata extraction. The more you know about the digital data you need to preserve over time, the more you can do to minimize those risks with the goal of making the data accessible over time.

Many formats are pretty straight forward, they are identifiable through a header and then have some binary bits or plain text that is readable by certain software. Others are more complicated. A common practice for more complex needs is to use a container. Word processing programs started out with plain text with maybe some formatting codes mixed in, then many moved to the Microsoft OLE container so you could have additional content embedded in a single file. Today file formats such as DOCX use a ZIP container, which houses all the text, images, formatting and anything else the format supports. Knowing what the format is and knowing what it may contain is important to preservation.

IM000959.JPG

I collect older digital cameras, specifically cameras with unique file formats, raw and otherwise. When I picked up a HP (Hewlett-Packard) point and shoot camera awhile back, I was initially unimpressed as it would only capture in a JPEG format and only 3 quality settings. While looking at a copy of the manual, I saw the camera was capable of capturing audio clips or voice memos for each photo taken. This can be handy when taking many photos and need a reminder about the context. This was not unique to HP, as many cameras could do this, normally a JPG was captured and the Audio would have the same name connecting the two. But when I recorded some audio on my little HP, placed the SD card in my computer, I couldn’t find the additional audio file. I also not the only one to ask about this.

There are many types of JPG files. Raw Streams, JPEG File Interchange Format (JFIF), and Exchangeable Image File Format (EXIF). Normally these formats have raster image data sprinkled with metadata. I have seen JPEG files embedded into other formats and containers, such as MP3, PDF, etc, but JPEG’s are not container formats. Or so I thought…..

View of HP Photosmart 433 folder in HP Photo & Imaging Gallery

Lets take a look at an image I took with my HP Photosmart 433. We’ll start with identification:

siegfried   : 1.10.1
scandate    : 2023-05-25T12:27:04-06:00
signature   : default.sig
created     : 2023-05-22T08:43:02-06:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V112.xml; container-signature-20230510.xml'
---
filename : 'GitHub/digicam_corpus/HP/Photosmart 433/IM000959.JPG'
filesize : 178922
modified : 2023-05-25T11:23:32-06:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/391'
    format  : 'Exchangeable Image File Format (Compressed)'
    version : '2.2'
    mime    : 'image/jpeg'
    class   : 'Image (Raster)'
    basis   : 'extension match jpg; byte match at [[0 16] [366 12] [178907 2]] (signature 2/2)'
    warning : 

IM000959.JPG was identified as x-fmt/391 which is a compressed Exchangeable Image File Format. version 2.2. Pretty straight forward. Next lets look at validation:

Jhove (Rel. 1.28.0, 2023-05-18)
 Date: 2023-05-25 12:35:16 MDT
 RepresentationInformation: GitHub/digicam_corpus/HP/Photosmart 433/IM000959.JPG
  ReportingModule: JPEG-hul, Rel. 1.5.4 (2023-03-16)
  LastModified: 2023-05-25 11:23:32 MDT
  Size: 178922
  Format: JPEG
  Status: Well-Formed and valid
  SignatureMatches:
   JPEG-hul
  ErrorMessage: Tag 41492 out of sequence
   ID: TIFF-HUL-2
   Offset: 606
  MIMEtype: image/jpeg
  JPEGMetadata: 
   CompressionType: Huffman coding, Baseline DCT
   Images: 
    Number: 1
    Image: 
     NisoImageMetadata: 
      FormatName: image/jpeg
      ByteOrder: big_endian
      CompressionScheme: JPEG
      ImageWidth: 640
      ImageHeight: 480
      ColorSpace: YCbCr
      DateTimeCreated: 2021-11-16T09:04:04
      ScannerManufacturer: Hewlett-Packard
      ScannerModelName: hp PhotoSmart 43x series
      DigitalCameraManufacturer: Hewlett-Packard
      DigitalCameraModelName: hp PhotoSmart 43x series
      FNumber: 4
      ................................
     Exif: 
      ExifVersion: 0220
      FlashpixVersion: 0100
      ColorSpace: sRGB
      ComponentsConfiguration: 1, 2, 3, 0
      CompressedBitsPerPixel: 1.568
      PixelXDimension: 640
      PixelYDimension: 480
      MakerNote: 0, 97, 48, 101, 114, 32, 78, 111, 116, 101, 115, 0, 0, 0, 0, 0
      DateTimeOriginal: 2021:11:16 09:04:04
      DateTimeDigitized: 2021:11:16 09:04:04
   ApplicationSegments: APP1, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2

I removed a few lines to show important parts, but we get some similar information about the format, a JPEG with EXIF version 2.2. We also learn that HP improperly ordered their tags and put Tag 41492 out of sequence, but we can ignore that for now. Looking close at the output does not give us any indication of audio formats. There is a clue when we see the mention of a Flashpix version and additional Application Segments.

Since this is an image with EXIF data, lets also take a look at the output of Exiftool.

ExifTool Version Number         : 12.62
File Name                       : IM000959.JPG
Directory                       : .
File Size                       : 179 kB
File Modification Date/Time     : 2023:05:25 11:23:32-06:00
File Access Date/Time           : 2023:05:25 11:24:42-06:00
File Inode Change Date/Time     : 2023:05:25 11:24:39-06:00
File Permissions                : -rwxr-xr-x
File Type                       : JPEG
File Type Extension             : jpg
MIME Type                       : image/jpeg
Exif Byte Order                 : Little-endian (Intel, II)
Image Description               : IM000959.JPG
Make                            : Hewlett-Packard
Camera Model Name               : hp PhotoSmart 43x series
Orientation                     : Horizontal (normal)
X Resolution                    : 72
Y Resolution                    : 72
Resolution Unit                 : inches
Software                        : 1.400
Modify Date                     : 2021:11:16 09:04:04
Y Cb Cr Positioning             : Co-sited
Copyright                       : Copyright 2002-2003
Exposure Time                   : 1/29
F Number                        : 4.0
ISO                             : 100
Exif Version                    : 0220
Date/Time Original              : 2021:11:16 09:04:04
Create Date                     : 2021:11:16 09:04:04
Components Configuration        : Y, Cb, Cr, -
Compressed Bits Per Pixel       : 1.567552083
Shutter Speed Value             : 1/30
Aperture Value                  : 4.0
Exposure Compensation           : 0
Max Aperture Value              : 4.0
Subject Distance                : 1 m
Metering Mode                   : Average
Light Source                    : Unknown
Flash                           : Auto, Did not fire
Focal Length                    : 5.7 mm
Warning                         : [minor] Unrecognized MakerNotes
Flashpix Version                : 0100
Color Space                     : sRGB
Exif Image Width                : 640
Exif Image Height               : 480
Interoperability Index          : R98 - DCF basic file (sRGB)
Interoperability Version        : 0100
Digital Zoom Ratio              : 1
Subject Location                : 0
Compression                     : JPEG (old-style)
Thumbnail Offset                : 2046
Thumbnail Length                : 7112
Code Page                       : Unicode UTF-16, little endian
Used Extension Numbers          : 1, 31
Extension Name                  : Audio
Extension Class ID              : 10000100-6FC0-11D0-BD01-00609719A180
Extension Persistence           : Always Valid
Audio Stream                    : (Binary data 117820 bytes, use -b option to extract)
Image Width                     : 640
Image Height                    : 480
Encoding Process                : Baseline DCT, Huffman coding
Bits Per Sample                 : 8
Color Components                : 3
Y Cb Cr Sub Sampling            : YCbCr4:2:2 (2 1)
Aperture                        : 4.0
Image Size                      : 640x480
Megapixels                      : 0.307
Shutter Speed                   : 1/29
Thumbnail Image                 : (Binary data 7112 bytes, use -b option to extract)
Focal Length                    : 5.7 mm
Light Value                     : 8.9

Ohh, what do we have here? Exiftool mentions an audio stream. An audio stream inside the JPEG? How is this possible? The Flashpix format was originally developed by Kodak in which collaborated with HP. This was later added to the EXIF specifications. Below is an screenshot from the Exif Version 2.2 spec.

Exiftool mentioned Flashpix and additional APP2 segments. Lets take a look at the raw file in a hex editor.

Ahhh….. In one of the App2 segments we can see something familiar. A RIFF WAVE header! Lets see if we can extract the WAVE file.

exiftool -b -AudioStream IM000959.JPG > IM000959.WAV

mediainfo IM000959.WAV
General
Complete name                            : IM000959.WAV
Format                                   : Wave
Format settings                          : WaveFormatEx
File size                                : 115 KiB
Duration                                 : 10 s 681 ms
Overall bit rate mode                    : Constant
Overall bit rate                         : 88.2 kb/s

Audio
Format                                   : ADPCM
Codec ID                                 : 11
Codec ID/Hint                            : Intel
Duration                                 : 10 s 681 ms
Bit rate mode                            : Constant
Bit rate                                 : 88.2 kb/s
Channel(s)                               : 1 channel
Sampling rate                            : 22.05 kHz
Bit depth                                : 4 bits
Stream size                              : 115 KiB (100%)

MediaInfo can give us details on the embedded WAVE file, which is pretty terrible quality but is a PCM audio stream.

Embedded audio inside a raster image is not common. Most software which can render a JPEG image will most likely ignore the embedded WAVE and not even give a warning it exists. IM000959.JPG opens fine in Adobe Photoshop, but saving to a new format or making any edits will delete the WAVE file. Imagemagick also will remove the WAVE with any editing with no warning.

In order to ensure the embedded audio stream is preserved we first need to know it is there, this is where tools like exiftool can be used to extract metadata from the file and the image can be associated with having an audio stream and handled differently than any other JPEG file. More work is needed, Exiftool may mention an Audio Stream, but currently does not have the ability to pull any data from the stream.

3M Printscape

There are some file formats out there which are confusing. One such file came across my desk awhile back. This file was not identifiable with any tools I threw at it. At first I believed it to be a TIFF file variant.

You can see the TIFF header, but would not open as one, even if the extension was changed from PSC to TIF. The other hint was the phrase “3M Printscape”, I had never heard of it and there wasn’t much information available about it. It seems it was a creative product made by 3M in the early 2000’s. You could buy a package of printable cards, gift bags, etc. The problem was, there was no available software to be found. I searched on the Internet Archive, the Wayback Machine, and many other abandoned software sites. For months I searched, it wasn’t until a year later I came across one of the creative packages at thrift store. I was thrilled. That is until I was able to get the software installed.

After I installed the software in a virtual machine running Windows 98 I tried to open the PSC file but the software was looking for files with the extension STD, which is an unfortunate acronym. Turns out it stands for SureThing Document. SureThing is a software company who develops Label software. After many months of searching I thought I had found the software to render my file, but it was not meant to be.

Many months later I decided to do some more searches. That is when another copy of 3M Printscape showed up in the Internet Archive. 3M Printscape 2.0! It appears 3M decided to design their own software for version 2.0.

The preservation value of the above image is not lost on me. What took me over a year to figure out ended up being a simple pixelated image of a cardinal. Its the journey, not the destination?

From this little adventure I was able to submit two file formats to PRONOM, fmt/1275, fmt/1276. Also I documented the formats and linked to the software on the File Format Wiki. The 3M Printscape version 2 was also released for Macintosh, so the signature had to account for endianness, just like a TIFF file would. With the format having the string “3M Printscape” in the header, it made for an easy signature.

Hopefully, I will be the last to spend this much time on an image of a bird.

Greenstreet

During the 1980’s and 90’s, there was an explosion of software created for the PC and Macintosh. When it came to graphic design, Aldus, Adobe, Quark, Serif, and a few others were clearly the best. That didn’t stop other software developers in trying their hand with publishing design software. If you were on a budget, there were plenty of options to choose from. One of them, Timeworks Publisher, was very popular. It was released in 1987 for IBM PC and Atari with later releases for Apple II and Macintosh. The name was later changed to Pressworks. It was published by an interesting software company out of the UK called GST Software, also under the GSP name. They really enjoyed licensing their software.

Desktop Publishing software

TimeWorks Publisher may have been the first, but was definitely not the last. Pressworks was very popular so the software was sold and rebranded to many companies. In 2001 GST merged with eGames Europe as a new company, Greenstreet Software who continued to support the software. Some of which are:

  • FUJI Publisher
  • Global Software Publishing (in Europe) Pressworks, Power Publisher
  • GST Pressworks
  • 1st Press
  • IMSI TurboPublisher
  • Media Graphics Publishers Paradise Page Express
  • MicroVision Vision Publisher 4
  • NEBS PageMagic
  • PersonalSoft Publications (Français)
  • Pushbutton Publish
  • Softkey Publisher DOS
  • Sybex Page (Deutsch)
  • Timeworks Publisher, Publish-it, Publisher Lite, Publish-it Lite
  • VCI Pro Publisher
  • Wizardworks CompuWorks Publisher
  • Instant Home Publisher
  • Greenstreet Publisher
  • Canon Publishing Suite

All the of the software listed above could open and save to the same file format with the extension .DTP with full compatibility, also used TPL for templates. Originally the DTP file format was a single proprietary binary format which had an ascii header of “DTPI” and all seemed to end with the ascii “EODF”. Later the software was enhanced to be OLE compatible and the binary format was wrapped inside. This made it work well for moving objects in and out of the software into other OLE compatible software like Word, but is confusing to format identification software as the header is the same as a Word file. I have added the two versions of the DTP format to PRONOM to help identify them better. They are fmt/1415 and fmt/1416.

Drawing Software

In addition to the popular Desktop Publishing software, there was a companion Drawing software licensed as well. It also had many titles:

  • BHV COLOURDRAW!
  • FUJI Designer
  • Global Software Publishing (in Europe) Designworks, Power Publisher
  • GST (in North America) PressworksDraw
  • 1st Design
  • IMSI TurboDraw
  • Media Graphics Publishers Paradise Design Studio
  • MicroVision Vision Draw
  • NEBS DesignMagic
  • PersonalSoft Création Graphique
  • Pushbutton Design
  • VCI Pro Design
  • Wizardworks CompuWorks Designer, CompuWorks Draw
  • Canon Publishing Suite

The Draw/Design software all used the same file format as well with the extension .ART, also with full compatibility between all the titles. The TEM extension was used for templates. Not to be confused with the AOL Image format, or Asymetrix Compel Image format, or a number of other formats using the ART extension. This format also began as a single proprietary binary format with the ascii header “GST:ART” starting at offset 16. And just like the DTP format it was later wrapped in an OLE container to be more compatible. In fact, the DTP format may have embedded Art objects! This format is not in PRONOM, so lets take a closer look.

You can see from the 1stdgn.art file here, the ascii “GST:ART” string starting at byte 16. This is consistent with all the samples I have. The first 16 bytes seem to vary in each sample and probably have to do with the size of the file and dimensions of the artwork. GST:ART is unique enough and should work well for a signature.

The ART file from a later version of Draw is in the OLE file format. This container format was designed by Microsoft as a universal container to increase compatibility among software. You can see from the hex view above the file looks very similar to the DOC format used by Word. There were many software titles which used this container format, many documented here. One of the easiest ways to look inside an OLE container is to use 7-Zip. A quick listing of the file shows it is a Type = compound and includes three files. The SummaryInformation file is common among many OLE formats and can contain some metadata, but the Contents file is what we are looking for. Examining the Contents file we find it looks identical to the earlier version of the ART format. The same “GST:ART” string starting at byte 16.

A note about the Preview.dib file. It appears to be a Device-Independent Bitmap, similar to a Bitmap file, probably for a thumbnail preview.

Writing a signature for an OLE container format is a bit more tricky. It requires a separate signature file to go along with the regular signature xml. Basically DROID is setup to “trigger” once it discovers either a “ZIP” file or “OLE” format. If it detects one of those formats it then looks into the container signature xml for additional patterns. If it finds a match then it identifies the format, if not it reports back a generic “ZIP” or “OLE” format.

As it turns out there were two different types of OLE file types, one used “Contents” for the internal file and another which used “CONTENTS”. Since the signature is case sensitive, the container signature requires two signatures both mapped to the same PUID.

These two formats were used with quite a few software titles. Hopefully these signatures cover most of them! You can find a couple samples and my signatures on my Github.

LiveCode stack

One of the earliest hypermedia systems which predated the world wide web was called HyperCard on the Macintosh. Within minutes you could have a small application to do just about anything, calendar, address book, interactive books, games, etc. The internet archive has collected many HyperCard stacks and emulates them directly in the browser.

Riding on the success of HyperCard was another hypermedia tool called MetaCard, which later became Runtime Revolution. Today it is known as LiveCode, a cross-platform application development system. LiveCode is often used to quickly create applications which can run on many platforms including iOS. It is popular with students and higher education. The LiveCode source was opened for a time instigated by a successful kickstarter program, but closed in 2021 as the company struggled to keep paying customers.

Each LiveCode version produced unique files for each of the major versions. Currently none of the formats can be identified using preservation tools. Luckily, because the code was open-source for a time, we have details which helps us identify the formats. Let’s take a look:

#define kMCStackFileMetaCardVersionString "#!/bin/sh\n# MetaCard 2.4 stack\n
#define kMCStackFileVersionString_2_7 "REVO2700"
#define kMCStackFileVersionString_5_5 "REVO5500"
#define kMCStackFileVersionString_7_0 "REVO7000"
#define kMCStackFileVersionString_8_0 "REVO8000"
#define kMCStackFileVersionString_8_1 "REVO8100"

I took LiveCode up on their 10 day trial and was able to install software version 9.6.9 to save some samples. The software has a “Save as” option which allows you to save your code to older versions. Although one must be careful as saving to older versions may have some data loss.

The samples I was able to save had matching headers just like in the source code. The REVO string starts right at the beginning of the file making identification easy. Take a look at my GitHub page for samples and signature. Also check out the File Format Wiki Page for more information and more samples!