Digital Preservation is all about identifying risks. This is done through a process which includes identification, validation, and metadata extraction. The more you know about the digital data you need to preserve over time, the more you can do to minimize those risks with the goal of making the data accessible over time.
Many formats are pretty straightforward: they are identifiable through a header and then have some binary bits or plain text that is readable by certain software. Others are more complicated. A common practice for more complex needs is to use a container. Word processing programs started out with plain text with maybe some formatting codes mixed in, then many moved to the Microsoft OLE container so you could have additional content embedded in a single file. Today file formats such as DOCX use a ZIP container, which houses all the text, images, formatting and anything else the format supports. Knowing what the format is and what it may contain is important to preservation.
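As a rough illustration of identifying a container by its header, here is a minimal Python sketch that checks the first bytes of a file against the well-known ZIP and OLE2 magic numbers. The function names `sniff_container` and `sniff_file` are made up for this example, and real identification tools check far more than a single header.

```python
# Magic numbers for two common container formats.
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header (DOCX, XLSX, ...)
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file (legacy DOC, XLS, ...)


def sniff_container(header: bytes) -> str:
    """Return a rough container guess from the first bytes of a file."""
    if header.startswith(ZIP_MAGIC):
        return "zip"
    if header.startswith(OLE_MAGIC):
        return "ole2"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 8 bytes of a file and guess its container type."""
    with open(path, "rb") as fh:
        return sniff_container(fh.read(8))
```

A header match only says which container is in use. It says nothing about what is stored inside, which is why identification is followed by validation and metadata extraction.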
I collect older digital cameras, specifically cameras with unique file formats, raw and otherwise. When I picked up an HP (Hewlett-Packard) point-and-shoot camera a while back, I was initially unimpressed, as it would only capture in JPEG format with only 3 quality settings. While looking at a copy of the manual, I saw the camera was capable of capturing audio clips or voice memos for each photo taken. This can be handy when you take many photos and need a reminder about the context. This was not unique to HP, as many cameras could do this. Normally a JPG was captured and the audio file would have the same name, connecting the two. But when I recorded some audio on my little HP and placed the SD card in my computer, I couldn’t find the additional audio file. I am also not the only one to ask about this.
There are many types of JPG files: raw streams, JPEG File Interchange Format (JFIF), and Exchangeable Image File Format (EXIF). Normally these formats have raster image data sprinkled with metadata. I have seen JPEG files embedded into other formats and containers, such as MP3, PDF, etc., but JPEGs are not container formats. Or so I thought...
Let's take a look at an image I took with my HP Photosmart 433. We'll start with identification:
siegfried   : 1.10.1
scandate    : 2023-05-25T12:27:04-06:00
signature   : default.sig
created     : 2023-05-22T08:43:02-06:00
identifiers :
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V112.xml; container-signature-20230510.xml'
---
filename : 'GitHub/digicam_corpus/HP/Photosmart 433/IM000959.JPG'
filesize : 178922
modified : 2023-05-25T11:23:32-06:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/391'
    format  : 'Exchangeable Image File Format (Compressed)'
    version : '2.2'
    mime    : 'image/jpeg'
    class   : 'Image (Raster)'
    basis   : 'extension match jpg; byte match at [[0 16] [366 12] [178907 2]] (signature 2/2)'
    warning :
IM000959.JPG was identified as x-fmt/391, which is a compressed Exchangeable Image File Format, version 2.2. Pretty straightforward. Next, let's look at validation:
Jhove (Rel. 1.28.0, 2023-05-18)
 Date: 2023-05-25 12:35:16 MDT
 RepresentationInformation: GitHub/digicam_corpus/HP/Photosmart 433/IM000959.JPG
  ReportingModule: JPEG-hul, Rel. 1.5.4 (2023-03-16)
  LastModified: 2023-05-25 11:23:32 MDT
  Size: 178922
  Format: JPEG
  Status: Well-Formed and valid
  SignatureMatches:
   JPEG-hul
  ErrorMessage: Tag 41492 out of sequence
   ID: TIFF-HUL-2
   Offset: 606
  MIMEtype: image/jpeg
  JPEGMetadata:
   CompressionType: Huffman coding, Baseline DCT
   Images:
    Number: 1
    Image:
     NisoImageMetadata:
      FormatName: image/jpeg
      ByteOrder: big_endian
      CompressionScheme: JPEG
      ImageWidth: 640
      ImageHeight: 480
      ColorSpace: YCbCr
      DateTimeCreated: 2021-11-16T09:04:04
      ScannerManufacturer: Hewlett-Packard
      ScannerModelName: hp PhotoSmart 43x series
      DigitalCameraManufacturer: Hewlett-Packard
      DigitalCameraModelName: hp PhotoSmart 43x series
      FNumber: 4
      ................................
     Exif:
      ExifVersion: 0220
      FlashpixVersion: 0100
      ColorSpace: sRGB
      ComponentsConfiguration: 1, 2, 3, 0
      CompressedBitsPerPixel: 1.568
      PixelXDimension: 640
      PixelYDimension: 480
      MakerNote: 0, 97, 48, 101, 114, 32, 78, 111, 116, 101, 115, 0, 0, 0, 0, 0
      DateTimeOriginal: 2021:11:16 09:04:04
      DateTimeDigitized: 2021:11:16 09:04:04
   ApplicationSegments: APP1, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2, APP2
I removed a few lines to show the important parts, but we get similar information about the format: a JPEG with EXIF version 2.2. We also learn that HP improperly ordered their tags and put Tag 41492 out of sequence, but we can ignore that for now. Looking closely at the output does not give us any indication of audio formats. There is a clue, though, in the mention of a Flashpix version and the additional Application Segments.
Since this is an image with EXIF data, let's also take a look at the output of ExifTool.
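The application segments JHOVE lists can also be enumerated by walking the JPEG marker structure directly. The sketch below is a simplified reader, not a validator: it assumes every marker before the start-of-scan carries a two-byte big-endian length, and it treats the text before the first NUL byte of each APPn payload as its identifier. The function name `list_app_segments` is made up for this example. For Exif files carrying Flashpix extensions, I would expect the APP2 identifier to be `FPXR`, but treat that as an assumption to check against your own files.

```python
import struct


def list_app_segments(data: bytes):
    """Return (name, payload_length, identifier) for each APPn segment before the scan data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    pos = 2
    segments = []
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:            # fill byte before a marker, skip it
            pos += 1
            continue
        if marker in (0xDA, 0xD9):    # SOS (compressed data follows) or EOI
            break
        # The length field counts itself (2 bytes) plus the payload.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if 0xE0 <= marker <= 0xEF:    # APP0 through APP15
            ident = payload.split(b"\x00", 1)[0].decode("ascii", "replace")
            segments.append((f"APP{marker - 0xE0}", len(payload), ident))
        pos += 2 + length
    return segments
```

Calling `list_app_segments(open("IM000959.JPG", "rb").read())` should give one entry per application segment, which makes it easy to spot a file with an unusual number of APP2 segments.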
ExifTool Version Number : 12.62
File Name : IM000959.JPG
Directory : .
File Size : 179 kB
File Modification Date/Time : 2023:05:25 11:23:32-06:00
File Access Date/Time : 2023:05:25 11:24:42-06:00
File Inode Change Date/Time : 2023:05:25 11:24:39-06:00
File Permissions : -rwxr-xr-x
File Type : JPEG
File Type Extension : jpg
MIME Type : image/jpeg
Exif Byte Order : Little-endian (Intel, II)
Image Description : IM000959.JPG
Make : Hewlett-Packard
Camera Model Name : hp PhotoSmart 43x series
Orientation : Horizontal (normal)
X Resolution : 72
Y Resolution : 72
Resolution Unit : inches
Software : 1.400
Modify Date : 2021:11:16 09:04:04
Y Cb Cr Positioning : Co-sited
Copyright : Copyright 2002-2003
Exposure Time : 1/29
F Number : 4.0
ISO : 100
Exif Version : 0220
Date/Time Original : 2021:11:16 09:04:04
Create Date : 2021:11:16 09:04:04
Components Configuration : Y, Cb, Cr, -
Compressed Bits Per Pixel : 1.567552083
Shutter Speed Value : 1/30
Aperture Value : 4.0
Exposure Compensation : 0
Max Aperture Value : 4.0
Subject Distance : 1 m
Metering Mode : Average
Light Source : Unknown
Flash : Auto, Did not fire
Focal Length : 5.7 mm
Warning : [minor] Unrecognized MakerNotes
Flashpix Version : 0100
Color Space : sRGB
Exif Image Width : 640
Exif Image Height : 480
Interoperability Index : R98 - DCF basic file (sRGB)
Interoperability Version : 0100
Digital Zoom Ratio : 1
Subject Location : 0
Compression : JPEG (old-style)
Thumbnail Offset : 2046
Thumbnail Length : 7112
Code Page : Unicode UTF-16, little endian
Used Extension Numbers : 1, 31
Extension Name : Audio
Extension Class ID : 10000100-6FC0-11D0-BD01-00609719A180
Extension Persistence : Always Valid
Audio Stream : (Binary data 117820 bytes, use -b option to extract)
Image Width : 640
Image Height : 480
Encoding Process : Baseline DCT, Huffman coding
Bits Per Sample : 8
Color Components : 3
Y Cb Cr Sub Sampling : YCbCr4:2:2 (2 1)
Aperture : 4.0
Image Size : 640x480
Megapixels : 0.307
Shutter Speed : 1/29
Thumbnail Image : (Binary data 7112 bytes, use -b option to extract)
Focal Length : 5.7 mm
Light Value : 8.9
Ohh, what do we have here? ExifTool mentions an audio stream. An audio stream inside the JPEG? How is this possible? The Flashpix format was originally developed by Kodak in collaboration with HP. Support for Flashpix extensions was later added to the Exif specification. Below is a screenshot from the Exif Version 2.2 spec.
The tool output mentioned Flashpix, and JHOVE listed additional APP2 segments. Let's take a look at the raw file in a hex editor.
Ahhh... In one of the APP2 segments we can see something familiar. A RIFF WAVE header! Let's see if we can extract the WAVE file.
exiftool -b -AudioStream IM000959.JPG > IM000959.WAV
mediainfo IM000959.WAV

General
Complete name : IM000959.WAV
Format : Wave
Format settings : WaveFormatEx
File size : 115 KiB
Duration : 10 s 681 ms
Overall bit rate mode : Constant
Overall bit rate : 88.2 kb/s

Audio
Format : ADPCM
Codec ID : 11
Codec ID/Hint : Intel
Duration : 10 s 681 ms
Bit rate mode : Constant
Bit rate : 88.2 kb/s
Channel(s) : 1 channel
Sampling rate : 22.05 kHz
Bit depth : 4 bits
Stream size : 115 KiB (100%)
MediaInfo can give us details on the embedded WAVE file, which is pretty terrible quality: a 4-bit, 22.05 kHz, single-channel ADPCM audio stream.
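The same details can be read straight from the WAVE header. Python's built-in `wave` module only handles uncompressed PCM, so this sketch parses the RIFF `fmt ` chunk by hand. The function name `read_wave_format` and the small `FORMAT_TAGS` table are made up for this example and cover only a few registered format tags. Codec ID 11 in the MediaInfo output is hexadecimal, which corresponds to format tag 0x0011, the Intel DVI/IMA ADPCM codec.

```python
import struct

# A few registered WAVE format tags (not an exhaustive list).
FORMAT_TAGS = {
    0x0001: "PCM",
    0x0002: "Microsoft ADPCM",
    0x0011: "IMA/DVI ADPCM (Intel)",
}


def read_wave_format(data: bytes) -> dict:
    """Parse the 'fmt ' chunk of a RIFF/WAVE byte string."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE stream")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack("<4sI", data[pos:pos + 8])
        if chunk_id == b"fmt ":
            body = data[pos + 8:pos + 24]
            if size < 16 or len(body) < 16:
                raise ValueError("fmt chunk is too short")
            tag, channels, rate, byte_rate, _align, bits = struct.unpack("<HHIIHH", body)
            return {
                "format": FORMAT_TAGS.get(tag, f"unknown (0x{tag:04x})"),
                "channels": channels,
                "sample_rate": rate,
                "bit_rate": byte_rate * 8,
                "bits_per_sample": bits,
            }
        pos += 8 + size + (size & 1)   # RIFF chunks are padded to an even length
    raise ValueError("no fmt chunk found")
```

Running `read_wave_format(open("IM000959.WAV", "rb").read())` on the extracted file should report the same codec, channel count and sampling rate that MediaInfo shows.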
Embedded audio inside a raster image is not common. Most software that can render a JPEG image will most likely ignore the embedded WAVE and not even warn that it exists. IM000959.JPG opens fine in Adobe Photoshop, but saving to a new format or making any edits will delete the WAVE file. ImageMagick will also remove the WAVE during any edit, with no warning.
To ensure the embedded audio stream is preserved, we first need to know it is there. This is where tools like ExifTool come in: they extract metadata from the file so the image can be flagged as having an audio stream and handled differently from any other JPEG file. More work is needed. ExifTool reports the Audio Stream and can extract it with the -b option, but it currently does not report any details about the audio itself.
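One way to flag such files in a workflow is to ask ExifTool for the AudioStream tag in JSON form and check whether it comes back. This is a minimal sketch that assumes `exiftool` is installed and on the PATH. The function names `has_embedded_audio` and `scan` are made up for this example, and the tag name is the one used in the extraction command earlier in this post.

```python
import json
import subprocess


def has_embedded_audio(exiftool_json: str) -> bool:
    """Return True if any record in ExifTool's JSON output carries an AudioStream tag."""
    records = json.loads(exiftool_json)
    return any("AudioStream" in record for record in records)


def scan(path: str) -> bool:
    """Run ExifTool on one file and report whether it found an embedded audio stream."""
    result = subprocess.run(
        ["exiftool", "-j", "-AudioStream", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return has_embedded_audio(result.stdout)
```

Files for which `scan` returns True could then be routed to a separate handling path, for example extracting the WAVE alongside the JPEG before any image processing touches the file.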