RCA-VOC

I wonder sometimes what goes through a software/hardware developers mind when deciding a format to use for a new device. There are so many options our there for audio formats to choose from. I am sure there are pros and cons to using one technology over another but it seems a few decide to go ahead and make their own. I am sure there is some commercial advantage to developing a proprietary audio format, but with all the established choices it seems unnecessary.

Sony developed their own audio compression formats, which I explored in an earlier blog post. I came across a small goofy looking RCA voice recorder, model VR6320.

Many of these RCA VR series recorders can record in a WAV or a VOC file format. The WAV files are pretty run of the mill, but the VOC format is unique to RCA recorders.

The VOC format is not to be confused with another audio format with the same extension. The Creative Voice Format is a bit more well known. It was used with the Creative’s sound cards (Sound Blaster family) many folks had in their Windows computers in the 1990’s. But the RCA file format is different, and because of the same extension needs its own identification so they are not confused with each other.

sf REC00001.VOC 
---
siegfried   : 1.10.1
scandate    : 2023-11-19T23:33:47-07:00
signature   : default.sig
created     : 2023-05-12T09:10:13Z
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V112.xml; container-signature-20230510.xml'
---
filename : 'REC00001.VOC'
filesize : 47231
modified : 2015-01-09T20:51:10-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'UNKNOWN'
    format  : 
    version : 
    mime    : 
    class   : 
    basis   : 
    warning : 'no match; possibilities based on extension are fmt/1736'

The RCA VOC file format seems to be undocumented, there isn’t much available. You can always download a copy of the RCA Digital Voice Manager software, which may or may not run on your current system, and convert the VOC files to WAV or you can use a piece of software coded in 2008 called “devoc“. The developer used to have an online website you could upload the VOC to and it would convert it automatically, but is not longer available. The code can also be found here.

Let’s take a look at the header of a couple of the files I have:

hexdump -C REC00001.VOC | head
00000000  56 43 50 31 36 32 5f 56  4f 43 5f 46 69 6c 65 0c  |VCP162_VOC_File.|
00000010  0f 01 09 14 32 1c 00 00  0b 44 03 00 00 00 00 00  |....2....D......|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001b0  00 10 00 00 00 00 00 00  00 00 00 10 00 00 00 00  |................|
000001c0  00 00 00 00 00 10 00 00  00 00 00 00 00 00 00 10  |................|
000001d0  00 00 00 00 00 00 00 ff  ff ff ff ff ff ff ff ff  |................|
000001e0  ef 11 14 d3 96 77 57 44  34 33 34 44 43 33 44 43  |.....wWD434DC3DC|
000001f0  43 34 44 34 43 43 34 44  43 43 33 35 43 33 43 34  |C4D4CC4DCC35C3C4|
00000200  34 43 43 24 34 43 43 33  44 51 33 42 14 44 32 43  |4CC$4CC3DQ3B.D2C|

hexdump -C A0000003.VOC | head
00000000  52 50 35 31 32 30 5f 56  4f 43 5f 46 69 6c 65 78  |RP5120_VOC_Filex|
00000010  08 06 16 0a 0f 20 00 04  17 01 03 00 00 00 00 00  |..... ..........|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000180  00 03 b9 2f 00 07 62 af  00 0b 0c 2f 00 0e b5 af  |.../..b..../....|
00000190  00 12 5f 2f 00 00 00 00  00 00 00 00 00 00 00 00  |.._/............|
000001a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000fa0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 e1  |................|
00000fb0  ea eb ea fe df ae 4e a1  1d cd 1c cf 9f de cf 3b  |......N........;|

Most of samples I have show “VCP162_VOC_File” in the header, but I have one sample with “RP5120_VOC_File“. I have heard of others, one being “V432_Voice_File“. There could be more variations. One could assume the header is somehow associated with the model number of the device, but that doesn’t appear to be the case. Although there is a device with the model number “RP 5120“. It might be that the older RP series get one header and the newer VR Series get VCP? I will need more samples to confirm, if you have any send them my way. Also, according to the manuals, there is a SP and LP mode to manage the bitrate of the file to squeeze more minutes on the built in memory of these devices. This doesn’t appear to affect identification, but might be good to differentiate in the future.

For now you can take a look at the signature on my GitHub page.

Adobe Acrobat Capture

During the recent PRONOM Research Week, I noticed a file format with no description and no signature.

x-fmt/217Adobe ACD

All I had to go on was it was an Adobe format and the acronym “ACD”. One of the first results that came up in a google search was a post in the Adobe forums with someone asking what to do with some old ACD and ACI files they found on a disc, circa 2000, labeled “Adobe Capture”. The only thing I remember about Adobe Capture was some scanning tools related to Adobe Acrobat, but I didn’t remember coming across any ACD files related to Acrobat.

Initially it wasn’t easy to find more information on this format. Eventually I was able to narrow it down to stand-alone software adobe released called “Adobe Acrobat Capture”. Originally released in 1995 it was eventually discontinued in 2010. The software was marketed under the ePaper name and connected to Acrobat through the creation of a PDF from scanned images. The software was compatible with many scanner models and would process the scanned images, run Optical Character recognition, and export to a searchable PDF. These tools are built into Adobe Acrobat today.

One of the reasons the software was being so elusive is the fact it was sold with a high price tag and required the use of a hardware key, or dongle, in order to process scans. The hardware key also managed the type of license you purchased which may limit the number of pages you are allowed to scan within a certain period of time. So the software is very difficult to run today, if you do happen to find a copy out there in Internet land.

In order to document these file formats for preservation purposes I needed to find some samples. I was excited to find a demonstration CD on the Internet Archive, but unfortunately it contained no examples of the ACD file format.

A little sleuthing on the Wayback Machine helped me find a few user guides and brochures. I was also able to find there was three versions of Adobe Acrobat Capture. In a Product Brochure, you can see a screenshot of the software with a document open with the ACD extension.

If you are OCD like me you might have noticed the window in this screenshot is typical of the older Windows 3.1 or Windows NT system. So this was indeed an older product released by Adobe.

The Adobe Acrobat Capture 3.0 Demonstration CD-ROM from the Internet Archive luckily has a UserGuide PDF on the disc and was able to help me understand the ACD format a little more.

Looks like the ACD format is an intermediate format used by the software to manage the process between scanning and export to PDF. ACD was also defined as an “Acrobat Capture Document” which makes sense. They were also mentioned as being “multipage files in Acrobat Capture Document (ACD)”. The UserGuide also mentioned an ACP format which it referenced as “one-page files are in Acrobat Capture Page (ACP) format.” So more research is needed.

Lets start with Adobe Acrobat Capture 2.0 as I managed to get a few samples from an installer I found. Here is a hexdump of an ACD file and its corresponding ACI file.

hexdump -C CONTRACT.ACD | head
00000000  02 04 47 47 c9 00 86 b5  01 00 b6 27 02 00 01 00  |..GG.......'....|
00000010  f5 00 5e 00 3b 96 02 00  01 6e 63 6a 00 00 88 68  |..^.;....ncj...h|
00000020  00 00 26 00 44 3a 5c 43  4f 44 45 5c 47 47 5c 50  |..&.D:\CODE\GG\P|
00000030  52 4f 44 55 43 54 2e 33  32 53 5c 49 4e 5c 63 6f  |RODUCT.32S\IN\co|
00000040  6e 74 72 61 63 74 2e 61  63 69 00 00 00 00 00 00  |ntract.aci......|
00000050  7c 33 c0 27 00 40 ff ff  ff 00 03 00 03 00 00 00  ||3.'.@..........|
00000060  00 00 00 00 00 00 40 00  00 00 00 00 00 03 00 00  |......@.........|
00000070  00 00 00 00 00 00 00 40  00 00 00 00 09 00 0a ab  |.......@........|
00000080  04 0b 14 b5 04 39 19 00  40 00 00 00 00 0c 14 b0  |.....9..@.......|
00000090  04 38 19 b0 04 08 00 0a  7f 06 d3 11 89 06 39 17  |.8............9.|

hexdump -C CONTRACT.ACI | head
00000000  49 49 2a 00 b3 0c 02 00  35 80 78 a0 80 35 c0 78  |II*.....5.x..5.x|
00000010  a4 80 35 40 3c 54 40 01  e2 b2 01 e2 b2 01 e2 b2  |..5@<T@.........|
00000020  01 e2 b2 01 e2 b2 01 e2  b2 01 e2 b2 01 e2 b2 01  |................|
00000030  e2 b2 01 e2 b2 01 e2 b2  01 e2 b2 01 e2 b2 01 e2  |................|
00000040  b2 01 e2 b2 01 e2 b2 01  e2 b2 01 e2 b2 01 e2 b2  |................|
00000050  01 e2 b2 01 e2 b2 01 e2  b2 01 e2 b2 01 e2 b2 01  |................|
00000060  e2 b2 01 e2 b2 01 e2 b2  01 e2 b2 01 e2 b2 01 e2  |................|
00000070  b2 01 e2 b2 01 e2 b2 01  e2 b2 01 e2 b2 01 e2 b2  |................|
00000080  01 e2 b2 01 e0 b0 01 e0  b0 01 e0 b0 01 e0 b0 01  |................|
00000090  e0 b0 01 e0 b0 01 e0 b0  01 e0 b0 01 e0 b0 01 e0  |................|

The ACD file is unique, PRONOM and even TrID was unaware of the format. But to the keen observer, the ACI format is very recognizable. You may have seen this header before:

Lets take a closer look at an ACI file to see if they are a true TIFF image or if there is any customization to the format.

tiffinfo CONTRACT.ACI 
=== TIFF directory 0 ===
TIFF Directory at offset 0x20cb3 (134323)
  Subfile Type: (0 = 0x0)
  Image Width: 2544 Image Length: 3295
  Resolution: 300, 300
  Bits/Sample: 1
  Compression Scheme: CCITT RLE
  Photometric Interpretation: min-is-white
  Samples/Pixel: 1
  Rows/Strip: 32
  Planar Configuration: single image plane
  Software: HALO Desktop Imager

exiftool -D CONTRACT.ACI 
    - ExifTool Version Number         : 12.60
    - File Name                       : CONTRACT.ACI
    - Directory                       : TUTORIAL/SAMPOUT
    - File Size                       : 134 kB
    - File Modification Date/Time     : 1995:07:10 16:02:08-06:00
    - File Access Date/Time           : 2023:11:14 15:41:02-07:00
    - File Inode Change Date/Time     : 2023:11:08 08:34:18-07:00
    - File Permissions                : -rwxrwxrwx
    - File Type                       : TIFF
    - File Type Extension             : tif
    - MIME Type                       : image/tiff
    - Exif Byte Order                 : Little-endian (Intel, II)
  254 Subfile Type                    : Full-resolution image
  256 Image Width                     : 2544
  257 Image Height                    : 3295
  258 Bits Per Sample                 : 1
  259 Compression                     : CCITT 1D
  262 Photometric Interpretation      : WhiteIsZero
  273 Strip Offsets                   : (Binary data 625 bytes, use -b option to extract)
  277 Samples Per Pixel               : 1
  278 Rows Per Strip                  : 32
  279 Strip Byte Counts               : (Binary data 448 bytes, use -b option to extract)
  282 X Resolution                    : 300
  283 Y Resolution                    : 300
  305 Software                        : HALO Desktop Imager
    - Image Size                      : 2544x3295
    - Megapixels                      : 8.4

Looks like a true TIFF image with no special tags or unique properties. They are 1-bit TIFF’s compressed with CCITT RLE. Not sure there would be any need to create a special signature for these ACI files.

Looking closer at the ACD file format, we can see they reference ACI files, so probably safe to assume the ACD file doesn’t contain the full raster data for each image:

hexdump -C Report.acd
00000000  02 04 47 47 c9 00 9a 8b  00 00 d4 ce 00 00 03 00  |..GG............|
00000010  f5 02 5f 00 00 61 01 00  01 6e 63 6a 01 00 30 5f  |.._..a...ncj..0_|
00000020  00 00 27 00 63 3a 5c 63  61 70 74 75 72 65 32 5c  |..'.c:\capture2\|
00000030  73 61 6d 70 6c 65 73 5c  6f 75 74 5c 52 65 70 6f  |samples\out\Repo|
00000040  72 74 5f 30 30 30 31 2e  61 63 69 00 00 01 00 00  |rt_0001.aci.....|
00000050  00 00 00 00 00 00 00 00  00 00 e8 03 00 00 01 00  |................|
00000060  01 00 00 00 00 00 00 00  00 00 08 00 52 65 70 6f  |............Repo|
00000070  72 74 30 31 00 00 00 00  70 33 d8 27 00 40 ff ff  |rt01....p3.'.@..|
*
00005f40  07 00 40 6f 00 09 00 40  01 6e 63 6a 02 00 52 2c  |..@o...@.ncj..R,|
00005f50  00 00 27 00 63 3a 5c 63  61 70 74 75 72 65 32 5c  |..'.c:\capture2\|
00005f60  73 61 6d 70 6c 65 73 5c  6f 75 74 5c 52 65 70 6f  |samples\out\Repo|
00005f70  72 74 5f 30 30 30 32 2e  61 63 69 00 00 00 00 00  |rt_0002.aci.....|
00005f80  00 00 00 00 4e 0c fe ff  ff ff e8 03 00 00 01 00  |....N...........|
00005f90  01 00 00 00 00 00 00 00  00 00 08 00 52 65 70 6f  |............Repo|
00005fa0  72 74 30 32 00 00 00 00  4c 31 f0 27 00 40 ff ff  |rt02....L1.'.@..|

From the limited sample set I have access, all the ACD files begin with the same Hex values, “02044747C900”. Along with the common header we can assume there should be at least one ACI file referenced in the first part of the file. Because it is referenced as a filepath, the ACI string would be variable in its offset.

Adobe Acrobat Capture 3.0 turns out to be a different format. But looks familiar………

hexdump -C Contract.acd | head
00000000  50 4b 03 04 14 00 00 00  08 00 3b ba 6e 57 23 9d  |PK........;.nW#.|
00000010  8e b8 3d 00 00 00 3e 00  00 00 09 00 40 00 46 49  |..=...>.....@.FI|
00000020  4c 45 53 2e 4c 53 54 0a  00 20 00 00 00 00 00 00  |LES.LST.. ......|
00000030  00 00 00 80 e6 e9 ca 50  17 da 01 80 e6 e9 ca 50  |.......P.......P|
00000040  17 da 01 80 e6 e9 ca 50  17 da 01 4e 55 18 00 4e  |.......P...NU..N|
00000050  55 43 58 09 00 46 00 49  00 4c 00 45 00 53 00 2e  |UCX..F.I.L.E.S..|
00000060  00 4c 00 53 00 54 00 8b  76 74 76 31 8c e5 e5 f2  |.L.S.T..vtv1....|
00000070  0c 76 f6 f7 0d f0 0f f6  0c 71 b5 0d 09 0a 75 e5  |.v.......q....u.|
00000080  e5 f2 0b f5 75 f3 f4 71  0d b6 35 e4 e5 02 31 fc  |....u..q..5...1.|
00000090  1c 7d 5d 0d 6d 9d f3 f3  4a 8a 12 93 4b f4 12 93  |.}].m...J...K...|

sf Contract.acd 
---
siegfried   : 1.10.1
scandate    : 2023-11-15T09:10:01-07:00
signature   : default.sig
created     : 2023-10-11T15:10:17-06:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V114.xml; container-signature-20230822.xml'
---
filename : 'Contract.acd'
filesize : 79002
modified : 2023-11-14T23:17:53-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/263'
    format  : 'ZIP Format'
    version : 
    mime    : 'application/zip'
    basis   : 'byte match at [[0 4] [78886 3] [78980 4]]'
    warning : 'extension mismatch'

Yep, its a zip container file. lets take a peek inside to see what it is composed of.

7z l Contract.acd 
--
Path = Contract.acd
Type = zip
Physical Size = 79002

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2023-11-14 23:17:54 ....A           62           61  FILES.LST
2023-11-14 23:17:54 ....A          410          226  Contract.acd
2023-11-14 23:17:52 ....A       150213        78093  Contract.acp
------------------- ----- ------------ ------------  ------------------------
2023-11-14 23:17:54             150685        78380  3 files

The the Contract ACD file is like a nesting doll, an ACD within an ACD. Lets see what the ACD and ACP is made of.

hexdump -C Contract.acd | head
00000000  00 01 00 00 00 02 04 47  47 2d 01 9a 01 00 00 02  |.......GG-......|
00000010  00 00 00 02 00 01 01 00  00 00 01 00 00 00 04 04  |................|
00000020  00 00 00 09 00 57 69 6e  67 64 69 6e 67 73 05 00  |.....Wingdings..|
00000030  41 72 69 61 6c 0b 00 43  6f 75 72 69 65 72 20 4e  |Arial..Courier N|
00000040  65 77 0f 00 54 69 6d 65  73 20 4e 65 77 20 52 6f  |ew..Times New Ro|
00000050  6d 61 6e 05 01 00 00 00  02 00 00 00 78 01 00 00  |man.........x...|
00000060  0f 00 54 69 6d 65 73 20  4e 65 77 20 52 6f 6d 61  |..Times New Roma|
00000070  6e 00 00 00 20 0b 00 00  c0 0a 00 00 00 00 00 00  |n... ...........|
00000080  00 06 00 00 00 0f 00 54  69 6d 65 73 20 4e 65 77  |.......Times New|
00000090  20 52 6f 6d 61 6e 00 00  00 20 0c 00 00 00 0c 00  | Roman... ......|

hexdump -C Contract.acp | head
00000000  25 50 44 46 2d 31 2e 33  0d 25 e2 e3 cf d3 0d 0a  |%PDF-1.3.%......|
00000010  31 20 30 20 6f 62 6a 0d  3c 3c 20 0d 2f 54 79 70  |1 0 obj.<< ./Typ|
00000020  65 20 2f 43 61 74 61 6c  6f 67 20 0d 2f 50 61 67  |e /Catalog ./Pag|
00000030  65 73 20 32 20 30 20 52  20 0d 2f 53 74 72 75 63  |es 2 0 R ./Struc|
00000040  74 54 72 65 65 52 6f 6f  74 20 34 20 30 20 52 20  |tTreeRoot 4 0 R |
00000050  0d 2f 43 41 50 54 5f 49  6e 66 6f 20 3c 3c 20 2f  |./CAPT_Info << /|
00000060  56 20 33 30 31 20 2f 46  53 20 5b 20 28 57 69 6e  |V 301 /FS [ (Win|
00000070  67 64 69 6e 67 73 29 28  41 72 69 61 6c 29 28 43  |gdings)(Arial)(C|
00000080  6f 75 72 69 65 72 20 4e  65 77 29 28 54 69 6d 65  |ourier New)(Time|
00000090  73 20 4e 65 77 20 52 6f  6d 61 6e 29 5d 20 2f 4c  |s New Roman)] /L|

The ACD has some of the same hex values as the previous version, but with some extra bytes at the beginning and it looks like the ACP is a straight up PDF. But may have some interesting tags, like “CAPT_info”.

The problem we will face when trying to write a signature for this version of ACD is the container signature needs a static file name to reference, and it appears the name of the container is also the name of the ACD file within the container. So every file will be different. I wish there was a way in the PRONOM signature syntax to reference an extension and ignore the filename, but currently there no method to do this. The only thing inside the container which seems to be consistent is the file “FILES.LST”. So lets take a peek inside if it.

hexdump -C FILES.LST | head
00000000  5b 41 43 44 31 5d 0d 0a  49 53 43 4f 4d 50 4f 53  |[ACD1]..ISCOMPOS|
00000010  49 54 45 3d 54 52 55 45  0d 0a 4e 55 4d 46 49 4c  |ITE=TRUE..NUMFIL|
00000020  45 53 3d 31 0d 0a 46 49  4c 45 4e 41 4d 45 31 3d  |ES=1..FILENAME1=|
00000030  43 6f 6e 74 72 61 63 74  2e 61 63 70 0d 0a        |Contract.acp..|

Ok, there seems to be some static information that is unique to the ACD format. I bet the string “[ACD1]” would be sufficient enough to make a solid signature.

This is a good format example of a limited amount of information on the file format used by a well known company which has become obsolete and disappeared. Take a look at my signatures, maybe you have some old ACD files you were unaware of!

Gone in a Flash

This week I am at the annual iPres digital preservation conference. It is an amazing week of meeting colleagues and old friends who share the same passion of digital preservation. Outside of this community and my co-workers, talking about file formats and digital preservation usually bores people to death and I can hear some of them mumble under their breath, “nerd”! I term I am happy to accept.

At the conference, which is in lovely Urbana-Champaign Illinois this year, I am trying to soak in all the amazing talks and conversations about the challenges facing our work. There were a couple great workshops on Persistent Identifiers and Digital Object Storage Criteria. The presentations I made were of course on File Formats, documentation, and obsolescence. One talk before my panel conversation was about the ubiquitous Adobe Flash format.

The paper, “Around for Decades, Gone in a Flash: How we dealt with Flash objects and the National Archives of the Netherlands” was presented by Lotte Wijsman and Marin Rappard. They knew they had flash objects in their web archives and wanted to go through the process of how they might be preserved and accessed. They started out looking for any files with “FLA”, “SWF”, and “FLV” as extensions. This proved problematic as there were references to those extensions within other documents and objects. They then used DROID to identify the flash formats. “SWF” has quite a number of format PUID’s.

PUIDFormat NameFormat VersionExtension
fmt/104Macromedia Flash1swf,
fmt/105Macromedia Flash2swf,
fmt/106Macromedia Flash3swf,
fmt/107Macromedia Flash4swf,
fmt/108Macromedia Flash5swf,
fmt/109Macromedia Flash6swf,
fmt/110Macromedia Flash7swf,
fmt/505Adobe Flash8swf,
fmt/506Adobe Flash9swf,
fmt/507Adobe Flash10swf,
fmt/757Adobe Flash11swf,
fmt/758Adobe Flash12swf,
fmt/759Adobe Flash13swf,
fmt/760Adobe Flash14swf,
fmt/761Adobe Flash15swf,
fmt/762Adobe Flash16swf,
fmt/763Adobe Flash17swf,
fmt/764Adobe Flash18swf,
fmt/765Adobe Flash19swf,
fmt/766Adobe Flash20swf,
fmt/767Adobe Flash21swf,
fmt/768Adobe Flash22swf,
fmt/769Adobe Flash23swf,
fmt/770Adobe Flash24swf,
fmt/771Adobe Flash25swf,
fmt/772Adobe Flash26swf,
fmt/773Adobe Flash27swf,
fmt/774Adobe Flash28swf,
fmt/775Adobe Flash29swf,
fmt/776Adobe Flash30swf,

Even the Macromedia/Adobe Flash Video format has a PRONOM PUID, x-fmt/382.

The format missing from PRONOM is the FLA format. FLA is the native format for Macromedia/Adobe Flash for saving the source project of your Flash document. SWF files are compiled from the FLA source. This means the the SWF will be the most common format found on the web and in public places, but the FLA format might be more often found on local drives and working folders.

The Flash format and software was actually created by Future Wave software in 1996 as FutureSplash Animator, but was shortly bought by Macromedia later that year and Flash 1.0 was born. FutureSplash used the extension .SPA, but Macromedia changed it to FLA.

The format was initially based on the Microsoft Compound File Format or the OLE container format.

oledir Flash4-S01.fla 
oledir 0.54 - http://decalage.info/python/oletools
OLE directory entries in file Flash4-S01.fla:
----+------+-------+----------------------+-----+-----+-----+--------+------
id  |Status|Type   |Name                  |Left |Right|Child|1st Sect|Size  
----+------+-------+----------------------+-----+-----+-----+--------+------
0   |<Used>|Root   |Root Entry            |-    |-    |1    |5       |4416  
1   |<Used>|Stream |Contents              |2    |-    |-    |6       |4013  
2   |<Used>|Stream |Page 1                |-    |-    |-    |0       |329   
3   |unused|Empty  |                      |-    |-    |-    |0       |0     
----+----------------------------+------+--------------------------------------
id  |Name                        |Size  |CLSID                                 
----+----------------------------+------+--------------------------------------
0   |Root Entry                  |-     |597CAA70-72AA-11CF-831E-524153480000  
1   |Contents                    |4013  |                                      
2   |Page 1                      |329   |   

The FLA format stayed with OLE until Adobe Flash CS5, which the format changed to use a ZIP container to store all the content.

Flash5.5-S01.fla
Type = zip
Physical Size = 216632

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2022-07-09 11:57:46 .....           25           25  mimetype
2022-07-09 11:57:46 .....            9            9  Flash5.5-S01.xfl
2022-07-09 11:57:46 D....            0            0  LIBRARY
2022-07-09 11:57:46 D....            0            0  META-INF
2022-07-09 11:57:46 .....        49267         3936  DOMDocument.xml
2022-07-09 11:57:48 .....         9735         1103  META-INF/metadata.xml
2022-07-09 11:57:48 .....         8093         2222  PublishSettings.xml
2022-07-09 11:57:48 .....            0            0  MobileSettings.xml
2022-07-09 11:57:48 D....            0            0  LIBRARY/Mouth shape graphic symbols
2022-07-09 11:57:48 D....            0            0  LIBRARY/Voice
2022-07-09 11:57:48 .....       151006       151006  bin/M 1 1252032698.dat
2022-07-09 11:57:48 .....        99707        15311  LIBRARY/mouth.xml
2022-07-09 11:57:48 .....        16510         4534  LIBRARY/Mouth shape graphic symbols/A I.xml
2022-07-09 11:57:48 .....        14334         4086  LIBRARY/Mouth shape graphic symbols/C D G K N R S Th Y Z.xml
2022-07-09 11:57:48 .....        14531         4040  LIBRARY/Mouth shape graphic symbols/E.xml
2022-07-09 11:57:48 .....        15846         4007  LIBRARY/Mouth shape graphic symbols/F V D Th.xml
2022-07-09 11:57:48 .....        13093         3542  LIBRARY/Mouth shape graphic symbols/L D Th.xml
2022-07-09 11:57:48 .....         2106          751  LIBRARY/Mouth shape graphic symbols/M B P Closed.xml
2022-07-09 11:57:48 .....        14130         3949  LIBRARY/Mouth shape graphic symbols/O.xml
2022-07-09 11:57:48 .....        11082         2951  LIBRARY/Mouth shape graphic symbols/Open_Rest.xml
2022-07-09 11:57:48 .....        14847         4066  LIBRARY/Mouth shape graphic symbols/U.xml
2022-07-09 11:57:48 .....         8139         2202  LIBRARY/Mouth shape graphic symbols/W Q.xml
2022-07-09 11:57:48 .....        15768         3914  LIBRARY/panda.xml
2022-07-09 11:57:48 .....        10477         1064  LIBRARY/sample graph.xml
2022-07-09 11:57:48 .....          538          538  bin/SymDepend.cache
------------------- ----- ------------ ------------  ------------------------
2022-07-09 11:57:48             469243       213256  21 files, 4 folders

The move to a ZIP container included a new format, XFL. This XFL file is a simple text file with the text “PROXY-CS5″. In the DOMDocument.xml file we find an XML namespace, xmlns=”http://ns.adobe.com/xfl/2008/” and a version of the XFL structure, xflVersion=”2.1″.

This ZIP compressed FLA file is still being used in the current Adobe Animate software, which no longer uses the flash technology and uses more modern web formats like HTML5 to display the animations.

I took each version and made a PRONOM signature, which you can find here with samples. These container signatures should cover all the major changes for the format, but there is a problem……..

Listing archive: Flash5.5-S01v5.fla

--
Path = Flash5.5-S01v5.fla
Type = zip
ERRORS:
Headers Error
Physical Size = 216581
Embedded Stub Size = 63
Characteristics = Local

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2022-07-09 11:57:46 .....           25           25  mimetype
2022-07-09 11:57:46 D....            0            0  LIBRARY
2022-07-09 11:57:46 D....            0            0  META-INF
2022-07-09 11:57:46 .....        48556         3742  DOMDocument.xml
2022-07-09 11:57:48 .....        10133         1112  META-INF/metadata.xml
2022-07-09 11:57:48 .....         8115         2219  PublishSettings.xml
2022-07-09 11:57:48 .....            0            0  MobileSettings.xml
2022-07-09 11:57:48 D....            0            0  LIBRARY/Mouth shape graphic symbols
2022-07-09 11:57:48 D....            0            0  LIBRARY/Voice
2022-07-09 11:57:48 .....       151006       151006  bin/M 1 1252032698.dat
2022-07-09 11:57:48 .....        99551        15319  LIBRARY/mouth.xml
2022-07-09 11:57:48 .....        16580         4536  LIBRARY/Mouth shape graphic symbols/A I.xml
2022-07-09 11:57:48 .....        14404         4089  LIBRARY/Mouth shape graphic symbols/C D G K N R S Th Y Z.xml
2022-07-09 11:57:48 .....        14531         4044  LIBRARY/Mouth shape graphic symbols/E.xml
2022-07-09 11:57:48 .....        15848         4008  LIBRARY/Mouth shape graphic symbols/F V D Th.xml
2022-07-09 11:57:48 .....        13024         3546  LIBRARY/Mouth shape graphic symbols/L D Th.xml
2022-07-09 11:57:48 .....         2106          752  LIBRARY/Mouth shape graphic symbols/M B P Closed.xml
2022-07-09 11:57:48 .....        14200         3955  LIBRARY/Mouth shape graphic symbols/O.xml
2022-07-09 11:57:48 .....        11152         2963  LIBRARY/Mouth shape graphic symbols/Open_Rest.xml
2022-07-09 11:57:48 .....        14777         4069  LIBRARY/Mouth shape graphic symbols/U.xml
2022-07-09 11:57:48 .....         8287         2228  LIBRARY/Mouth shape graphic symbols/W Q.xml
2022-07-09 11:57:48 .....        15768         3914  LIBRARY/panda.xml
2022-07-09 11:57:48 .....        10477         1064  LIBRARY/sample graph.xml
2022-07-09 11:57:48 .....          538          538  bin/SymDepend.cache
2022-07-09 11:57:46 .....           25           25  mimetype
2022-07-09 11:58:18 .....            9            9  Flash5.5-S01v5.xfl
------------------- ----- ------------ ------------  ------------------------
2022-07-09 11:58:18             469112       213163  22 files, 4 folders

Turns out majority of the samples I have from many versions of Adobe Flash after CS5 have a ZIP Header error. When using the new signatures in DROID, the samples with the header errors will fail in the DROID’s zip library processing. The DROID logs shows this issue:

Could not process the potential container format (ZIP): file:///Flash5.5-S01v5.fla	
Expected 25 more entries in the Central Directory!

The Central Directory header in a ZIP file is quite important to the proper function of the ZIP container. Wikipedia has a great explanation of the header. You may notice in the listing above the file “mimetype” is shown twice which is probably the extra entries the parser wasn’t expecting.

So currently the identification of majority of these FLA formats is on hold until a way is discovered to ignore the error and continue the container identification in DROID.

Apple Package Format

Let’s talk about Apple’s iWork software. Apple’s Office Suite of applications was first released in 2005 and provided a WordProcessor (Pages), Presentations (Keynote), and a little later, Spreadsheet (Numbers). They are exclusive to the Macintosh and iOS devices.

iWork was released in a few different versions. They get a little confusing as each application has its own version which all seemed to unify and stabilize in 2020. Here is a matrix of major versions.

VersionPackage or ZIP
iWork ’05Package
iWork ’06Package
iWork ’08Package
iWork ’09ZIP
iWork 2013Package
iWork 2014ZIP
iWork 2019ZIP
iWork 2020ZIP

You may already be aware but MacOS can sometimes be weird. I use the term weird in a loving, sometimes proud way, but I admit, there was some “odd” choices made in regards to how applications and documents are used and stored files on a Mac.

On early Macintosh computers Apple used an interesting method of storing resources for applications and some file formats. The Resource Fork for an application contained all the “resources” needed to run in the operating system. It would contain all the icons, warning screens, graphics, sounds, etc. This help true until Mac OS X came along and then Apple started using a bundle or package format. Still in use today, what appears to be a single file or application is actually a folder of all the resources needed to run the application.

Show Package Contents

By right clicking or control clicking on the icon you can open the folder and see all the contents which make up the Application.

Directory listing of Pages.app on MacOS

Nifty right? The MacOS which knows which extensions to treat as a package. If you were to move the application over to another system it would be a folder with the extension “.app”.

For an application I can see how this makes sense as it will only execute in the MacOS environment. The problem comes in when you use the same package method for the documents the application creates.

Contents of Pages version 1 sample file.

So instead of a single “file” with a bytestream, you get a folder of files which make up the file format. Here is Apple’s description:

Document Packages

If your document file formats are getting too complex to manage because of several disparate types of data, you might consider adopting a package format for your documents. Document packages give the illusion of a single document to users but provide you with flexibility in how you store the document data internally. Especially if you use several different types of standard data formats, such as JPEG, GIF, or XML, document packages make accessing and managing that data much easier.

Apple actually defines two similar methods:

Although bundles and packages are sometimes referred to interchangeably, they actually represent very distinct concepts:

  • package is any directory that the Finder presents to the user as if it were a single file.
  • bundle is a directory with a standardized hierarchical structure that holds executable code and the resources used by that code.

A couple years ago a processed digital collection made its way down to me. It had been processed by a new digital archivist and when I went to prepare the collection for preservation, I found a folder with the extension .pages and inside the folder a whole directory of files. Many of which they had renamed and arranged. Needless to say, I had to track down the original disk so I could properly preserve the file.

So looking back at the earlier table, iWork switched back and forth between the package format and a ZIP container. For preservation purposes, the ZIP container is easier to maintain outside the MacOS. Lets look a little closer at each. If you would like to follow along I have copied a few samples onto a hybrid ISO.

iWork ’05 through iWork ’08 used the same package format and structure. Because they are a package format, they are difficult to preserve as original files. I suppose you could zip them up, but probably the best option is to open with a current version of Pages and save to the newer ZIP container format.

tree iWork08/Keynote-06.key 
├── Contents
│   └── PkgInfo
├── QuickLook
│   └── Thumbnail.jpg
├── index.apxl.gz
└── theme-files
    ├── Blue 2.jpg
    ├── Blue 2.tif
    ├── Cool Gray-2.jpg
    ├── Cool Gray.tif
    ├── Green-8.jpg
    ├── Green.tif
    ├── Headlines_bullet.pdf
    ├── Headlines_star.pdf
    ├── Orange 2.tif
    ├── Orange_2.jpg
    ├── Purple-6.jpg
    ├── Purple.tif
    ├── Red.jpg
    ├── Red.tif
    ├── endpoints.pdf
    └── headlines_hi-res.jpg

iWork ’09 changed this practice. The documents saved from Pages, Keynote, and Numbers were contained in a ZIP file and can be identified using the PRONOM registry container signatures.

filename : 'iWork 2013/Pages2013-Sample09.pages'
filesize : 105900
modified : 2019-11-21T20:36:00-07:00
matches  :
  - ns      : 'pronom'
    id      : 'fmt/1439'
    format  : 'Apple iWork Pages'
    version : '09'
    class   : 'Word Processor'
    basis   : 'extension match pages; container name index.xml with byte match at 195, 76' 
Sample09.pages
Type = zip
WARNINGS:
Headers Error
Physical Size = 105900

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:36:00 .....       364773        22413  index.xml
2019-11-21 20:36:00 .....         7007         7007  Hardcover_bullet_black.png
2019-11-21 20:36:00 .....        69176        69176  Simple_Noise_2x.jpg
2019-11-21 20:36:00 .....          232          232  buildVersionHistory.plist
2019-11-21 20:36:00 .....         6406         6406  QuickLook/Thumbnail.png
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:36:00             447594       105234  5 files

Then Apple went back to a Package format with iWork 2013. For reasons unknown. But the content and structure changed. Its a package format with a Index.zip instead of index.xml

Pages2013-Sample.pages
├── Data
│   └── Hardcover_bullet_black-13.png
├── Index.zip
├── Metadata
│   ├── BuildVersionHistory.plist
│   ├── DocumentIdentifier
│   └── Properties.plist
├── preview-micro.jpg
├── preview-web.jpg
└── preview.jpg

3 directories, 8 files

The ZIP within the package contains a new Apple format. IWA

Pages2013-Sample.pages/Index.zip
Type = zip
Physical Size = 39361

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:47:14 .....         3860         3860  Index/Document.iwa
2019-11-21 20:47:14 .....           26           26  Index/Tables/DataList.iwa
2019-11-21 20:47:14 .....          336          336  Index/ViewState.iwa
2019-11-21 20:47:14 .....          160          160  Index/CalculationEngine.iwa
2019-11-21 20:47:14 .....          121          121  Index/DocumentStylesheet.iwa
2019-11-21 20:47:14 .....        31931        31931  Index/ThemeStylesheet.iwa
2019-11-21 20:47:14 .....           22           22  Index/AnnotationAuthorStorage.iwa
2019-11-21 20:47:14 .....         1889         1889  Index/Metadata.iwa
------------------- ----- ------------ ------------  ------------------------
2019-11-21 20:47:14              38345        38345  8 files

Luckily Apple came to their senses and went back to the ZIP container format for iWork 2014 and later. The container signature looks for the IWA file Apple started using with iWork 2013.

filename : 'iWork 2014/Pages2014-Sample.pages'
filesize : 66256
modified : 2019-11-22T00:03:56-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/1441'
    format  : 'Apple iWork Document'
    version : '14'
    class   : 'Presentation, Spreadsheet, Word Processor'
    basis   : 'extension match pages; container name Index/Document.iwa with byte match at 16, 6; name Metadata/Properties.plist with name only'
Path = iWork 2014/Pages2014-Sample.pages
Type = zip
Physical Size = 66256

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2019-11-22 00:03:54 .....         3930         3930  Index/Document.iwa
2019-11-22 00:03:54 .....          364          364  Index/ViewState.iwa
2019-11-22 00:03:54 .....          206          206  Index/CalculationEngine.iwa
2019-11-22 00:03:54 .....        33573        33573  Index/DocumentStylesheet.iwa
2019-11-22 00:03:54 .....           22           22  Index/AnnotationAuthorStorage.iwa
2019-11-22 00:03:54 .....           23           23  Index/DocumentMetadata.iwa
2019-11-22 00:03:54 .....         8761         8761  Index/Metadata.iwa
2019-11-22 00:03:54 .....          322          322  Metadata/Properties.plist
2019-11-22 00:03:54 .....           36           36  Metadata/DocumentIdentifier
2019-11-22 00:03:54 .....          273          273  Metadata/BuildVersionHistory.plist
2019-11-22 00:03:54 .....        14611        14611  preview.jpg
2019-11-22 00:03:54 .....          838          838  preview-micro.jpg
2019-11-22 00:03:54 .....         1571         1571  preview-web.jpg
------------------- ----- ------------ ------------  ------------------------
2019-11-22 00:03:54              64530        64530  13 files

Now iWork was not the only Apple software to use the Package/Bundle format for their documents. Be advised the following software may save to the package format.

I remember a few years ago, Trent Reznor (NIN) decided to release a few of his tracks in the Garageband format. A little harder to find these days, but the good old wayback machine kept a copy for us! Grab them here. Be warned, they may be in the package format. Thanks Apple!

DiskDoubler

A few years ago I had someone contact me with a desperate plea. They had a disk which contained years of journal entries and letters to loved ones she could no longer access. She had used a Macintosh in the late 1980’s and early 1990’s to create all these files, but wanted to convert them all to PDF so she could make a book. She said she had tried everything, contacted a lot of people and her son had told her it was a lost cause. In talking with others at my institution, they knew I had a background in older Macintosh formats and so she contacted me. I made no promises, but offered to try.

The files she provided were indeed early Macintosh files. One obvious trait was the lack of an extension. One might think a lack of an extension was poor planning for Apple, but they choose a different method for the operating system to know the relationship between files and applications. They did this through the use of a Type/Creator code. If you were a software developer for the Macintosh you could register a four character “Creator” code, then for all the different files you used with your software you could register a “Type” code. This told the Macintosh operating system exactly which software created the file and the type so it could be opened properly. Unlike today where an extension is defaulted to one application even if it isn’t the software which created the file.

ResEdit view of Hypercard Stack Info

In some ways this was a superior identification method as there was many software titles which could all create the same file format, but this way the correct software would open the file and render it correctly.

Looking at the files provided to me, there was a few which at first seemed like they were damaged somehow, they were extremely small compared to the other files. About half the size. When I opened them in a hex editor this is what I saw.

Usually document formats during this time would keep the text in plain ascii, but these files were different, they had binary data. In the header was the only plain text strings in the file, “WDBNMSWD”. I had seen these codes before, a Microsoft Word Document! But they weren’t….. What are they?

The head of the file has the hex values “ABCD0054”, so I started searching the internet for some help. There were others having the same problem I was having. I finally came across a tool called the “Unarchiver“. Running the command line version of the software “unar”, suddenly I had a file twice the size and could be opened by Microsoft Word!

unar Letter 
Letter: DiskDoubler
"./Letter" already exists.
Successfully extracted to "./Letter-1".

Remember back in the 1990’s when storage was expensive? Instead of dropping another $20 for a 100MB ZIP Disk, you could use Symantec’s DiskDoubler. The software would be installed on your Macintosh and then a window would come up showing you all the files on your drive. With one click you could compress a single file or a directory of files saving you tons of space. When you needed the file, just double click and the software would uncompress on the fly and then open the correct application to edit the file.

With a few clicks I was able to uncompress all the affected files and provide a PDF of all the letters and journals my new friend had tried so desperately for years to open. She was thrilled to say the least.

But why stop there? PRONOM needs to know about this format!

Once I had DiskDoubler installed I could make a few more samples, where is where I found there was a few different compression methods used by the software. They are labeled AD 1 & 2 and DD 1, 2 & 3. Making samples of each of the different types I was able to confirm the first 4 bytes of every file was the hex values “ABCD0054”. I was able to submit the format to PRONOM and it was added and given the PUID fmt/1399.

One of the other features of DiskDoubler was an ability to create a Self Extracting Archive (SEA). An sea file could contain a compressed file but also contained the code to uncompress itself. This was mostly seen with the Stuffit software, but there were many other compression tools which could write to this format. The Stuffit formats have been added to PRONOM which include identification of an SEA created by stuffit, but the SEA created by DiskDoubler is different and needs to be added.

Shockwave Audio

Ok, confession time.

There is only a couple moments in my tech history which had a profound effect on me, enough to sear the memory of the moment into my brain. When I was in college around 1997 I had a decent CD collection and I had learned how to copy those AIFF files off the disc and use them on my trusty PowerCenter Pro. These files were huge, at the time. I knew a regular size song would take up around 50MB on my hard drive. This was a lot of space back in 1997, but I could then mix them with other songs, something I did sometimes for friends I had on the dance team. I didn’t have a CD burner at the time so I would transfer them to cassette tape. I know, but remember this was the 1990’s when everything was changing and expensive.

One night I was exploring the world wide web and I happened across someone sharing a few songs. I assumed they were just clips as they were only 5MB in size, a tenth the size they should be. I downloaded the song, which of course still took a few minutes back in those days. When I played the song, I was dumbfounded, it was the whole song. I was completely confused. How could they take a 4+ minute song and compress it down to under 5MB? This was amazing.

I started grabbing every song I could find. Before long I had quite the collection. And before you judge me for downloading music from the web, this was a couple years before the advertisement we all remember reminding us that we wouldn’t steal a car so why would we steal music.

The files I found on the internet were MP3 files, the same we are familiar with today. Back then creating MP3 files wasn’t easy. MP3 was actually a licensed product so you had to get a little creative in order to make them. On my Macintosh PowerCenter Pro, there were even fewer options. I was already familiar with the sound editing application from Macromedia called SoundEdit 16, it was the tool I used to do all my editing. I found there was a plugin you could add which allowed export to a format called Shockwave Audio. This was meant for use in Macromedia’s Director application to add sound to the growing Flash animation industry. Once I got the plugin and installed I couldn’t stop making files and I made them as fast as I could. For a whole album this could take over an hour on my hardware, but it was worth it. Before long I had a large collection of popular music ready to play at a moments notice. My player of choice was MacAMP, a sibling of the popular WinAMP. I even borrowed some equipment from a friend who DJ’d on the weekends and DJ’d a college dance. I lugged my whole PowerCenter Pro tower and 17in trinitron monitor over to the school. It was so much fun and folks didn’t understand when they asked to see my CD collection.

Enough about transgressions from my youth, lets talk about the Shockwave Audio format.

To create a SWA file you would first need SoundEdit 16 Version 2. Then the plugins to enable export. This would only run on PowerPC computers running Macintosh OS or Classic in Mac OS X. For this post I pulled out my trusty PowerBook G4 Titanium running MacOS 9 and MacOS X 10.2. Installed SoundEdit 16 and the plugins in the Xtras folder and we are good to go.

Before you export you need to set what bitrate you prefer for the final file, giving you the option of 8KBits up to 160KBits per second. The higher the bitrate the longer it took and made larger files.

SoundEdit 16 had a native audio format and also frequently used the SoundDesigner II format to save the uncompressed files. On a Macintosh you had to be careful as these formats did not travel well to other systems on account of the resource forks associated with the data.

Because these SWA files were meant to be used in websites and other non-Mac systems, they did not have a resource fork, but had the Creator/Type codes, SwaT/SHCK. An extension wasn’t necessary for use on your Macintosh, but it was best to use .swa.

Here is what the data looks like for a SWA file.

Even though the SWA format uses MPEG compression, this is not a typical header you might see in a MP3. There was no ID3 tags at the time so not much in terms of metadata.

General
Complete name                            : tone2.swa
Format                                   : MPEG Audio
File size                                : 80.7 KiB
Duration                                 : 5 s 166 ms
Overall bit rate mode                    : Constant
Overall bit rate                         : 128 kb/s
FileExtension_Invalid                    : m1a mpa mpa1 mp1 m2a mpa2 mp2 mp3

Audio
Format                                   : MPEG Audio
Format version                           : Version 1
Format profile                           : Layer 3
Format settings                          : Joint stereo / MS Stereo
Duration                                 : 5 s 172 ms
Bit rate mode                            : Constant
Bit rate                                 : 128 kb/s
Channel(s)                               : 2 channels
Sampling rate                            : 44.1 kHz
Frame rate                               : 38.281 FPS (1152 SPF)
Compression mode                         : Lossy
Stream size                              : 80.7 KiB (100%)
ffprobe -i tone2.swa 
[mp3 @ 0x155704a60] Format mp3 detected only with low score of 25, misdetection possible!
[mp3 @ 0x155704a60] Skipping 324 bytes of junk at 0.
[mp3 @ 0x155704a60] Estimating duration from bitrate, this may be inaccurate
Input #0, mp3, from 'tone2.swa':
Duration: 00:00:05.15, start: 0.000000, bitrate: 128 kb/s
Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 128 kb/s

There are a few consistencies among all my files. They all begin with the hex values “00000140000000030000” for the first 10 bytes and all of them seem to have the string “MACRZ” at offset 36. I haven’t been able to find a open specification for this file format, so we will have to go with what we can find in the samples. According to ffprobe from above, there is 324 bytes of a header before the first MP3 frame starts.

MPEG signatures are difficult, there are no headers, just a sequence of frames. This is why there are often so many identification conflicts with the MP3 format. These SWA files indeed identify as MP3 files, but with a mismatch extension.

filename : 'tone2.swa'
filesize : 82661
modified : 1970-01-01T00:00:00-07:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/134'
    format  : 'MPEG 1/2 Audio Layer 3'
    version : 
    mime    : 'audio/mpeg'
    class   : 'Audio'
    basis   : 'byte match at 0, 4088 (signature 5/9)'
    warning : 'extension mismatch'

If we wanted to distinguish an SWA from an MP3 we would need to create a new signature and give it priority over the MP3 signature. There is enough of a header this would be possible and easier, but since they are, in reality, just MP3 files does it matter? Trying to play a SWA on a modern computer is only possible if you change the extension to MP3.

If you want to take a look at some samples you can grab a couple I made on my GitHub page or check out some commercially made files for an awesome Star Trek Starship Creator game.

MP4 & 360

Recently I have been exploring the MP4 format, more specifically the ISO Base Media File Format. It appears to be quite the versatile format. Based on the general Box/Atom format. Don’t mean to go much into the format here as there are so many formats which use this structure, like Quicktime MOV, Jpeg2000, to the more recent Canon RAW CR3. I have also been digging into the DASH MP4 format, but we’ll save that for a later time.

One of the more interesting uses of MP4 lately is 360 or spherical video. They are becoming more and more popular with content creators and also used for mapping like Google street view.

A while back I picked up a Insta360 Nano S camera. It attached directly to my iPhone. With a camera on each side it could capture images and video which could later be processed to produce some interesting results.

Of course it needs to be processed first so it doesn’t look like you are peering out of your peephole. Insta360 provides software for you to process the video into a regular video or some fun creative spherical video that makes you look like you are walking on a small globe.

The formats produced by the Insta360 Nano S are plain old JPG and MP4, but uses the extensions .INSP and .INSV respectively. Neither of which are documented in PRONOM yet. But because of the nature of 360 camera’s there is a little more under the hood. If you would like to look at some samples you can find some here.

The INSP file begins like any other EXIF JPEG file, but ends with a little additional info.

The 360 cameras have some additional information from the different gyros and accelerometers, as well as GPS information. The INSP file stores much of this information after the end of the JPG format. You can also see a string of alphanumeric numbers at the end, which is consistent with most of the files I have seen. One python parser of the additional data calls it the magic number. “8db42d694ccc418790edff439fe026bf” would make a good pattern for a signature.

The INSV files are similar, except they use the MP4 base media format.

Mediainfo indeed sees the file as an MPEG-4 with a AVC codec, but with a invalid extension.

Complete name                            : VID_20210222_170428_005.insv
Format                                   : MPEG-4
Format profile                           : JVT
Codec ID                                 : avc1 (avc1/isom)
File size                                : 41.1 MiB
Duration                                 : 7 s 608 ms
Overall bit rate mode                    : Variable
Overall bit rate                         : 45.4 Mb/s
Encoded date                             : UTC 2021-02-22 17:04:18
Tagged date                              : UTC 2021-02-22 17:04:18
IsTruncated                              : Yes
FileExtension_Invalid                    : braw mov mp4 m4v m4a m4b m4p m4r 3ga 3gpa 3gpp 3gp 3gpp2 3g2 k3g jpm jpx mqv ismv isma ismt f4a f4b f4v

In addition to a video and audio track, there is a text track.

Text
ID                                       : 3
Format                                   : Timed Text
Codec ID                                 : text
Duration                                 : 7 s 600 ms
Bit rate mode                            : Constant
Bit rate                                 : 240 b/s
Frame rate                               : 10.000 FPS
Stream size                              : 228 Bytes (0%)
Title                                    : Ambarella EXT
Language                                 : English
Forced                                   : No
Encoded date                             : UTC 2021-02-22 17:04:18
Tagged date                              : UTC 2021-02-22 17:04:18

With a little Exiftool magic, thank you Phil, we can see some of the extra data within the video file.

Serial Number                   : ISS2418ND7XH4H
Model                           : Insta360 Nano S
Firmware                        : v1.17.12.3_build1
Parameters                      : 2 947.866 946.388 964.646 0.000 0.000 90.000 942.993 2891.656 952.520 -0.682 -1.501 89.186 3840 1920 1040
Preview Image                   : (Binary data 578944 bytes, use -b option to extract)
Time Code                       : 62.155
Accelerometer                   : 0.0717358812689781 0.837667405605316 -0.541449248790741
Angular Velocity                : -0.00380666344426572 -0.0143540045246482 0.0170918852090836

Thanks to tools like Exiftool and MediaInfo we can take a peek into some of these formats. New ways of using the existing formats and new formats entirely keep popping up making it hard to know exactly what you have. Initially I just assumed the Insta360 formats didn’t need anything extra as they just used well known format with their own extension, but I needed to look a little closer. Many other cameras are now putting additional data at the end of a standard JPG. It will be interesting to see what new ideas camera developers come up in the coming years.

GoPro has a 360 camera as well and looking at a sample .360 file, I can see it also uses an MP4 base media format, but uses two video tracks to store video from the two cameras. Might need to dig into that format soon as well.

JPG Structure

If you hadn’t been over to see the posters made by Ange Albertini, head over now. Below is his poster on the JPG image file format. This is the basic JFIF file format, which stands for JPEG File Interchange Format. There are also raw JPEG streams and Exif, Exchangeable Image File Format.

The basic format is pretty straight forward. There is a start of image marker FFD8 some format information, then the raster compressed data, then an end of image marker FFD9. Identification of a JPEG file should be pretty straight forward. Knowing the start and end marker values and then the type of JPEG based on the Application data, can be very specific. That is until some software engineers start playing fast and loose with the format specifications.

A while back I received a JPG file which didn’t identify using the latest PRONOM signature. It’s happened before, some new phones came out and started using a newer version of the exif specification so I submitted an update to PRONOM for JPG’s using exif 2.3 and greater. But also may need to submit another signature soon for the newly released Exif 3.0 specification! But this JPG I received wasn’t a new version, it should have been identified with the current PRONOM signature. It started with FFD8 and when I went to look at the end of the file for the end of image marker FFD9, it wasn’t where I expected it to be.

This JPG file had an additional 9632 bytes after the FFD9 end of image marker. But why? The image rendered just fine in multiple JPG viewers. The only warning from Exiftool was for “Unrecognized MakerNotes”, which is not too uncommon. So I went to the JPG Exif specification.

EOI, Recording this marker is mandatory. It shall be recorded in this position.

But reading a little further we see…..

Moreover, Exif/DCF readers should be implemented to operate without interruption even if certain kinds of data have been recorded after EOI of the primary image defined in the Exif standard. Specifically, unknown data after EOI of the primary image should be skipped. (see section 4.7.1)

So the extra data is allowed by specification. Any readers should ignore or skip any data after the EOI (End of Image). Well that makes identification more difficult. All the PRONOM signatures are based on having the EOI marker at the “End”. Some have allowance for padding, but not enough for the worst offenders……

The image referenced above was created on a Huawei MHA-L29 cameraphone. But since finding this image, I have also found many Samsung phones do the same thing. Here is one from a Samsung SM-G975U1. Much less padding but enough to throw off identification.

Apple iPhones are also not exempt from this “feature” either. When using the MacOS ImageCapture tool with the HEIC format, a bug can add an excessive amount of empty data at the end of the converted JPG file.

So, when it comes to identification, if your JPG files don’t seem to identify correctly, look closer at the end of the file, it may have some “extra” data.

Corel ArtShow

File extensions are the easiest way to quickly identify a file format, but they can be misleading. This is the reason in Digital Preservation format identification tools like DROID are important to look closer at the file structure to more accurately identify formats. The other complication is some extensions are used for more than one format. Extensions like .DOC or .ISO can be used with many formats.

The PRONOM registry which DROID uses will list extensions associated with each format signature, but for some, they only have an extension and no signature. It’s nice to have an official ID to go with a format but with no signature it only matches based on extension.

This caused a problem awhile back for me while working with some files with the extension CDX. Which according to PRONOM, there are 5 completely different formats which use the extension, and probably others.

My CDX was related to some indexing software called Cindex. At the time the only format with a signature was for the WARC summary file CDX. The other was for a CorelDraw Compressed format with no signature. Confusing right? When I would run format identification on my Cindex files, they would default to the CorelDraw Compressed format, identified by extension. It was easy enough to create a signature for the Cindex format as I had enough samples to know the patterns needed for correct identification. But I was curious about the CorelDraw format. Should be easy to find, right?

Wrong. Finding a sample of this format was very elusive. All I had to go by was the name given to the format by PRONOM and the extension. I scoured every Corel CD and image I could get my hands on. For months I looked and could never find a single CDX file. Each CorelDraw software I was able to run did not have any ability to save in the CDX format. I scoured clipart discs, other Corel software, like Designer, PrintHouse, Photo-Paint, nada, nothing. I started to wonder if the format even existed. That’s when I noticed in the filters included with CorelDraw a reference to the ability to import a CDX but not write to one.

[CDX]
Signature=CORELFILTER - A
FilterEntry=1
Description=CorelDRAW Compressed (CDX)
FilterFullName=CorelDRAW Import Filter
Version=Version 6.00
Company=Corel Corporation
Copyright=Copyright © 1988-1995 Corel Corporation
Extensions=*.CDX
CorelID=0x704
FilterCapability1=0x9000
FilterCapability2=0x0
NoOfCompressions=0

This led to me finding a reference on the old Corel FTP site for knowledge base number 4550.

It mentioned something called ArtShow, where version 5 supported the file format CDX. ArtShow was a gallery of winning designs released on a CD-ROM and book each year. The first one being ArtShow 91, then ArtShow 3, 4, 5, 6, and finally 7 was the last. Each one released used a different proprietary compressed format for storing all the designs, these formats exist nowhere else. The question remains, why didn’t they use other popular Corel formats like CDR, CMX, or CCX which were used on many other clip art titles.

It took some time but I was finally able to find copies of a few of the Artshow CD-ROM discs, especially numbers 5 & 6. Which had the CDX format and the second generation CPX formats.

Each format had a easy to recognize header making a PRONOM signature easy to create. PRONOM already had the PUID for the two formats CDX & CPX, so sending in the signature added to the registry and hopefully will help distinguish between all the CDX formats!