Adobe Illustrator is a powerful design tool. Originally released in 1987 for the Macintosh, it has been the vector design tool of choice for many professionals.
Originally the Adobe Illustrator format (AI) was based on PostScript, with each file having a PostScript header. This changed with Illustrator version 9, released in 2000, which moved to PDF as its core.
Even though AI files begin with a PDF header, there is much more to them, which makes them a unique file format. As Dov Isaacs put it, “PDF files are not Adobe Illustrator files and vice versa”.
In digital preservation, identifying a file format is vital to the process. It is also important to identify when the format changes over time in order to properly maintain that file. Adobe Illustrator files created in version 8 or earlier are substantially different from those created in version 9 and later, and need different software to render properly.
This is where identification tools come in handy.
Running the “file” command on the command line, we get:
Illustrator9v8-s04.ai: PostScript document text conforming DSC level 3.0
Illustrator9-s04.ai: PDF document, version 1.4, 1 pages
While these results are partly true, we need more specific identification if we want to properly preserve these files in the long term. Enter PRONOM, a file format registry that uses signatures to identify file formats. Using a tool like DROID or Siegfried with the PRONOM registry, we can get better identification.
siegfried   : 1.10.0
scandate    : 2023-04-07T11:59:18-06:00
signature   : default.sig
created     : 2023-03-23T15:09:43Z
identifiers :
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V111.xml; container-signature-20230307.xml'
---
filename : 'Illustrator9-s04.ai'
filesize : 77829
modified : 2023-04-07T11:04:53-06:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'fmt/558'
    format  : 'Adobe Illustrator'
    version : '9.0'
    mime    : 'application/postscript'
    class   : 'Image (Vector)'
    basis   : 'extension match ai; byte match at [[0 8] [1536 557]]'
    warning :
---
filename : 'Illustrator9v8-s04.ai'
filesize : 323748
modified : 2023-04-07T11:05:11-06:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'fmt/557'
    format  : 'Adobe Illustrator'
    version : '8.0'
    mime    : 'application/postscript'
    class   : 'Image (Vector)'
    basis   : 'extension match ai; byte match at 0, 673'
    warning :
This identification is possible because of signatures built for the file format specific to each version. The file format wiki has a list of the current signatures for the Illustrator format. The problem is, the last signature added to PRONOM was for version 16 (CS6). Since then there have been more changes to the format.
If we attempt to identify an Illustrator file created with current 2023 software, we get this result:
filename : 'Illustrator2023-s01.ai'
filesize : 1195445
modified : 2023-02-16T12:29:16-07:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'fmt/20'
    format  : 'Acrobat PDF 1.6 - Portable Document Format'
    version : '1.6'
    mime    : 'application/pdf'
    class   : 'Page Description'
    basis   : 'byte match at [[0 8] [1195439 5]]'
    warning : 'extension mismatch'
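Although this is technically correct, as the Illustrator file has a PDF 1.6 header, identification needs to recognize this as an Illustrator file. If we create a new signature using the following hexadecimal pattern:
255044462D312E36*3C696C6C7573747261746F723A547970653E446F63756D656E743C2F696C6C7573747261746F723A547970653E*252150532D41646F62652D332E30*254149355F46696C65466F726D6174203134
we get this result: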
filename : 'Illustrator2023-s01.ai'
filesize : 1195445
modified : 2023-02-16T12:29:16-07:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'BYUDev/3'
    format  : 'Adobe Illustrator CC 2020'
    version : '24.2+'
    mime    : 'application/postscript'
    class   :
    basis   : 'extension match ai; byte match at [[0 8] [8766 45] [45347 348]]'
    warning :
Let's break down the hexadecimal pattern. The “*” is a wildcard indicating that zero or more bytes may sit between the segments.
255044462D312E36 translates to: %PDF-1.6
3C696C6C7573747261746F723A547970653E446F63756D656E743C2F696C6C7573747261746F723A547970653E translates to: <illustrator:Type>Document</illustrator:Type>
252150532D41646F62652D332E30 translates to: %!PS-Adobe-3.0
254149355F46696C65466F726D6174203134 translates to: %AI5_FileFormat 14
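To make the breakdown concrete, here is a minimal Python sketch that decodes the four hex segments and approximates the wildcard pattern with a regular expression. This is only an illustration: DROID and Siegfried do not match signatures this way internally, and the names PATTERN, decode_segments, pattern_to_regex, and looks_like_ai_2020 are made up for this example.

```python
import re

# The four hex segments from the breakdown above, joined by "*" wildcards.
PATTERN = (
    "255044462D312E36"
    "*3C696C6C7573747261746F723A547970653E446F63756D656E74"
    "3C2F696C6C7573747261746F723A547970653E"
    "*252150532D41646F62652D332E30"
    "*254149355F46696C65466F726D6174203134"
)


def decode_segments(pattern: str) -> list[str]:
    """Decode each '*'-separated hex segment into its ASCII text."""
    return [bytes.fromhex(part).decode("ascii") for part in pattern.split("*")]


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Approximate the pattern as a bytes regex (an assumption, not PRONOM syntax)."""
    segments = [bytes.fromhex(part) for part in pattern.split("*")]
    # ".*?" stands in for the "*" wildcard; DOTALL lets it span newline bytes.
    body = b".*?".join(re.escape(segment) for segment in segments)
    return re.compile(body, re.DOTALL)


AI_REGEX = pattern_to_regex(PATTERN)


def looks_like_ai_2020(data: bytes) -> bool:
    """True if the segments appear in order, starting with the PDF header at offset 0."""
    # re.match only matches at the start of the data, like a signature anchored at BOF.
    return AI_REGEX.match(data) is not None


if __name__ == "__main__":
    for text in decode_segments(PATTERN):
        print(text)
```

Calling looks_like_ai_2020 on the bytes of a file (for example, the result of open(path, "rb").read()) gives a rough check, but a real identification should still come from a tool that uses the PRONOM signature file.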
Identification is based first on the PDF header, then on some XMP metadata indicating this is an Illustrator document, then on the PostScript header, and finally on the version identifier. Every Illustrator release since version 5 has a file format version. When Adobe switched from the CS labels to CC, they stuck with format version 13 until 2020, when the format changed to version 14. There is one catch: when Illustrator version 24 (2020) was first released, it used format version 14 but still had the PDF 1.5 header. This changed in version 24.2 to a PDF 1.6 header, which added a bigger canvas size.
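The version logic described above can be sketched in a few lines of Python. This is a rough illustration that only encodes the two format version 14 cases mentioned here; the function name guess_release and the wording of the returned strings are hypothetical, and the sketch assumes the “%AI5_FileFormat” comment is readable as plain bytes in the file.

```python
import re

# Both patterns are assumptions based on the header strings discussed above.
PDF_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")
AI_FORMAT = re.compile(rb"%AI5_FileFormat (\d+)")


def guess_release(data: bytes) -> str:
    """Combine the PDF header version and the AI format version into a rough guess."""
    pdf = PDF_HEADER.match(data)   # the PDF header must sit at offset 0
    fmt = AI_FORMAT.search(data)   # the format comment can be anywhere after it
    if pdf is None or fmt is None:
        return "not enough information"

    pdf_version = (int(pdf.group(1)), int(pdf.group(2)))
    ai_format = int(fmt.group(1))

    if ai_format == 14 and pdf_version == (1, 5):
        return "Illustrator 24 before 24.2 (format 14, PDF 1.5 header)"
    if ai_format == 14 and pdf_version == (1, 6):
        return "Illustrator 24.2 or later (format 14, PDF 1.6 header)"
    # Anything else is reported as-is rather than guessed at.
    return f"format version {ai_format}, PDF {pdf_version[0]}.{pdf_version[1]} header"
```

Reporting other combinations unchanged is deliberate: the text above only pins down the format 14 cases, so the sketch avoids guessing at anything older.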
In the current PRONOM signatures going back to version 9, some offsets were assumed for the space between the PDF header, PostScript header, and version number. Across many samples I have found quite a few that fall outside those offsets, especially as the AI file gets larger. Therefore I am suggesting the “*” wildcard between all segments.
One area that still needs a bit more research is Illustrator versions 9-12 (CS2). These do not include the XMP metadata indicating they are Illustrator documents, so they are more often misidentified as PDF. I did find, however, that AI files contain the string “/AIPrivateData”, while files saved as PDF contain “/AIPDFPrivateData”. The signature will have this string added to distinguish the two.
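As a quick way to test this observation against a set of samples, a small Python sketch could look for the two strings directly. The function name classify_private_data and its return labels are hypothetical, and the choice to let “/AIPrivateData” win when both strings appear is an assumption, not something confirmed by the samples.

```python
AI_KEY = b"/AIPrivateData"       # seen in native AI files
PDF_KEY = b"/AIPDFPrivateData"   # seen in PDFs saved from Illustrator


def classify_private_data(data: bytes) -> str:
    """Label file bytes by which private-data key they contain."""
    # Neither key is a substring of the other, so a plain search is unambiguous.
    if AI_KEY in data:
        return "illustrator"
    if PDF_KEY in data:
        return "pdf-with-illustrator-data"
    return "unknown"


def classify_file(path: str) -> str:
    """Read a whole file and classify it; fine for samples of modest size."""
    with open(path, "rb") as handle:
        return classify_private_data(handle.read())
```

Running classify_file over a folder of known samples would show how reliably the two strings separate the cases before committing them to a signature.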
Another anomaly is some samples I found on the Illustrator 9 CD-ROM. Illustrator 9 was released in June 2000, but many of these files were created in February 2000. They have a PDF 1.4 header but format version 4, which is what version 8 uses. These files were probably created with an early build of Illustrator 9, and the format version was incremented to 5 in the public release.
You can see my submission suggestion on my GitHub page along with the PRONOM signature and sample files. There are still a couple of tweaks I need to make, but let me know what you think.
Note: All Illustrator files, and PDFs saved with Illustrator compatibility checked, include a section of the file called “AI Private Data”, which is where all the Illustrator data lives. It includes a “creator” version and a “container” version, which could also be used to identify an Illustrator file's version.