TurboTax

Of all the different file formats found in everyday computing, most that find their way into my archive have historical value. We know we can’t keep everything, and we have to assign value to whatever we decide to keep for the long term. Some files have sensitive data, and we have to follow guidelines for their proper handling. Identification of files helps us know what type of data might be kept inside the format, so often I also need to identify formats we don’t plan on keeping.

I was recently looking through a large digital collection and a report of the files that were not identified in the initial scan. A few popped out to me because of their extension, TAX. Tax records are one thing we need to identify so we can handle them properly, but we are not likely to keep them in our repository.

These tax files come from the popular US-based TurboTax software. The software gets a new version every year because tax laws are constantly changing. It has also been around since 1984, so there are many versions to be aware of. Add the fact that there are personal and business versions, along with DOS, Windows, and Macintosh versions, and identification might get complicated. None of these formats are documented in the PRONOM registry. Wikidata is aware of a couple of the extensions, but does not have any signatures to help in identification.

Luckily, the collection I was processing had a number of years’ worth of records. Using them and a few others, I was able to put together a decent timeline of the formats used, at least from the early 1990s on. The format seemed to settle on the .TAX extension around the 1994 Windows version. Before this, the DOS versions stored the data in a group of files. Let’s look at a sample of the 1994 file from Windows.

% hexdump -C TT1994.TAX | head
00000000 54 75 72 62 6f 54 61 78 0d 0a 46 6f 72 6d 61 74 |TurboTax..Format|
00000010 3d 57 49 4e 0d 0a 56 65 72 73 69 6f 6e 3d 31 33 |=WIN..Version=13|
00000020 0d 0a 45 6e 67 69 6e 65 56 65 72 73 53 74 72 3d |..EngineVersStr=|
00000030 36 2e 30 30 2e 31 0d 0a 46 6f 72 6d 73 65 74 3d |6.00.1..Formset=|
00000040 53 31 39 39 34 55 53 31 30 34 30 0d 0a 43 65 6e |S1994US1040..Cen|
00000050 74 73 3d 59 65 73 0d 0a 53 68 6f 77 43 6f 6d 6d |ts=Yes..ShowComm|
00000060 61 73 3d 59 65 73 0d 0a 53 68 6f 77 43 6f 6c 6c |as=Yes..ShowColl|
00000070 61 70 73 69 62 6c 65 57 6f 72 6b 53 68 65 65 74 |apsibleWorkSheet|
00000080 73 3d 59 65 73 0d 0a 44 61 74 61 56 65 72 73 69 |s=Yes..DataVersi|
00000090 6f 6e 3d 31 0d 0a 46 6f 72 6d 46 69 6c 65 53 75 |on=1..FormFileSu|
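
Since this header is plain CRLF-separated text, it can be read as simple key=value pairs. Here is a minimal Python sketch of that idea. The function name `parse_text_header` is my own, the bytes are copied from the start of the dump above, and it assumes only the text portion of the file is passed in.

```python
# Bytes copied from the beginning of the 1994 hexdump above.
HEADER_1994 = (
    b"TurboTax\r\nFormat=WIN\r\nVersion=13\r\n"
    b"EngineVersStr=6.00.1\r\nFormset=S1994US1040\r\nCents=Yes\r\n"
)


def parse_text_header(data: bytes) -> dict:
    """Return the key=value pairs that follow the leading 'TurboTax' line.

    Hypothetical helper: it assumes the header is CRLF-separated text,
    which is all the sample above shows.
    """
    lines = data.split(b"\r\n")
    if lines[0] != b"TurboTax":
        raise ValueError("missing 'TurboTax' first line")
    fields = {}
    for line in lines[1:]:
        key, sep, value = line.partition(b"=")
        if not sep:
            break  # stop at the first line that is not key=value
        fields[key.decode("ascii", "replace")] = value.decode("ascii", "replace")
    return fields


fields = parse_text_header(HEADER_1994)
print(fields["Format"], fields["Version"], fields["Formset"])
```

For identification purposes only the first line or two matter, but the remaining keys (such as Formset) may be useful for telling product variants apart.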

I love these easy to identify format headers, but then jump to the next year, 1995, and the format changes.

% hexdump -C TT1995.TAX | head
00000000 c0 45 01 5f 0a 00 00 35 b5 06 36 2e 30 30 2e 31 |.E._...5..6.00.1|
00000010 00 00 c7 00 02 00 02 0d 00 00 00 b4 00 00 00 d9 |................|
00000020 00 0e 53 31 39 39 35 55 53 31 30 34 30 50 45 52 |..S1995US1040PER|
00000030 01 01 01 00 00 00 01 00 01 00 00 35 b5 00 0a c8 |...........5....|
00000040 00 01 00 01 09 00 00 00 cf 00 06 00 06 1d 00 00 |................|
00000050 00 3e 00 00 00 3e 00 00 00 64 00 00 00 64 00 00 |.>...>...d...d..|
00000060 00 7e 00 00 00 ce 13 7a 65 7a 50 65 72 73 69 73 |.~.....zezPersis|
00000070 74 65 6e 74 53 74 61 74 75 73 00 65 00 64 00 01 |tentStatus.e.d..|
00000080 00 00 00 00 00 00 ce 12 7a 74 6c 50 65 72 73 69 |........ztlPersi|
00000090 73 74 46 69 6c 65 44 61 74 61 00 00 00 00 00 00 |stFileData......|

The nice easy-to-read header is gone, but some other patterns start to appear. Most of the files from these early versions seem to include a code near the beginning that may help: “S1995US1040PER” is similar to the “S1994US1040” in the 1994 file. One could assume the “1040” is the tax form most Americans are used to, with “US” preceding the number. Then at the end of the string we see “PER”. This may refer to different versions of the tax software: a Personal version for individuals, and possibly other versions for businesses. I believe TurboTax also had versions for Canadians, so there may be many variations on this string. This could get complex. Let’s jump ahead to a 1999 file.
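
To make that guess about the string's layout concrete, here is a small Python sketch that splits such a code into year, country, form number, and trailing suffix. The pattern is an assumption based only on the two strings seen so far, and the field names are my own, so other variants may well not match.

```python
import re

# Assumed layout: 'S' + 4-digit year + 2-letter country + form number
# + optional letters (e.g. 'PER'). This is a guess from two samples.
FORMSET_RE = re.compile(rb"S(\d{4})([A-Z]{2})(\d{3,4})([A-Z]*)")


def parse_formset(code: bytes):
    """Split a form-set code into its guessed parts, or return None."""
    m = FORMSET_RE.fullmatch(code)
    if m is None:
        return None
    year, country, form, suffix = m.groups()
    return {
        "year": int(year),
        "country": country.decode("ascii"),
        "form": form.decode("ascii"),
        "suffix": suffix.decode("ascii"),
    }


print(parse_formset(b"S1994US1040"))
print(parse_formset(b"S1995US1040PER"))
```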

% hexdump -C TurboTax1999.tax | head 
00000000 c0 45 01 5f 0a 00 00 54 6a 16 4c 39 31 30 32 31 |.E._...Tj.L91021|

00000030 00 0e 53 31 39 39 39 55 53 31 30 34 30 50 45 52 |..S1999US1040PER|
00000040 00 00 01 00 00 00 25 00 00 00 00 00 00 00 00 00 |......%.........|
00000050 01 19 12 8f f1 00 0a 00 00 00 00 00 00 00 00 00 |................|
00000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c8 |................|
00000080 00 04 00 04 15 00 00 00 ec 05 00 00 c3 07 00 00 |................|
00000090 a3 08 00 00 c4 00 05 46 31 30 34 30 00 00 00 01 |.......F1040....|

The same string is visible, but of course with the year “1999”. We can also see a pattern in the first 4 bytes, “c0 45 01 5f”, which are the same as in the 1995 file. The file I have for 1998 is consistent as well. Jumping to the new millennium, we see a change.

% hexdump -C TurboTax2000.tax | head
00000000 c0 45 01 64 0a 00 00 2e 4f 18 4c 30 30 39 32 37 |.E.d....O.L00927|

00000030 00 dc 00 0b 53 32 30 30 30 55 53 31 31 32 30 00 |....S2000US1120.|
00000040 00 01 00 00 00 09 00 00 00 00 00 00 00 00 00 01 |................|

We see two changes in this file. One, the ASCII string is different: S2000US1120, with 1120 being the U.S. Corporation Income Tax Return. So this file came from a different version of the software. The other change is in the first 4 bytes. They are now “c0 45 01 64”, with the last byte changing from 5F to 64. Jumping to 2003, we see the same values.

% hexdump -C TurboTax2003.tax | head 
00000000 c0 45 01 64 0d 00 00 80 1b 26 54 59 30 33 5f 4c |.E.d.....&TY03_L|

00000040 58 03 00 dc 00 0e 53 32 30 30 33 55 53 31 30 34 |X.....S2003US104|
00000050 30 50 45 52 00 00 01 00 00 6a c6 00 00 00 00 00 |0PER.....j......|

Back to a 1040 form, but with the same header as the 2000 file. I am removing some lines from these dumps, just to be safe and avoid exposing any personal data. In 2004 we see a major change in the format.

% hexdump -C TurboTax2004.tax | head 
00000000 54 54 46 4e 01 01 6f 68 dc 62 00 00 00 00 4b 01 |TTFN..oh.b....K.|

Again, I am removing some lines to be safe. This header is very different, and there is no human-readable ASCII in the file, which suggests the content is binary and probably encoded. The header itself is new; I assume TTFN references TurboTax plus “format” or “file”, or possibly “Turbo Tax Financial Network”.

This header is then used for the next few years ending in 2013, but before we get there, the extension makes a change as well. In 2008, instead of the simple .TAX extension, the software begins to save the tax file with the extension .TAX2008. I don’t have a 2008 document, but I do have a sample 2009 document.

% hexdump -C TurboTax2009.tax2009 | head
00000000 54 54 46 4e 01 01 b5 68 02 24 00 00 00 00 4b 0b |TTFN...h.$....K.|
00000010 01 01 19 13 01 01 01 52 01 01 01 0b 01 01 4e 7a |.......R......Nz|

The last year to use the TTFN header is 2013.

% hexdump -C TurboTax2013.tax2013 | head
00000000 54 54 46 4e 01 01 87 22 6a ec 00 00 00 00 50 bd |TTFN..."j.....P.|

2014 is where I get a little confused. I have one file which uses the TTFN header and another which uses what becomes the standard going forward. By 2015, though, the format definitely uses a ZIP container as its structure. Here is a sample from 2015.

% hexdump -C TurboTax2015.tax2015 | head
00000000 50 4b 03 04 2d 00 02 00 08 00 e5 a6 51 48 ba 4d |PK..-.......QH.M|
00000010 43 67 15 06 00 00 10 06 00 00 0c 00 14 00 6d 61 |Cg............ma|
00000020 6e 69 66 65 73 74 2e 78 6d 6c 01 00 10 00 00 00 |nifest.xml......|
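
Those first bytes are a standard ZIP local file header, so they can be decoded field by field. Below is a rough Python sketch using the 48 bytes from the dump above. The function name and dictionary keys are my own, and it only reads the fixed 30-byte header plus the file name, nothing TurboTax specific.

```python
import struct
from datetime import date

# First 48 bytes of the 2015 sample, copied from the hexdump above.
SAMPLE = bytes.fromhex(
    "50 4b 03 04 2d 00 02 00 08 00 e5 a6 51 48 ba 4d "
    "43 67 15 06 00 00 10 06 00 00 0c 00 14 00 6d 61 "
    "6e 69 66 65 73 74 2e 78 6d 6c 01 00 10 00 00 00"
)

# Fixed part of a ZIP local file header: 30 bytes, little-endian.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def parse_local_header(data: bytes) -> dict:
    """Decode the first ZIP local file header found at the start of data."""
    (sig, version, flags, method, mtime, mdate,
     crc32, comp_size, uncomp_size, name_len, extra_len) = LOCAL_HEADER.unpack_from(data)
    if sig != b"PK\x03\x04":
        raise ValueError("not a ZIP local file header")
    name = data[LOCAL_HEADER.size:LOCAL_HEADER.size + name_len]
    return {
        "version_needed": version,   # 45 means ZIP64 features are required
        "method": method,            # 8 is Deflate
        # MS-DOS date: 7 bits year since 1980, 4 bits month, 5 bits day
        "modified": date(1980 + (mdate >> 9), (mdate >> 5) & 0x0F, mdate & 0x1F),
        "compressed_size": comp_size,
        "uncompressed_size": uncomp_size,
        "name": name.decode("cp437"),
        "extra_len": extra_len,
    }


hdr = parse_local_header(SAMPLE)
print(hdr)
```

In this sample the first entry is manifest.xml, and the compressed size (1557 bytes) is slightly larger than the uncompressed size (1552 bytes), which is what you would expect if the content does not compress well.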

Let’s take a look inside the ZIP container of a 2017 dummy sample.

% 7z l TurboTax2017.tax2017
7-Zip [64] 17.05 : Copyright (c) 1999-2021 Igor Pavlov : 2017-08-28
p7zip Version 17.05 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,8 CPUs LE)

Scanning the drive for archives:
1 file, 769814 bytes (752 KiB)

Listing archive: TurboTax2017.tax2017

--
Path = TurboTax2017.tax2017
Type = zip
WARNINGS:
Headers Error
Physical Size = 769814

Date Time Attr Size Compressed Name
------------------- ----- ------------ ------------ ------------------------
2026-03-28 20:25:38 ..... 576 581 manifest.xml
2026-03-28 20:25:38 ..... 768688 768923 084A702A-CD3D-4623-B8B7-EE4800BB151F
------------------- ----- ------------ ------------ ------------------------
2026-03-28 20:25:38 769264 769504 2 files

Warnings: 1

The files all seem to contain a manifest.xml and a second file named with a unique identifier. 7-Zip also reports a header error with the ZIP files, which may have been done on purpose. Now comes the odd part: the manifest.xml file does not render as XML; it is binary.

% hexdump -C TurboTax2017/manifest.xml | head
00000000 a1 b1 fe fb 37 18 dd 9c 08 2d 9c 86 23 00 10 fa |....7....-..#...|
00000010 12 60 92 bb dc 92 a5 df 1a 24 16 4e a9 28 89 80 |.`.......$.N.(..|
00000020 64 33 66 55 c5 93 f0 68 44 d0 7c f9 56 86 42 2c |d3fU...hD.|.V.B,|
00000030 80 ba 8a 95 2a 82 6d 32 75 84 b1 f1 e2 18 93 5c |....*.m2u......\|
00000040 82 4d 18 f9 ed 23 4f dc d6 b5 7f f2 20 1e 30 59 |.M...#O..... .0Y|
00000050 d5 7f 47 7d aa f5 8d bd 8b 10 20 ec 8a c7 43 df |..G}...... ...C.|
00000060 52 90 a9 70 4d 68 b4 76 fa c8 37 85 f5 56 25 82 |R..pMh.v..7..V%.|
00000070 ea 16 06 54 b0 b4 bc 43 16 fb 70 7b 7a 79 a5 8b |...T...C..p{zy..|
00000080 3c 79 7d ef ac 32 fc 35 ce 0f fa a2 6f e7 c3 a4 |<y}..2.5....o...|
00000090 92 a1 a4 c8 83 dd 9f 32 f4 ea d3 1a eb 89 15 a3 |.......2........|

All of the samples I have that contain a manifest.xml begin with “a1 b1 fe fb”, which apparently is the header for an AES CBC encrypted file. A clever person was able to decrypt the file to reveal the actual XML.

TurboTax isn’t sold on physical disk anymore, but you can download the current tax year version from their website. I am not a user of their product so I am not sure if the latest version still saves files in the same way. If you do use it currently, I would love to know if it is still the same.

So to recap, the headers are:

  • 1994: “TurboTax”, followed by “Format=WIN” and “Version=13” on separate lines
  • 1995-99: “C045015F”
  • 2000-03: “C0450164”
  • 2004-13: “TTFN”
  • 2014-current: ZIP container (“504B0304”) holding a manifest.xml

This should be enough to create five new signatures for identification. Extensions will be a problem since they change every year, but we can add them to the list. With these signatures we can now identify all the tax files we have and set them aside if not needed.
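
As a rough illustration of how the recap above could be turned into a check, here is a Python sketch that matches the first bytes of a file against those headers. It is not a PRONOM signature. The labels and the `identify` function are my own, and the ZIP test assumes manifest.xml is the first entry in the container, as it is in the 2015 sample.

```python
from typing import Optional

# Header bytes taken from the samples in this post; labels are my own.
SIGNATURES = [
    (b"TurboTax\r\nFormat=", "TurboTax 1994 (text header)"),
    (b"\xc0\x45\x01\x5f", "TurboTax 1995-99"),
    (b"\xc0\x45\x01\x64", "TurboTax 2000-03"),
    (b"TTFN", "TurboTax 2004-13"),
]


def identify(head: bytes) -> Optional[str]:
    """Guess the TurboTax format from the first 64 or so bytes of a file."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    # A ZIP magic number alone is far too generic, so also look for the
    # manifest.xml name (assumed here to be the first entry).
    if head.startswith(b"PK\x03\x04") and b"manifest.xml" in head[:64]:
        return "TurboTax 2014+ (ZIP container)"
    return None


print(identify(b"TTFN\x01\x01\x6f\x68"))
print(identify(bytes.fromhex("c0 45 01 64 0d 00 00 80")))
```

To use it on a real file, read the first 64 bytes with `open(path, "rb").read(64)` and pass them in. A result of None just means none of these headers matched.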
