FASTA & FASTQ

There seems to be a never ending source of file formats out there. Documenting past obsolete formats, one would assume a point at which there are no more to find, but in reality more are re-discovered everyday by the Digital Preservation community. When it comes to more modern formats, it seems more are invented everyday, too many to keep up with identification. Document one, 10 more pop up, it seems never-ending. Such is the case for scientific formats, including sequencing formats.

I was speaking with a colleague from another institution the other day and a file format was mentioned I hadn’t heard of before. It seems many of their scientific data was stored in a format called FASTA “Fast A” (“fast-aye”). This format specifically stores DNA sequence data and is used quite a bit, it seems. I was even more surprised the next day when I went to process some new submissions for our repository only to find one submission contained three FASTA files. I love researching file formats, but sometimes in order to understand the format structure you have to know something about the content as well. Let’s explore the FASTA and FASTQ file formats. If you would like to take a peek at the Human Genome in FASTA, go here.

Both the FASTA and FASTQ formats are text based and have a simple structure. Identification of each of these should be pretty simple, but to avoid conflicts with other formats, the signature might have to be more complex.

The FASTA format is well documented as many in the scientific community use it. Basically the format starts with the greater than “>” character followed by a description, a new line character, then the sequence. For example:

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

Pretty straight forward, but so much of the format can be variable, a simple signature would clash with too many other formats. There are some rules with what characters can be used in the sequence so it might be possible to limit the signature to only allow certain characters. At first I thought it might only be able to contain the standard characters representative of adenine (A), cytosine (C), guanine (G), and¬†thymine¬†(T), but as it turns out the FASTA format can contain Nucleic Acid Code’s and Amino Acid Code’s. These codes allow more than the four I was expecting, but do limit what can be represented.

Take the NCBI Sequence Viewer for a spin and download some data as FASTA.

The FASTQ format adds more structure and is more limiting, but also presents some challenges. Here is a sample of its structure:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Instead of a greater than symbol, the FASTQ format uses an “@” symbol followed by an identifier. The identifier can be basically anything and as long as needed. There is a newline character followed by the DNA sequence, which is only the four characters I have heard before. It can contain an A, C, G, T, or N. The “N” can represent an unidentified nucleotide or indicate that the software was unable to make a basecall. A newline character again and the “+” symbol. This is place before the fourth line with is a quality score and is the same number of characters as the sequence.

See what I mean when you have to learn about the context of a format in order to make a proper signature!

One of the problems I am left with is how to determine how many of the sequence characters to use in the signature to not have any conflicts. Too few and it might conflict with another format or simple text file. Too many and the signature gets complicated and may exclude a short sequence file. As far as I can tell there is no set minimum or maximum for the sequence. Not sure what the genome for Pinus Taeda would look like in FASTA with 22.18 billion base pairs. The other problem is often times these formats are compressed into a GZIP file, so they need to be extracted before identification.

These two formats are just a couple of the many sequencing formats being used in the bioinformatics community. I am sure others will pop up in the future. Until then, I have with the help of others put together a signature which seems to work well for the samples and data sets we have access to. Take a look at my GitHub for the signature proposal. If you find any issues, let me know!

Interactive Quicktime

One of my favorite legacy formats to explore is any type of multimedia CD-ROM. The 1990’s and early 2000’s were filled with all sorts of multimedia for CD, Web, and Television. It is also one of the most difficult formats to try and preserve for the future. Many CD-ROM’s are filled with executables and/or Macromedia Director media, later having flash content. The operating systems and security needs today make playback almost impossible. For this reason many have built emulation services to mimic the original operation system and software to allow the many historic multimedia CD-ROM’s to once again interact with the user in a way many current systems still struggle with.

Many CD-ROM’s would come as Hybrid disc’s allowing them to be used on a Windows and Macintosh system, sometimes providing two different experiences. Then there were CD-Extra or Enhanced CD‘s as a separate session to an Audio CD which would contain bonus content playable only on a computer.

For fun I took a look back at some of my older Audio CD titles. I came across a couple, one claiming to be a “CD-Extra” and another an “Enhanced CD“. The CD-Extra disc when queried with cd-info claimed to have 12 tracks, with the 12th being a data XA track.

Disc mode is listed as: CD-ROM Mixed
CD-ROM Track List (1 - 12)
#: MSF LSN Type Green? Copy? Channels Premphasis?
1: 00:02:00 000000 audio false no 2 no
2: 02:13:66 009891 audio false no 2 no
3: 05:21:28 023953 audio false no 2 no
4: 08:18:19 037219 audio false no 2 no
5: 12:28:37 055987 audio false no 2 no
6: 16:11:58 072733 audio false no 2 no
7: 19:21:56 086981 audio false no 2 no
8: 23:17:49 104674 audio false no 2 no
9: 26:01:17 116942 audio false no 2 no
10: 28:30:02 128102 audio false no 2 no
11: 31:07:70 139945 audio false no 2 no
12: 37:29:46 168571 XA true no
170: 51:35:07 231982 leadout (520 MB raw, 516 MB formatted)
CD Analysis Report
CD-Plus/Extra
session #2 starts at track 12, LSN: 168571

Mounting the 12th track showed a mix of Macromedia Director (.DIR) files and quite a few Quicktime MOV movies. Playback was not possible on my current computer so I had to resort to using an emulator to experience this bonus content, full of band member photos and biographies.

The other disc I pulled out to explore was a bit different. Using cd-info the disc looked very similar:

Disc mode is listed as: CD-ROM Mixed
CD-ROM Track List (1 - 13)
#: MSF LSN Type Green? Copy? Channels Premphasis?
1: 00:02:00 000000 audio false no 2 no
2: 04:20:08 019358 audio false no 2 no
3: 08:04:27 036177 audio false no 2 no
4: 11:15:62 050537 audio false no 2 no
5: 14:54:32 066932 audio false no 2 no
6: 19:57:73 089698 audio false no 2 no
7: 26:12:36 117786 audio false no 2 no
8: 29:51:59 134234 audio false no 2 no
9: 34:44:00 156150 audio false no 2 no
10: 39:36:62 178112 audio false no 2 no
11: 42:06:01 189301 audio false no 2 no
12: 45:42:26 205526 audio false no 2 no
13: 57:10:54 257154 XA true no
170: 72:56:67 328117 leadout (735 MB raw, 730 MB formatted)
CD Analysis Report
CD-Plus/Extra
session #2 starts at track 13, LSN: 257154

The disc’s, even though were labeled CD-Extra and Enhanced CD, had the same structure and format. The difference was in the type of multimedia used. There was a simple application which launched Quicktime and loaded a single MOV movie. But, this was not your regular Quicktime Movie, this is a highly complex Interactive Quicktime movie.

The Quicktime movie could only be launched from an older operating system using Quicktime 6, and on the Macintosh, only a PPC CPU. The movie would launch with an interactive menu, allowing navigation as you might find on a DVD or Flash website, but all within a single MOV file. When I ran MediaInfo on the MOV file I got back quite a few tracks:

<media ref="/Volumes/VOLCANOECD/ALECD.mov">
<track type="General">
<VideoCount>10</VideoCount>
<AudioCount>1</AudioCount>
<OtherCount>51</OtherCount>
<FileExtension>mov</FileExtension>
<Format>QuickTime</Format>
<Format_Settings>Compressed header</Format_Settings>

Ten video tracks and 51 other tracks. Exploring with Quicktime, I could see the entire list of embedded content:

Quicktime movies, an Audio track, dozens of Flash, Photos, Animations, Sprites, with the possibility of more. These types of Quicktime files had requirements in order to run with Quicktime 6 being the last which could playback all the content correctly. Current versions of Quicktime give a warning on the lack of compatibility.

This Interactive Quicktime movie proudly claims; “Made with LiveStage Pro“, which was an authoring environment for Quicktime made by Totally Hip Software Inc. Started in 1995, but seemed to disappear after 2004 with no new development and by 2014 the website went offline.

If you would like to see a couple of Apple created simple examples see here.

LiveStage Pro was a very powerful authoring tool in its time, another similar tool called Electrifier competed for the interactive Quicktime market. Adobe GoLive also competed, but offered fewer features. The final Quicktime movie exported from LiveStage Pro was the main component, but the software did save a project format with the extension “LSD”. Versions 2 through 4 of LiveStage Pro had a similar header.

hexdump -C LiveStagePro4-s01.lsd | head
00000000 4c 53 41 46 00 00 00 04 00 00 09 16 00 00 00 00 |LSAF............|
00000010 00 00 00 00 00 00 00 00 00 00 09 0a 73 65 61 6e |............sean|
00000020 00 00 00 01 00 00 00 03 00 00 00 00 00 00 00 18 |................|
00000030 56 53 4e 6e 00 00 00 01 00 00 00 00 00 00 00 00 |VSNn............|
00000040 00 00 00 04 00 00 08 84 4d 50 52 4e 00 00 00 01 |........MPRN....|
00000050 00 00 00 49 00 00 00 00 00 00 00 21 6d 4f 55 54 |...I.......!mOUT|
00000060 00 00 00 01 00 00 00 00 00 00 00 00 55 6e 74 69 |............Unti|
00000070 74 6c 65 64 2e 6d 6f 76 00 00 00 00 18 57 6c 65 |tled.mov.....Wle|
00000080 66 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 |f...............|
00000090 00 00 00 00 18 57 74 6f 70 00 00 00 01 00 00 00 |.....Wtop.......|

All the samples from version 2 through 4 have the first four bytes as “LSAF“. It also seems the next four bytes may be version related. Version 1 however has a different header.

hexdump -C contest.lsd | head
00000000 4c 53 50 72 00 00 00 08 00 00 00 00 00 00 02 80 |LSPr............|
00000010 01 e0 00 00 00 00 02 58 00 00 00 01 00 00 00 01 |.......X........|
00000020 00 00 00 02 00 00 00 00 00 08 00 00 00 00 00 00 |................|
00000030 00 00 08 53 02 d9 ff c9 04 76 02 97 01 00 44 00 |...S.....v....D.|
00000040 0b 02 fb 03 c9 00 00 00 01 00 00 00 01 00 00 00 |................|
00000050 00 07 41 63 74 69 6f 6e 73 00 00 00 00 00 00 00 |..Actions.......|
00000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000070 00 00 00 00 00 00 00 00 05 00 00 00 01 50 49 43 |.............PIC|
00000080 54 ff ff 00 00 c1 ff 03 72 65 64 65 6e 6e 41 79 |T.......redennAy|
00000090 98 05 41 77 78 00 00 01 7a 00 10 00 00 31 fc 30 |..Awx...z....1.0|

Identification of a LiveStage project should be simple enough, but identifying and rendering back a Quicktime movie made by this software takes some work. In fact there are many “Enhanced CD’s” and CD-Extra titles out there with quite a few system requirements. If we are not careful, many of these little gems might get more difficult to experience or lost completely.

If you would like to explore the Quicktime Movie from the Enhanced CD mentioned here, send me a message. You can also take a look at my signature proposal and samples files on my Github for LiveStage.