data set – Obsolete Thor

There seems to be a never ending source of file formats out there. Documenting past obsolete formats, one would assume a point at which there are no more to find, but in reality more are re-discovered everyday by the Digital Preservation community. When it comes to more modern formats, it seems more are invented everyday, too many to keep up with identification. Document one, 10 more pop up, it seems never-ending. Such is the case for scientific formats, including sequencing formats.

I was speaking with a colleague from another institution the other day and a file format was mentioned I hadn’t heard of before. It seems many of their scientific data was stored in a format called FASTA “Fast A” (“fast-aye”). This format specifically stores DNA sequence data and is used quite a bit, it seems. I was even more surprised the next day when I went to process some new submissions for our repository only to find one submission contained three FASTA files. I love researching file formats, but sometimes in order to understand the format structure you have to know something about the content as well. Let’s explore the FASTA and FASTQ file formats. If you would like to take a peek at the Human Genome in FASTA, go here.

Both the FASTA and FASTQ formats are text based and have a simple structure. Identification of each of these should be pretty simple, but to avoid conflicts with other formats, the signature might have to be more complex.

The FASTA format is well documented as many in the scientific community use it. Basically the format starts with the greater than “>” character followed by a description, a new line character, then the sequence. For example:

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

Pretty straight forward, but so much of the format can be variable, a simple signature would clash with too many other formats. There are some rules with what characters can be used in the sequence so it might be possible to limit the signature to only allow certain characters. At first I thought it might only be able to contain the standard characters representative of adenine (A), cytosine (C), guanine (G), and thymine (T), but as it turns out the FASTA format can contain Nucleic Acid Code’s and Amino Acid Code’s. These codes allow more than the four I was expecting, but do limit what can be represented.

Take the NCBI Sequence Viewer for a spin and download some data as FASTA.

The FASTQ format adds more structure and is more limiting, but also presents some challenges. Here is a sample of its structure:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Instead of a greater than symbol, the FASTQ format uses an “@” symbol followed by an identifier. The identifier can be basically anything and as long as needed. There is a newline character followed by the DNA sequence, which is only the four characters I have heard before. It can contain an A, C, G, T, or N. The “N” can represent an unidentified nucleotide or indicate that the software was unable to make a basecall. A newline character again and the “+” symbol. This is place before the fourth line with is a quality score and is the same number of characters as the sequence.

See what I mean when you have to learn about the context of a format in order to make a proper signature!

One of the problems I am left with is how to determine how many of the sequence characters to use in the signature to not have any conflicts. Too few and it might conflict with another format or simple text file. Too many and the signature gets complicated and may exclude a short sequence file. As far as I can tell there is no set minimum or maximum for the sequence. Not sure what the genome for Pinus Taeda would look like in FASTA with 22.18 billion base pairs. The other problem is often times these formats are compressed into a GZIP file, so they need to be extracted before identification.

These two formats are just a couple of the many sequencing formats being used in the bioinformatics community. I am sure others will pop up in the future. Until then, I have with the help of others put together a signature which seems to work well for the samples and data sets we have access to. Take a look at my GitHub for the signature proposal. If you find any issues, let me know!

Tag: data set

FASTA & FASTQ