FASTQ file content
FASTQ files are submitted as they come off the sequencing instrument, leaving as
many processing decisions as possible to downstream users. The files are accompanied by
documentation detailing how the sequencing libraries were constructed. This informs
the end user about how they might want to process the data, the strengths and
limitations of the various data processing options, and how these apply
to the user's biological questions of interest.
ENCODE produces replicate data for most experiments to quantify reliability.
Biological replicates involve different biological samples, e.g., different tissue
preparations, or separate rounds of cell growth and expansion when cell lines are used. Biological
replicates are contrasted with technical replicates, for which different sequencing libraries are
prepared from the same sample, or different sequencing lanes are used for the same library. Reads from different
replicates are stored in separate files and should include flow cell and lane ID. If
multiple lanes are used for the same biological or technical replicate, they are stored
in the same file (after a QC check to eliminate failed lanes), with information on flow cell and
lane ID included. For experiments that produce paired-end reads, the two reads in each
pair are stored in two separate files, with the reads in the same order in the two files.
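The requirement that paired-end files list reads in the same order can be checked with a short script. The following is a minimal sketch, not part of the ENCODE tooling: the function names are hypothetical, and it assumes well-formed four-line records whose read names match between the two files apart from an optional /1 or /2 suffix.

```python
from itertools import zip_longest

def read_names(handle):
    """Yield the read name from the header line of each four-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 != 0:
            continue  # only the first line of each record carries the name
        fields = line[1:].split()  # drop the leading '@' and any description
        if not fields:
            continue  # tolerate a trailing blank line
        name = fields[0]
        # Assumption: mates may be distinguished by a /1 or /2 suffix.
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name

def pairs_in_sync(handle1, handle2):
    """Return True if both files contain the same read names in the same order."""
    for a, b in zip_longest(read_names(handle1), read_names(handle2)):
        if a != b:  # also catches files of different length (None vs. name)
            return False
    return True
```

The two handles can be any open text files, for example the results of open() or gzip.open(..., "rt") for compressed files.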
The reads in FASTQ files are unfiltered, i.e., barcodes that are read as part of the
sequence, adapter sequences, and spike-in reads all remain in the files. The exception
is Illumina sequencing: barcodes that are read in the so-called third read position
should not be present in the sequence. For bisulfite sequencing experiments, the raw
FASTQ files are presented, wherein most unmethylated cytosines are converted to thymines.
Reads are not "clipped" (no bases are removed). For example, small RNAs that are
shorter than the read length may be flanked by adapters, and these adapter
sequences remain in the FASTQ file. Some libraries are constructed so that the
barcode is read out as part of the sequence (CSHL small RNAs were made this way during phase II of
ENCODE) and therefore appears in the FASTQ file. Even though these barcodes need to be trimmed off
before mapping, they are still included in the FASTQ file because different users may choose
different trimming algorithms.
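Because the files are not clipped, users trim adapters themselves before mapping. The following is a deliberately naive sketch of 3' adapter trimming, assuming an exact match to a known adapter sequence. The function name is hypothetical and the adapter passed in is up to the user; dedicated trimming tools additionally handle mismatches and partial adapters at the read end, which this sketch does not.

```python
def trim_3prime_adapter(seq, qual, adapter):
    """Cut a read at the first exact occurrence of the adapter.

    The quality string is cut at the same position so that the
    sequence and quality lines stay the same length.
    """
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual  # adapter not found: leave the read untouched
    return seq[:pos], qual[:pos]
```

The original, untrimmed FASTQ file should be kept, so that a different trimming strategy can be applied later.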
FASTQ sequencing quality
FASTQ uses four lines for each sequence, with the fourth line giving the sequencing
quality at each position. The consortium reports Phred quality scores from 0 to 93 as ASCII
characters 33 to 126, i.e., the character code is the Phred score plus 33. This encoding is used by the newest versions of the
Illumina pipeline, Sanger and SRA. The Phred score of a
base[3][4] is defined as Q = -10 log10(e), where e is the estimated probability
that the base call is erroneous.
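The encoding and the Phred definition above can be illustrated with a few lines of Python. This is a minimal sketch with hypothetical function names; the offset of 33 is the one stated above.

```python
import math

def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality line into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_line]

def error_probability(q):
    """Invert Q = -10 log10(e): return the error probability for a Phred score."""
    return 10 ** (-q / 10)

def phred_from_error(e):
    """Return the Phred score for an estimated error probability e."""
    return -10 * math.log10(e)
```

For example, the character "I" has ASCII code 73 and therefore encodes a Phred score of 40, which corresponds to an estimated error probability of 1 in 10,000.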
- Introductory information on the FASTQ format