Bioinformatics combines computer science, statistics, and biology to collect, store, organize, and analyze life science data. Using these techniques, the large amount of data produced by NGS instruments can be analyzed rapidly and effectively to obtain useful, human-readable results. A typical NGS data analysis pipeline is as follows.
1. Raw NGS data
The sequencing data produced by the instrument is stored as a FASTQ file, which records both the sequence information and the sequencing accuracy in a single file. Each read occupies four lines:
- Sequence ID: The first and third lines contain the unique identifier of each sequencing read. The ID information in the third line is usually omitted. The format of the ID depends on the sequencing platform and includes information about the sequencing instrument. In paired-end sequencing, forward and reverse reads share the same ID, and an additional tag is added to indicate the read direction.
- Sequence: The second line corresponds to the nucleotide sequence read from the instrument.
- Quality score: The fourth line represents the sequencing accuracy of each nucleotide in the sequence. It is expressed as a Phred quality score, with each score encoded as a single ASCII character.
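As an illustration of the four-line record layout, the following minimal Python sketch parses FASTQ records and decodes the quality line. It assumes the common Phred+33 ASCII offset (some older platforms used a different offset), and the function names parse_fastq and error_probability are hypothetical names chosen for this example. A Phred score Q corresponds to an error probability of 10^(-Q/10).

```python
import io

PHRED_OFFSET = 33  # assumption: Phred+33 encoding

def parse_fastq(handle):
    """Yield (read_id, sequence, quality_scores) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # third line: '+', optionally followed by the ID again
        quality = handle.readline().rstrip("\n")
        scores = [ord(ch) - PHRED_OFFSET for ch in quality]
        yield header[1:], sequence, scores  # drop the leading '@' of the ID line

def error_probability(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)

# A single made-up record for demonstration
example = "@read1\nACGT\n+\nII5!\n"
records = list(parse_fastq(io.StringIO(example)))
```

In this made-up record, the character I decodes to a score of 40 and the character 5 to a score of 20, which corresponds to a 1 in 100 chance of a wrong base call.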
2. Quality filter & Adapter trimming
This process is intended to improve the overall quality of the sequencing data analysis. First, low-quality bases are trimmed from both ends of each read. If the average quality score of the read is low, or if it contains too many ambiguous bases (N), the entire read is removed. When the insert size is shorter than the NGS read length, the NGS adapter sequence is also read at the end of the read. As these adapter sequences are not part of the genome sequence, they should also be trimmed from the data.
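The trimming and filtering rules described above can be sketched in Python as follows. This is a simplified illustration rather than the behavior of any particular trimming tool: the thresholds (min_q, min_mean_q, max_n_fraction, min_overlap) and function names are assumptions made for the example, and adapter matching here is exact, whereas real tools typically tolerate mismatches.

```python
def trim_low_quality_ends(seq, quals, min_q=20):
    """Remove bases below min_q from both ends of the read."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]

def trim_adapter(seq, quals, adapter, min_overlap=3):
    """Cut the read where the adapter begins (exact matching only)."""
    idx = seq.find(adapter)
    if idx != -1:  # complete adapter found inside the read
        return seq[:idx], quals[:idx]
    # Otherwise look for a partial adapter at the 3' end, longest overlap first
    for k in range(min(len(adapter), len(seq)) - 1, min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:len(seq) - k], quals[:len(seq) - k]
    return seq, quals

def passes_filter(seq, quals, min_mean_q=20, max_n_fraction=0.1):
    """Keep a read only if its mean quality is high and it has few N bases."""
    if not seq:
        return False
    mean_q = sum(quals) / len(quals)
    n_fraction = seq.count("N") / len(seq)
    return mean_q >= min_mean_q and n_fraction <= max_n_fraction

# Made-up read whose 3' end runs into the start of a made-up adapter
trimmed_seq, trimmed_quals = trim_adapter(
    "ACGTACGTAGATC", [30] * 13, adapter="AGATCGGAAGAGC"
)
```

Here the last five bases of the read match the beginning of the adapter, so only the first eight bases are kept.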
3. Alignment Mapping and Sorting
In this step, the sequence of each NGS read is compared to the reference genome sequence to determine where the read originated. The aligned reads are then sorted by chromosome and coordinate for further analysis.
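The idea of mapping followed by coordinate sorting can be shown with a toy Python sketch. It uses exact substring search against a tiny made-up reference, which is only a stand-in: real aligners rely on indexed, mismatch-tolerant algorithms. The Alignment record, map_read, and coordinate_sort are hypothetical names for this example, and the chromosome order is assumed to follow the order of sequences in the reference.

```python
from typing import NamedTuple, Optional

class Alignment(NamedTuple):
    read_id: str
    chrom: str
    pos: int  # 1-based leftmost mapped position

def map_read(read_id, seq, reference) -> Optional[Alignment]:
    """Toy mapper: report the first exact match of the read in the reference."""
    for chrom, chrom_seq in reference.items():
        idx = chrom_seq.find(seq)
        if idx != -1:
            return Alignment(read_id, chrom, idx + 1)
    return None  # unmapped read

def coordinate_sort(alignments, chrom_order):
    """Sort by chromosome (in reference order), then by position."""
    rank = {chrom: i for i, chrom in enumerate(chrom_order)}
    return sorted(alignments, key=lambda a: (rank[a.chrom], a.pos))

# Made-up reference and reads
reference = {"chr1": "ACGTACGTTTGACC", "chr2": "GGGCATTACAGG"}
reads = [("r1", "CATTAC"), ("r2", "TTGACC"), ("r3", "ACGT")]

mapped = [map_read(rid, seq, reference) for rid, seq in reads]
sorted_alignments = coordinate_sort(
    [a for a in mapped if a is not None], chrom_order=list(reference)
)
```

After sorting, reads from the same chromosome are adjacent and ordered by position, which is what downstream steps such as duplicate removal rely on.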
4. PCR duplicate removal
PCR duplicates are sets of amplified products that originate from the same DNA template through PCR. They are removed to ensure that the sequencing data reflects only the nucleic acid molecules originating from the actual sample and to prevent false-positive variants. The duplicate removal step estimates PCR duplicates from the identical mapped coordinates of paired reads. To increase the accuracy and efficiency of duplicate removal, additional information such as molecular barcodes can be used. This process cannot be applied to PCR amplicon sequencing data, because that technique by its nature generates multiple identical paired reads from different templates.
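A minimal Python sketch of coordinate-based duplicate removal is shown below. It assumes each read pair has already been reduced to a hypothetical ReadPair record holding the chromosome, the outer mapped coordinates of the pair, and an optional molecular barcode (umi). Keeping the first pair seen for each key is a simplification made for this example; a tool might instead keep the pair with the best quality.

```python
from typing import NamedTuple, Optional

class ReadPair(NamedTuple):
    pair_id: str
    chrom: str
    start: int  # leftmost mapped coordinate of the pair
    end: int    # rightmost mapped coordinate of the pair
    umi: Optional[str] = None  # molecular barcode, if available

def remove_duplicates(pairs, use_umi=False):
    """Keep the first pair for each (chrom, start, end[, umi]) key."""
    seen = set()
    kept = []
    for pair in pairs:
        key = (pair.chrom, pair.start, pair.end, pair.umi if use_umi else None)
        if key in seen:
            continue  # same coordinates (and barcode): treated as a PCR duplicate
        seen.add(key)
        kept.append(pair)
    return kept

# Made-up pairs: p1 and p2 share coordinates and barcode, p3 differs only in barcode
pairs = [
    ReadPair("p1", "chr1", 100, 350, "AACG"),
    ReadPair("p2", "chr1", 100, 350, "AACG"),
    ReadPair("p3", "chr1", 100, 350, "TTGC"),
    ReadPair("p4", "chr1", 120, 360, "AACG"),
]
by_coordinate = remove_duplicates(pairs)
by_coordinate_and_umi = remove_duplicates(pairs, use_umi=True)
```

With coordinates alone, p2 and p3 are both discarded as duplicates of p1. With the barcode included in the key, p3 is retained because it carries a different barcode and so is treated as a distinct template molecule.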