Celemics, Inc.

NGS Key Terminology Guide Part 3: Bioinformatics

Glossary of common NGS terms

A successful NGS experiment depends on more than sequencing: it requires robust bioinformatics analysis to turn raw data into biological insight.
This post introduces essential bioinformatics terms such as FASTA, FASTQ, Phred scores, and variant calling. From quality control and alignment to de novo assembly and annotation, this guide explains how sequencing data is interpreted and why each step is critical for generating meaningful results.

Bioinformatics analysis is the application of computational tools and algorithms to extract meaningful biological insights from raw sequencing data.

This may include:
– Variant annotation
– Gene expression quantification
– Pathway enrichment analysis
– Fusion gene detection
– Microbial profiling, and more

A Phred Score is a logarithmic quality score that quantifies the accuracy of individual base calls in sequencing data.
It is calculated as Q = -10 log10(P), where P is the probability of an incorrect base call.
Higher Phred scores indicate greater base call accuracy.
This score is widely used in FASTQ files to represent sequencing quality and to guide downstream analysis and filtering.
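As a minimal sketch of the formula above, the following Python converts between Phred scores and error probabilities and decodes a FASTQ quality string. The function names are illustrative, and the Phred+33 ASCII offset is an assumption; older data may use a different offset.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Return P, the probability of an incorrect base call, for Phred score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Return Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality_string(qual: str, offset: int = 33) -> list[int]:
    """Decode ASCII-encoded qualities; offset 33 (Phred+33) is assumed here."""
    return [ord(ch) - offset for ch in qual]


# Q30 corresponds to a 1 in 1,000 chance that the base call is wrong.
q30_error = phred_to_error_prob(30)
decoded = decode_quality_string("I5!")  # [40, 20, 0]
```

Under this encoding, the character "I" represents Q40 and "!" represents Q0.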

A FASTA file is a text-based format used to represent nucleotide or protein sequences.
Each entry begins with a header line starting with “>”, followed by a sequence name or description.
The actual sequence is written in the lines below.
It does not contain quality scores, and is commonly used for reference genomes, gene sequences, or protein databases.
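The layout described above can be read with a few lines of Python. This is a minimal sketch with an illustrative function name and made-up example records; it assumes well-formed input and joins multi-line sequences into a single string.

```python
from typing import Iterable, Iterator


def parse_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []  # drop the leading ">"
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical example records (one nucleotide, one protein).
example = [
    ">seq1 example gene",
    "ACGTAC",
    "GTTA",
    ">seq2",
    "MKV",
]
records = list(parse_fasta(example))
# [("seq1 example gene", "ACGTACGTTA"), ("seq2", "MKV")]
```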

A FASTQ file stores raw sequencing reads along with per-base quality scores. Each read is represented by four lines:

  1. A read identifier line beginning with "@"
  2. The raw nucleotide sequence
  3. A separator line beginning with "+"
  4. ASCII-encoded Phred quality scores

FASTQ is the standard output format for most next-generation sequencing (NGS) platforms, including Illumina.
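The four-line record structure can be sketched in Python as follows. The function name and example reads are illustrative, and the sketch assumes unwrapped records (exactly four lines per read) with Phred+33 quality encoding.

```python
from typing import Iterable, Iterator


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[tuple[str, str, list[int]]]:
    """Yield (read_id, sequence, qualities) from four-line FASTQ records."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per read")
    for i in range(0, len(rows), 4):
        header, seq, sep, qual = rows[i:i + 4]
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in record {i // 4 + 1}")
        # Assumed Phred+33 encoding: quality = ASCII code minus 33.
        yield header[1:], seq, [ord(c) - offset for c in qual]


# Hypothetical example reads.
example = ["@read1", "ACGT", "+", "IIII", "@read2", "GGCA", "+", "!5?I"]
reads = list(parse_fastq(example))
# [("read1", "ACGT", [40, 40, 40, 40]), ("read2", "GGCA", [0, 20, 30, 40])]
```

Records are consumed in fixed groups of four lines rather than split on "@", because "@" is also a valid quality character and can appear at the start of a quality line.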

A SAM (Sequence Alignment/Map) file is a text-based format that stores sequencing reads aligned to a reference genome.
It includes fields such as read name, alignment position, CIGAR string, mapping quality, and optional tags for additional metadata.
A BAM file is the binary (compressed) version of a SAM file, designed to reduce file size and enable faster computational processing.
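To make the fields concrete, here is a minimal Python sketch that splits one SAM alignment line into its eleven mandatory tab-separated columns and interprets the CIGAR string. The helper names and the example record are illustrative; header lines (starting with "@") and BAM decoding are out of scope, and a dedicated library would normally be used for real files.

```python
import re

# The eleven mandatory SAM columns, in order.
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line: str) -> dict:
    """Parse one alignment line (not a header line) into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    rec = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        rec[key] = int(rec[key])
    rec["tags"] = cols[11:]  # optional TAG:TYPE:VALUE fields
    return rec


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string such as '5M1I4M' into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered: operations M, D, N, = and X."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


# Hypothetical alignment record.
line = "read1\t0\tchr1\t100\t60\t5M1I4M2D3M\t*\t0\t0\tACGTACGTACGTA\tIIIIIIIIIIIII\tNM:i:3"
rec = parse_sam_line(line)
span = reference_span(rec["cigar"])  # 5 + 4 + 2 + 3 = 14
```

The position field is 1-based, so this hypothetical read covers reference positions 100 through 113 on chr1.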

Quality control (QC) involves assessing sequencing data quality and removing low-quality reads, adapter sequences, and technical artifacts.
These steps ensure that only high-confidence data are retained for downstream analysis, improving reliability and accuracy.
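Two of the steps above, adapter removal and low-quality read filtering, can be sketched in Python. This is a simplified illustration, not a replacement for a dedicated trimming tool: it only finds exact adapter matches, and the function names, thresholds, and adapter sequence are assumed example values.

```python
def trim_adapter(seq: str, quals: list[int], adapter: str) -> tuple[str, list[int]]:
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, quals
    return seq[:idx], quals[:idx]


def passes_qc(quals: list[int], min_mean_q: float = 20, min_len: int = 30) -> bool:
    """Keep a read only if it is long enough and its mean Phred score is high enough."""
    if len(quals) < max(min_len, 1):
        return False
    return sum(quals) / len(quals) >= min_mean_q


# Assumed example adapter sequence, for illustration only.
ADAPTER = "AGATCGGAAGAGC"
read_seq = "ACGTACGT" + ADAPTER + "TT"
read_quals = [35] * len(read_seq)

trimmed_seq, trimmed_quals = trim_adapter(read_seq, read_quals, ADAPTER)
keep = passes_qc(trimmed_quals, min_mean_q=20, min_len=5)
```

Real pipelines usually also tolerate mismatches in the adapter and trim low-quality bases from read ends before applying a length filter.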

Fragment size distribution is the size profile of DNA fragments after shearing; it influences hybridization efficiency and sequencing performance.
For example, a fragment size of around 300 bp is ideal for 150 bp paired-end sequencing, optimizing capture efficiency and cluster generation.
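The arithmetic behind the 300 bp example can be sketched in Python. The function below is a hypothetical helper that ignores adapters and alignment details; it only shows how fragment length and read length determine whether the two reads of a pair overlap, tile the fragment exactly, or leave an unsequenced middle.

```python
def paired_end_layout(fragment_len: int, read_len: int) -> dict[str, int]:
    """Describe how two reads of read_len bases cover a fragment of fragment_len bases."""
    return {
        # Bases sequenced by both reads.
        "overlap": max(0, min(fragment_len, 2 * read_len - fragment_len)),
        # Bases in the middle of the fragment that neither read reaches.
        "unsequenced_gap": max(0, fragment_len - 2 * read_len),
        # Bases per read that run past the fragment end into adapter sequence.
        "adapter_readthrough": max(0, read_len - fragment_len),
    }


layout_300 = paired_end_layout(300, 150)  # reads tile the fragment exactly
layout_200 = paired_end_layout(200, 150)  # reads overlap by 100 bases
layout_100 = paired_end_layout(100, 150)  # each read runs 50 bases into adapter
```

With 150 bp paired-end reads, a 300 bp fragment is covered end to end with no redundant overlap and no adapter read-through, which is the reasoning behind the example above.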

A reference genome is a representative, assembled DNA sequence of a species that serves as a standard framework
for aligning, mapping, and comparing individual sequencing data. It is typically derived from one or a few individuals
and reflects a composite consensus of genomic regions, including both coding and non-coding sequences.

Alignment (read mapping) is the computational process of mapping sequencing reads to a reference genome or transcriptome.
This step is fundamental for identifying genetic variants, structural alterations, and expression profiles.
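The idea of read mapping can be illustrated with a toy Python sketch that slides a read along a reference and reports the placement with the fewest mismatches. This brute-force, ungapped search is an assumption made for clarity; production aligners rely on indexed references and gapped alignment, and the sequences below are invented.

```python
from typing import Optional


def align_read(read: str, reference: str, max_mismatches: int = 2) -> Optional[tuple[int, int]]:
    """Return (0-based position, mismatches) of the best ungapped placement, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


# Hypothetical reference and reads.
reference = "TTGACCGTAAGCTAGGCTTA"
exact_hit = align_read("AAGCTAGG", reference)    # (8, 0)
one_mismatch = align_read("AAGCTTGG", reference)  # (8, 1)
unmapped = align_read("CCCCCCCC", reference)      # None
```

Positions here are 0-based; SAM files report the same placement as a 1-based position (9 in this example).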

De novo assembly refers to the process of assembling a genome from raw sequencing reads without the aid of a reference genome.
The process involves constructing contigs by overlapping reads and subsequently organizing them into scaffolds to reconstruct the entire genome sequence.
This method is especially critical for studying newly discovered organisms, highly divergent species, or novel viruses and microbes.

A contig is a continuous stretch of DNA assembled by merging overlapping sequencing reads into an accurate, gap-free sequence.
Contigs are fundamental units in both de novo genome assembly and transcriptome reconstruction (e.g., in RNA-Seq analysis).
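Merging overlapping reads into contigs can be sketched with a greedy Python toy assembler. This is a simplified illustration under strong assumptions (error-free reads, same strand, exact suffix-prefix overlaps); real assemblers typically use overlap or de Bruijn graphs and handle errors and repeats. The reads are invented.

```python
def overlap_len(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if too short)."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap_len(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping hypothetical reads collapse into a single contig.
contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"])
# ["ATGGCGTACCTTA"]
```

A read with no sufficient overlap to any other sequence stays as its own contig, which is how gaps between contigs arise.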

A scaffold is a higher-order structure formed by ordering and orienting multiple contigs using supplementary information such as paired-end reads,
mate-pair data, or physical mapping techniques.
Unlike contigs, scaffolds may contain gaps (typically represented as ‘N’s) between contigs, indicating regions where the exact sequence is unknown.
Scaffolds enable the reconstruction of larger genomic segments and represent an essential intermediate step toward complete genome assembly.
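As a minimal sketch of the scaffold structure described above, the Python below joins contigs in a given order and orientation, filling the gaps between them with "N" characters. The order, strands, and gap sizes are assumed to come from an upstream step (for example paired-end or mate-pair evidence) that is not shown; all names and sequences are illustrative.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_scaffold(placements: list[tuple[str, str]], gap_sizes: list[int]) -> str:
    """Join ordered (contig, strand) placements, with N-filled gaps between them.

    strand is "+" (use as given) or "-" (use the reverse complement);
    gap_sizes[i] is the estimated gap between placement i and i + 1.
    """
    if not placements:
        return ""
    if len(gap_sizes) != len(placements) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    oriented = [seq if strand == "+" else reverse_complement(seq)
                for seq, strand in placements]
    parts = [oriented[0]]
    for gap, contig in zip(gap_sizes, oriented[1:]):
        parts.append("N" * gap)  # unknown sequence between contigs
        parts.append(contig)
    return "".join(parts)


# Two hypothetical contigs, the second on the reverse strand, 5 bases apart.
scaffold = build_scaffold([("ACGT", "+"), ("AAGG", "-")], [5])
# "ACGTNNNNNCCTT"
```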

Coverage is a metric indicating how many times a specific genomic region has been sequenced.

  1. Depth of coverage:
    The average number of reads covering a base position (e.g., 30x means each base is covered on average 30 times).
  2. Breadth of coverage:
    The proportion of the target region covered by sequencing reads at or above a given depth threshold (e.g., ≥20x).

High coverage improves accuracy in variant detection and confidence in the sequencing results.
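Both coverage metrics can be computed from per-base read counts. The Python below is a minimal sketch that assumes alignments are given as 0-based, half-open (start, end) intervals on the target region; the function names and the tiny example are illustrative.

```python
def per_base_depth(region_len: int, alignments: list[tuple[int, int]]) -> list[int]:
    """Count how many aligned reads cover each position of the region."""
    depth = [0] * region_len
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, region_len)):
            depth[pos] += 1
    return depth


def mean_depth(depth: list[int]) -> float:
    """Depth of coverage: average number of reads per base."""
    return sum(depth) / len(depth)


def breadth(depth: list[int], min_depth: int) -> float:
    """Breadth of coverage: fraction of bases covered at least min_depth times."""
    return sum(d >= min_depth for d in depth) / len(depth)


# Three hypothetical reads on a 10-base target.
depth = per_base_depth(10, [(0, 6), (2, 8), (4, 10)])
# depth == [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]
avg = mean_depth(depth)        # 1.8
frac_2x = breadth(depth, 2)    # 0.6
```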

Coverage uniformity is a measure of how evenly sequencing reads are distributed across all targets.
Higher uniformity ensures consistent variant detection sensitivity across regions, reducing the risk of missed variants due to dropouts.
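Uniformity can be quantified in several ways. The Python sketch below assumes one commonly used definition, the fraction of target bases whose depth reaches at least 20% of the mean depth; the threshold, function name, and depth values are assumptions for illustration rather than a fixed standard.

```python
def uniformity(depth: list[int], fraction_of_mean: float = 0.2) -> float:
    """Fraction of bases with depth >= fraction_of_mean * mean depth."""
    mean = sum(depth) / len(depth)
    if mean == 0:
        return 0.0  # no coverage at all
    threshold = fraction_of_mean * mean
    return sum(d >= threshold for d in depth) / len(depth)


# Hypothetical per-base depths with two poorly covered positions (10 and 0).
uneven = [100, 110, 90, 105, 10, 0, 95, 100, 98, 102]
even = [50] * 10

uneven_score = uniformity(uneven)  # mean 81, threshold 16.2, 8 of 10 bases pass
even_score = uniformity(even)
```

The two low-depth positions in the first example are the kind of dropout that can hide a variant even when average depth looks adequate.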

Variant calling is the analytical process of identifying genomic variations, including single nucleotide polymorphisms (SNPs)
and insertions and deletions (InDels), by comparing sequencing data to a reference sequence.
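The comparison to a reference can be illustrated with a naive pileup-based caller in Python. This is a deliberately simplified sketch: it handles only single-base substitutions in ungapped alignments, uses fixed depth and allele-fraction thresholds chosen for illustration, and ignores base and mapping quality, all of which production callers model statistically.

```python
from collections import Counter


def call_snvs(reference: str, aligned_reads: list[tuple[int, str]],
              min_depth: int = 4, min_alt_fraction: float = 0.3) -> list[tuple]:
    """Return (position, ref_base, alt_base, alt_count, depth) for each candidate SNV.

    aligned_reads holds (0-based start position, read sequence) pairs.
    """
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1  # tally observed bases per position

    calls = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        for base, n in counts.most_common():
            if base != reference[pos] and n / depth >= min_alt_fraction:
                calls.append((pos, reference[pos], base, n, depth))
    return calls


# Hypothetical data: 3 of 5 reads carry G instead of A at position 4.
reference = "ACGTACGTAC"
reads = [(0, "ACGTACGTAC")] * 2 + [(0, "ACGTGCGTAC")] * 3
variants = call_snvs(reference, reads)
# [(4, "A", "G", 3, 5)]
```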

Variant annotation is the process of adding biological context and functional information to genetic variants identified through sequencing.
It involves predicting the effect of variants on genes, transcripts, and proteins, and linking them to known databases
of disease associations, population frequencies, and clinical significance.
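The effect-prediction part of annotation can be sketched for the simplest case, a single-base substitution inside a coding sequence. The Python below uses the standard genetic code and assumes an in-frame coding sequence on the forward strand; the function name and example sequence are illustrative, and database lookups (disease associations, population frequencies, clinical significance) are not covered.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)}


def annotate_coding_snv(cds: str, pos: int, alt: str) -> dict:
    """Predict the protein-level effect of substituting alt at 0-based pos in cds."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    frame = pos % 3
    codon_start = pos - frame
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:frame] + alt + ref_codon[frame + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        effect = "synonymous"
    elif alt_aa == "*":
        effect = "nonsense"   # premature stop codon
    elif ref_aa == "*":
        effect = "stop_lost"
    else:
        effect = "missense"   # amino acid change
    return {"codon_number": codon_start // 3 + 1, "ref_codon": ref_codon,
            "alt_codon": alt_codon, "ref_aa": ref_aa, "alt_aa": alt_aa,
            "effect": effect}


# Hypothetical coding sequence: ATG GAA GTT TGA encodes M, E, V, stop.
cds = "ATGGAAGTTTGA"
silent = annotate_coding_snv(cds, 5, "G")    # GAA -> GAG, still E
stop = annotate_coding_snv(cds, 3, "T")      # GAA -> TAA, stop codon
missense = annotate_coding_snv(cds, 4, "T")  # GAA -> GTA, E -> V
```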
