Genomic Data Analysis Tools and Methods
- Ege Altun

Genomic data analysis refers to the collection of methods and tools used to interpret the massive amount of data generated by next-generation sequencing (NGS) technologies, which have revolutionized biological research. Genomic data encompass a wide range of information, including DNA sequences, RNA transcripts, metagenomic samples, and epigenetic modifications. Extracting biological meaning from these data requires a multi-step bioinformatics workflow that spans from processing raw sequencing reads to interpreting biological results. A typical NGS analysis begins with quality control, followed by alignment of the data to a reference genome, detection of variants, functional annotation, and visualization of the results. Different tools are used at each step of this process. For example, a common genomic analysis workflow may include quality control with FastQC, alignment with BWA, variant calling with GATK, and visualization with IGV. In this article, after discussing genomic data types and their acquisition methods, we will examine the fundamental steps applied in NGS data analysis and the widely used tools associated with each stage.
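To make that example concrete, the sketch below wires those steps together in Python by calling the command-line tools through subprocess. It is a minimal illustration under stated assumptions, not a production pipeline: fastqc, bwa, samtools, and gatk are assumed to be installed and on PATH, and all file names (sample_R1.fastq.gz, ref.fa, and so on) are placeholders.

```python
# Minimal sketch of the example workflow (placeholder file names; assumes
# fastqc, bwa, samtools, and gatk are on PATH, and that ref.fa has already
# been indexed with bwa index, samtools faidx, and a GATK sequence dictionary).
import subprocess

def run(cmd):
    """Run a shell command and stop the pipeline if it fails."""
    print(">>", cmd)
    subprocess.run(cmd, shell=True, check=True)

run("fastqc sample_R1.fastq.gz sample_R2.fastq.gz")            # quality control
run("bwa mem ref.fa sample_R1.fastq.gz sample_R2.fastq.gz "
    "| samtools sort -o sample.sorted.bam -")                  # align and sort
run("samtools index sample.sorted.bam")
run("gatk HaplotypeCaller -R ref.fa -I sample.sorted.bam "
    "-O sample.vcf.gz")                                        # variant calling
# The sorted BAM and the VCF can then be loaded into IGV for visualization.
```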
Genomic Data Types
When genomic data are mentioned, the first concept that comes to mind is DNA sequencing data. Genome sequencing (genomics) aims to decode all the DNA information (the genome) of an organism and allows us to examine the structure, function, and variations of genetic material. Through DNA sequencing, either the entire genome (whole genome sequencing) or the exonic regions (whole exome sequencing) can be read with high throughput.

RNA sequencing data (RNA-Seq), on the other hand, are directed toward transcriptome analysis: by sequencing all RNA molecules (the transcriptome) produced at a given time in a cell or tissue, it reveals gene expression profiles. Transcriptomic studies measure which genes are active and to what extent, and analyses in this field focus on identifying differences in gene expression.

Metagenomic data contain DNA sequences from all microorganisms present in an environmental sample or microbial community. With a metagenomic approach, the genetic diversity and functional potential of microorganisms found in environments such as soil, water, or the human gut microbiome can be examined.

Epigenomic data refer to information generated by measuring chemical and structural modifications that regulate gene expression without altering the DNA sequence (e.g., DNA methylation, histone modifications, chromatin accessibility) across the genome. Epigenomic analyses aim to map epigenetic changes to understand the mechanisms governing gene expression regulation.

In terms of data acquisition methods, genomic datasets today are largely produced using next-generation sequencing (NGS) technologies. For more detailed information about NGS technologies, you can visit our article at: https://www.insilicodesign.com/en/post/dna-sequencing-methods.
Quality Control and Preprocessing of Raw Data
Raw data obtained from next-generation sequencing platforms are typically provided in FASTQ format. FASTQ files contain the sequenced reads along with quality scores for each base. Before proceeding with downstream analysis, a quality control (QC) step is applied to evaluate the reliability of the raw data and identify potential errors. One of the most commonly used tools for this purpose is the FastQC program (5). FastQC analyzes various metrics in raw sequencing data—including base quality distribution, GC content, position-dependent quality drops, adapter contamination, overrepresented sequences, and k-mer content—offering researchers an initial and comprehensive assessment of overall data quality.
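To illustrate what those quality scores are, the standalone Python sketch below decodes the Phred+33 quality string of each FASTQ record (the fourth line of every record) and flags reads whose mean quality falls below Q20, i.e., roughly a 1% expected error rate per base. The file name is a placeholder, and real QC should of course rely on FastQC rather than ad hoc scripts.

```python
# Each FASTQ record spans four lines; the fourth carries one quality
# character per base, encoded as Phred+33 (ASCII code minus 33 = Phred score).
def mean_read_quality(fastq_path):
    with open(fastq_path) as fh:
        while True:
            header = fh.readline().rstrip()       # line 1: @read_id
            if not header:
                break
            fh.readline()                         # line 2: bases
            fh.readline()                         # line 3: '+' separator
            qual = fh.readline().rstrip()         # line 4: quality string
            scores = [ord(c) - 33 for c in qual]  # Phred+33 decoding
            yield header, sum(scores) / len(scores)

# Example: flag reads whose mean Phred score is below 20 (~1% error rate).
for read_id, q in mean_read_quality("sample.fastq"):
    if q < 20:
        print(f"low-quality read: {read_id} (mean Q = {q:.1f})")
```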
To improve raw data quality, preprocessing steps are usually performed after QC. The most important of these steps include adapter trimming and removal of low-quality bases. Adapter sequences, which are added during library preparation, may remain at the read ends and must be removed because they can interfere with downstream analyses. Tools commonly used for this task include Trimmomatic (2) and Cutadapt (3). After quality control and trimming, the overall data quality is reassessed.
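As a hedged illustration, the snippet below drives Cutadapt from Python. The flags shown (-a for the 3' adapter, -q for quality trimming, -m for minimum read length, -o for output) are standard Cutadapt options; the adapter sequence is the common Illumina TruSeq prefix, and the file names are placeholders. Trimmomatic would be configured analogously with its own step syntax.

```python
# Minimal trimming sketch using Cutadapt's CLI (assumes cutadapt is installed;
# file names are placeholders).
import subprocess

subprocess.run(
    [
        "cutadapt",
        "-a", "AGATCGGAAGAGC",         # 3' adapter to trim (TruSeq prefix)
        "-q", "20",                    # trim low-quality (Q<20) 3' ends
        "-m", "36",                    # discard reads shorter than 36 bp
        "-o", "sample.trimmed.fastq.gz",
        "sample.fastq.gz",
    ],
    check=True,
)
```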
If data from multiple samples are available, all FastQC reports can be combined into a single summary report using MultiQC (4). MultiQC scans dozens of FastQC outputs and provides an aggregated quality summary for all samples, making it easy to detect whether any sample deviates from the others. For example, if one sample has noticeably lower read quality or if adapter contamination is present only in specific samples, the MultiQC report highlights these issues and alerts the researcher.
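A minimal invocation might look like the following, assuming multiqc is installed and qc_results/ is a placeholder directory holding the per-sample FastQC and Cutadapt outputs.

```python
# MultiQC scans the directory for recognized tool outputs and writes a single
# aggregated HTML report (here into multiqc_report/).
import subprocess

subprocess.run(["multiqc", "qc_results/", "-o", "multiqc_report"], check=True)
```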

Sequence Alignment and Mapping
High-quality reads obtained after preprocessing are typically subjected to an alignment step against a reference genome. The aim of this step is to determine the original genomic location from which each short read was derived. Algorithms developed for alignment map sequencing reads to the most suitable positions within the reference genome. With the rapid increase in NGS data volume, numerous alignment tools capable of performing this task quickly and accurately have been developed. One of the most widely used DNA read alignment tools is BWA (Burrows–Wheeler Aligner) (6). BWA efficiently indexes the reference genome and uses a Burrows–Wheeler transform–based algorithm to rapidly align short reads (especially 100–150 bp Illumina reads). Alternatively, Bowtie2 is also commonly preferred for similar purposes (7). While the original Bowtie is extremely efficient in memory usage—making it well-suited for small genomes or narrow target regions—Bowtie2 provides more flexible alignment capabilities, handling gapped reads effectively. After DNA alignment, the resulting mappings are saved in SAM/BAM format, which stores not only the genomic location of each read but also additional information such as alignment quality.

Alignment poses an additional challenge in RNA-Seq data: due to the exon–intron structure of eukaryotic transcripts, cDNA reads generated from mRNA produce split alignments on the reference genome. Therefore, splice-aware alignment tools must be used. Among the most widely used aligners for RNA-Seq are HISAT2 (8) and STAR (9). HISAT2 models exon–intron boundaries (splice junctions) and aligns mRNA reads to their correct genomic positions by skipping introns. STAR (Spliced Transcripts Alignment to a Reference), on the other hand, is known for its remarkable speed.
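A minimal DNA alignment sketch under those assumptions might look like this (bwa and samtools on PATH, placeholder file names); for RNA-Seq, the bwa mem call would be replaced by a splice-aware aligner such as HISAT2 or STAR.

```python
# DNA read alignment sketch with BWA-MEM and SAMtools via subprocess.
import subprocess

def run(cmd):
    subprocess.run(cmd, shell=True, check=True)

run("bwa index ref.fa")  # one-time FM-index construction for the reference
# Align paired-end reads; bwa mem writes SAM to stdout, which samtools
# converts into a coordinate-sorted BAM file.
run("bwa mem -t 8 ref.fa sample_R1.fastq.gz sample_R2.fastq.gz "
    "| samtools sort -@ 4 -o sample.sorted.bam -")
run("samtools index sample.sorted.bam")  # BAI index for random access (e.g., IGV)
```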
After alignment, several additional steps are performed to evaluate alignment quality and prepare the data for downstream analyses. First, quality control on the aligned BAM files must be conducted. Tools such as SAMtools, Picard, and Qualimap are used to calculate metrics including alignment rate, average coverage depth, the percentage of uniquely mapped reads, and PCR duplication rate. For example, Picard’s MarkDuplicates (10) tool identifies multiple copies originating from the same original DNA fragment that were generated during PCR and marks these reads as duplicates. A high duplication rate typically indicates low library complexity or excessive PCR amplification. Because duplicate reads can lead to false-positive results in variant analyses, marking and ignoring these reads in downstream steps is generally preferred. Qualimap provides comprehensive reports visualizing potential biases such as the genomic distribution of alignments, uniformity of coverage, and GC-content–dependent mapping biases. These evaluations help determine whether any issues occurred during the alignment step. For instance, an alignment rate that is significantly below expectations may indicate that the wrong reference genome was used or that the data contain substantial contamination.
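The sketch below shows the two checks just described, assuming samtools and the picard wrapper script (e.g., from a conda install) are on PATH; with the jar distribution the call would instead begin with java -jar picard.jar. File names are placeholders.

```python
# Post-alignment QC sketch: mapping statistics plus duplicate marking.
import subprocess

def run(cmd):
    subprocess.run(cmd, shell=True, check=True)

run("samtools flagstat sample.sorted.bam")   # alignment rate, pairing, duplicates
run("picard MarkDuplicates "
    "I=sample.sorted.bam "                   # input coordinate-sorted BAM
    "O=sample.dedup.bam "                    # output with duplicates flagged
    "M=duplication_metrics.txt")             # per-library duplication metrics
```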
In GATK-based workflows for DNA variant analysis, several preprocessing steps are recommended after alignment. One such step is Base Quality Score Recalibration (BQSR), which models systematic errors introduced by sequencing instruments and adjusts base quality scores accordingly, thereby reducing false positives during variant calling. In earlier versions of GATK, local realignment—performed as a separate step—was used to correct misalignments in problematic indel regions by realigning them locally. However, modern GATK’s HaplotypeCaller performs this correction internally. Once these post-alignment preparation steps are completed, a reliable and well-processed alignment dataset suitable for variant analysis is obtained.
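In GATK4, BQSR is a two-step procedure, sketched below under the assumption that gatk is on PATH and a known-sites VCF such as dbSNP is available; file names are placeholders.

```python
# Base Quality Score Recalibration sketch (GATK4, two steps).
import subprocess

def run(cmd):
    subprocess.run(cmd, shell=True, check=True)

# Step 1: model systematic base-quality errors against known variant sites.
run("gatk BaseRecalibrator -R ref.fa -I sample.dedup.bam "
    "--known-sites dbsnp.vcf.gz -O recal.table")
# Step 2: write a new BAM with adjusted base quality scores.
run("gatk ApplyBQSR -R ref.fa -I sample.dedup.bam "
    "--bqsr-recal-file recal.table -O sample.recal.bam")
```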
Variant Calling and Genotyping
One of the primary goals of genome-scale DNA sequencing is to identify the variants present in an individual’s genome. Variant calling is the process of detecting nucleotide changes in the genome from aligned sequence data. These changes mainly consist of single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels), whereas larger copy number variations and structural variants require dedicated tools. Today, one of the most widely used software packages for variant calling is GATK, developed by the Broad Institute. With its “Best Practices” workflow—particularly the HaplotypeCaller tool—GATK accurately identifies SNPs and indels (11). This workflow includes pre-processing steps such as duplicate marking and base quality recalibration (BQSR), followed by extraction of raw variants using HaplotypeCaller. The tool reconstructs local haplotypes from clusters of reads, which results in higher accuracy than classical pileup-based methods, especially for detecting indels.
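A minimal single-sample invocation is sketched below (gatk on PATH, placeholder file names).

```python
# Call SNPs and indels; HaplotypeCaller reassembles local haplotypes in
# active regions rather than inspecting individual pileup columns.
import subprocess

subprocess.run(
    "gatk HaplotypeCaller -R ref.fa -I sample.recal.bam -O raw_variants.vcf.gz",
    shell=True, check=True,
)
# For multi-sample cohorts, per-sample calling with -ERC GVCF followed by
# joint genotyping (GenomicsDBImport + GenotypeGVCFs) is the usual pattern.
```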
The raw variant calling process produces a Variant Call Format (VCF) file containing the genomic coordinates of each variant, reference and alternate alleles, and various evaluation metrics such as read depth and quality scores. Filtering this raw variant list is essential to distinguish high-confidence calls. The VariantRecalibrator tool (VQSR) in GATK re-scores variants using a model-based approach built on quality metrics. For smaller datasets, hard-filtering can be applied, removing variants that fall below thresholds related to depth, quality score, or strand bias. Variants with extremely low coverage or low confidence are typically discarded because they are more likely to represent sequencing artifacts. The filtered set represents the reliable mutations found in the individual’s genome relative to the reference.
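To make the VCF structure concrete, the pure-Python sketch below applies an illustrative hard filter on the QUAL column and the DP (depth) field of the INFO column. The thresholds are arbitrary examples; GATK's documented hard filters also use annotations such as QD, FS, and MQ, and VariantFiltration or VQSR should be preferred in practice.

```python
# Hard-filter illustration: keep variants with QUAL >= 30 and DP >= 10.
import gzip

def passes_hard_filter(line, min_qual=30.0, min_depth=10):
    fields = line.rstrip("\n").split("\t")
    qual_str = fields[5]                          # column 6: QUAL
    qual = float(qual_str) if qual_str != "." else 0.0
    info = dict(                                  # column 8: INFO (key=value;flag)
        kv.split("=", 1) if "=" in kv else (kv, True)
        for kv in fields[7].split(";")
    )
    depth = int(info.get("DP", 0))
    return qual >= min_qual and depth >= min_depth

with gzip.open("raw_variants.vcf.gz", "rt") as fin, open("filtered.vcf", "w") as fout:
    for line in fin:
        if line.startswith("#") or passes_hard_filter(line):
            fout.write(line)                      # keep headers and passing variants
```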
To interpret the biological significance of called variants, the next step is functional annotation. Variant annotation aims to determine which gene or genomic region the variant resides in, whether it causes an amino acid change at the protein level, whether it has been previously reported in literature or databases, and its potential functional impact. Tools such as ANNOVAR, SnpEff, and Ensembl VEP (Variant Effect Predictor) are widely used for this purpose. These tools match the VCF file against reference gene annotations and report each variant’s genomic location and predicted biological effect. Using these tools, one can determine whether a variant lies within a gene, exon, intron, or regulatory region. For coding regions, effect categories such as missense (amino acid–altering), nonsense (stop-gain), or synonymous (silent) changes can be automatically reported.
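As one hedged example, the snippet below runs SnpEff, which reads a VCF and writes an annotated VCF to stdout; it assumes the snpEff jar and a pre-built genome database (the "GRCh38.99" label here is a placeholder) have been downloaded.

```python
# Functional annotation sketch with SnpEff.
import subprocess

with open("annotated.vcf", "w") as out:
    subprocess.run(
        ["java", "-jar", "snpEff.jar", "GRCh38.99", "filtered.vcf"],
        stdout=out,  # SnpEff writes the annotated VCF to stdout
        check=True,
    )
# Each variant now carries an ANN field in INFO naming the affected gene and
# the predicted effect (e.g., missense_variant, stop_gained, synonymous_variant).
```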

In conclusion, genomic data analysis provides a powerful framework at the center of modern biological and medical research, enabling us to understand complex biological systems. This multi-step process—ranging from quality control of raw sequencing data to alignment, variant detection, functional interpretation, and visualization—not only requires selecting the appropriate tools but also structuring the analysis workflow in accordance with the underlying biological question. As next-generation sequencing technologies continue to advance, genomic analysis methods are becoming increasingly sensitive, faster, and more comprehensive.
References
1. Park, S. J., Kim, J. H., Yoon, B. H., & Kim, S. Y. (2017). A ChIP-Seq Data Analysis Pipeline Based on Bioconductor Packages. Genomics & Informatics, 15(1), 11–18. https://doi.org/10.5808/GI.2017.15.1.11
2. Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), 2114–2120. https://doi.org/10.1093/bioinformatics/btu170
3. Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), 10–12.
4. Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047–3048. https://doi.org/10.1093/bioinformatics/btw354
5. Andrews, S. (2010). FastQC: A quality control tool for high throughput sequence data. Babraham Bioinformatics. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
6. Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), 1754–1760. https://doi.org/10.1093/bioinformatics/btp324
7. Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357–359. https://doi.org/10.1038/nmeth.1923
8. Kim, D., Paggi, J. M., Park, C., Bennett, C., & Salzberg, S. L. (2019). Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology, 37(8), 907–915. https://doi.org/10.1038/s41587-019-0201-4
9. Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., & Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15–21. https://doi.org/10.1093/bioinformatics/bts635
10. Broad Institute. (2025). Picard command-line tools: MarkDuplicates. https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates
11. Broad Institute. (2025). About the GATK Best Practices. GATK Documentation. https://gatk.broadinstitute.org/hc/en-us/articles/360035894711-About-the-GATK-Best-Practices
12. HBC Training & Outreach. (2025). Read Alignment for Next-Generation Sequencing. In-Depth NGS Data Analysis Course. https://hbctraining.github.io/In-depth-NGS-Data-Analysis-Course/sessionVI/lessons/01_alignment.html