From Biology to Insight: Genomic Analysis Workflow
- Ceyda Güven
In genomic data analysis, the first step is to define, clearly and precisely, the biological question the study aims to answer; the choice of experimental approach and data analysis strategy follows from this central focus. Isolating high-quality DNA or RNA from the target organism's samples is equally important, because any contamination introduced during the experimental procedures carries through to later stages of the analysis and can directly affect the results.
The next step involves selecting a suitable sequencing platform for the isolated nucleic acids. When choosing a platform, parameters such as read length, error rate, cost, and data output should be considered. Illumina is widely used in short-read sequencing technologies, while Oxford Nanopore is among the preferred platforms for long-read sequencing. For more information on DNA sequencing methods, you can refer to our article on DNA Sequencing Techniques.
Upon completion of the sequencing process, raw data for analysis are obtained. These data are mostly in FASTQ format, which stores the nucleotide sequence and a per-base quality score for each read. Each record in a FASTQ file consists of four lines. As shown in Figure 1, the first line, starting with the “@” symbol, contains the read identifier; the second line contains the read sequence; the third line is a separator marked with a “+” to assist with readability; and the fourth line provides the quality scores (1). The quality of the raw data is critically important for the accuracy and reliability of the results in subsequent stages of the analysis.

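To make the format concrete, here is a minimal Python sketch that walks through a FASTQ file record by record and decodes the quality line. The file name reads.fastq is a placeholder, and the snippet assumes the Phred+33 encoding used by current Illumina instruments.

```python
# Minimal FASTQ record parser (assumes Phred+33 quality encoding).
# "reads.fastq" is a placeholder file name.

def parse_fastq(path):
    """Yield (read_id, sequence, qualities) for each 4-line FASTQ record."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:          # end of file
                break
            sequence = handle.readline().rstrip()
            handle.readline()       # "+" separator line, ignored
            quality_line = handle.readline().rstrip()
            # Phred+33: quality score = ASCII code of the character minus 33
            qualities = [ord(ch) - 33 for ch in quality_line]
            yield header[1:], sequence, qualities

for read_id, seq, quals in parse_fastq("reads.fastq"):
    print(read_id, seq[:20], quals[:5])
    break  # show only the first record
```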
The quality score for each base of a read is encoded in the fourth line of its FASTQ record. To assess the quality of sequencing data, the bioinformatics tool FastQC is commonly used: it enables rapid quality control of raw sequencing data and helps identify potential errors or issues. Based on the quality control results, it may be necessary to trim low-quality bases so that only high-quality bases enter the analysis. Adapter sequences are frequently encountered, especially in short-read technologies, and such contaminants must also be removed. Once the data are cleaned, the analysis and reporting stages can proceed (Figure 2).

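As a toy illustration of quality-based filtering (not a substitute for dedicated trimming tools), the sketch below keeps only reads whose mean Phred score reaches a threshold. The file names and the threshold of 20 are arbitrary placeholders.

```python
# Filter a FASTQ file, keeping reads whose mean Phred score is >= 20.
# File names and the threshold are illustrative placeholders.
from statistics import mean

MIN_MEAN_QUALITY = 20

with open("reads.fastq") as fin, open("filtered.fastq", "w") as fout:
    while True:
        record = [fin.readline() for _ in range(4)]
        if not record[0]:
            break  # end of file
        # Phred+33 encoding: subtract 33 from each ASCII code
        scores = [ord(ch) - 33 for ch in record[3].rstrip()]
        if scores and mean(scores) >= MIN_MEAN_QUALITY:
            fout.writelines(record)
```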
Once preprocessing of the reads is complete, the workflow proceeds to the sequence alignment stage. If a reference genome is available for the organism under investigation, this step becomes more straightforward: the main goal is to place each read at its most appropriate position on the reference genome. Alignment determines the genomic locations of short or long reads and lays the foundation for interpreting downstream analyses such as variant calling and gene expression quantification.
The alignment strategy is determined by factors such as read length, required sensitivity, error profile, and speed, so different tools are used for short and long reads depending on the sequencing method. For short reads, commonly used tools include BWA (Burrows–Wheeler Aligner) and Bowtie2, while Minimap2 is preferred for long reads thanks to its strong performance. BWA is built around the Burrows–Wheeler transform (BWA-MEM additionally applies Smith–Waterman alignment for local extension) and is optimized for short-read platforms such as Illumina. Minimap2, designed for long reads like those produced by Oxford Nanopore and PacBio, is considered a fast and efficient aligner in this category (3).
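For orientation, the following sketch shows typical command lines for both aligners, wrapped in Python subprocess calls. All file names are placeholders, and the exact options (for example Minimap2's map-ont preset) should be adapted to your data.

```python
# Illustrative short-read (BWA-MEM) and long-read (Minimap2) alignment calls.
# File names are placeholders; both tools must be installed and on PATH.
import subprocess

# Index the reference once (required before running bwa mem).
subprocess.run(["bwa", "index", "reference.fa"], check=True)

# Short reads (e.g. Illumina): BWA-MEM writes SAM to stdout.
with open("short_reads.sam", "w") as sam:
    subprocess.run(["bwa", "mem", "reference.fa", "short_reads.fastq"],
                   stdout=sam, check=True)

# Long reads (e.g. Oxford Nanopore): the map-ont preset tunes Minimap2
# for Nanopore error profiles; map-pb would be used for PacBio reads.
with open("long_reads.sam", "w") as sam:
    subprocess.run(["minimap2", "-ax", "map-ont", "reference.fa",
                    "long_reads.fastq"], stdout=sam, check=True)
```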
In the next stage, the aligner's output is saved in SAM (Sequence Alignment/Map) format, a human-readable text format that records where and how each read maps to the reference. Since SAM files are typically large, they are usually converted to BAM (Binary Alignment/Map), a compressed binary equivalent designed to be read efficiently by software and to support indexing for visualization. This conversion can be performed using tools such as Samtools or Picard.
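A minimal sketch of that conversion with Samtools might look as follows; file names are again placeholders.

```python
# Convert SAM to a coordinate-sorted, indexed BAM with Samtools.
# File names are placeholders; samtools must be installed and on PATH.
import subprocess

# samtools sort reads SAM directly and infers BAM output from the .bam suffix.
subprocess.run(["samtools", "sort", "-o", "aln.sorted.bam", "aln.sam"],
               check=True)

# Indexing enables fast region lookups (needed by viewers such as IGV).
subprocess.run(["samtools", "index", "aln.sorted.bam"], check=True)
```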
In variant analysis, the goal is to identify genetic differences between samples. These differences may include single nucleotide polymorphisms (SNPs), small insertions or deletions (indels), or structural variations. Variant calling is performed using the file containing the aligned sequences. At this stage, positions that show mismatches between the reads and the reference genome—or between compared sample genomes—are identified. These positions are recorded in VCF (Variant Call Format) files (4). Commonly used tools for this purpose include GATK HaplotypeCaller and BCFtools.
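As an illustration, a basic BCFtools calling pipeline could be wired up like this; the reference and BAM file names are placeholders, and real projects usually add further options.

```python
# Illustrative variant-calling sketch with BCFtools (one of the callers
# named above). File names are placeholders; bcftools must be on PATH.
import subprocess

# bcftools mpileup computes genotype likelihoods from the sorted BAM;
# bcftools call -mv emits variant sites only (-v) with the multiallelic
# caller (-m); -Ov writes an uncompressed VCF.
mpileup = subprocess.Popen(
    ["bcftools", "mpileup", "-f", "reference.fa", "aln.sorted.bam"],
    stdout=subprocess.PIPE)
subprocess.run(
    ["bcftools", "call", "-mv", "-Ov", "-o", "variants.vcf"],
    stdin=mpileup.stdout, check=True)
mpileup.stdout.close()
mpileup.wait()
```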
The next step involves filtering out low-quality or incorrectly called variants. Criteria such as QUAL (variant quality score) and MQ (mapping quality) in the VCF file are evaluated. This filtering can be performed using tools like GATK VariantFiltration or VCFtools. The resulting files can then be inspected in alignment viewers such as the Integrative Genomics Viewer (IGV), which lets users assess variant reliability by examining read depth, base quality, and alignment consistency.
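To show what such a hard filter amounts to, here is a bare-bones sketch that keeps only variants whose QUAL value clears a threshold. The cutoff of 30 and the file names are arbitrary illustrative choices, not a replacement for the dedicated tools above.

```python
# Minimal hard filter: keep variants with QUAL >= 30 (illustrative
# threshold; production filters use tools like GATK VariantFiltration).

with open("variants.vcf") as fin, open("variants.filtered.vcf", "w") as fout:
    for line in fin:
        if line.startswith("#"):          # keep all header lines
            fout.write(line)
            continue
        fields = line.split("\t")
        qual = fields[5]                  # VCF column 6 is QUAL
        if qual != "." and float(qual) >= 30:
            fout.write(line)
```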
Identifying variants alone is often not sufficient; variant annotation is the crucial next step for interpreting the detected genetic variations biologically. Annotation reports whether a variant falls within a gene or in an intergenic (between-gene) region, and reveals details such as the name of the affected gene, its known disease associations, and the variant's predicted impact at the protein level. Tools such as SnpEff (5) and ANNOVAR (6) are commonly used for this purpose.
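A typical SnpEff invocation follows the pattern sketched below. Note that the jar location and the database name (GRCh38.99 here) depend entirely on your installation and organism, so both are assumptions.

```python
# Illustrative SnpEff annotation call. The path to snpEff.jar and the
# database name ("GRCh38.99") are assumptions that vary per installation.
import subprocess

# SnpEff writes the annotated VCF to stdout, so we redirect it to a file.
with open("variants.annotated.vcf", "w") as out:
    subprocess.run(
        ["java", "-Xmx8g", "-jar", "snpEff.jar", "GRCh38.99",
         "variants.filtered.vcf"],
        stdout=out, check=True)
```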
The interpretation of the functions of genes that carry variants, and of the biological processes in which they participate, is referred to as functional analysis; it places genomic data in a biological context. After annotation, functional enrichment analysis is performed to evaluate which biological functions are significantly associated with the identified genes. Gene Ontology (GO) classifies genes into three main categories: biological process, molecular function, and cellular component. KEGG (Kyoto Encyclopedia of Genes and Genomes), in turn, maps genes onto biochemical pathways (7). Functional analysis makes it possible, for example, to associate differentially expressed genes with disease or infection states; Figure 3 illustrates this by showing the investigated genes shared between breast cancer and bladder cancer gene sets.

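The statistical core of such an enrichment test is a simple over-representation calculation. The sketch below runs it for a single hypothetical gene set using SciPy; all the counts are made-up illustrative numbers.

```python
# A minimal over-representation test for one gene set (the statistical
# core of GO/KEGG enrichment). All counts are illustrative, not real data.
from scipy.stats import hypergeom

M = 20000   # background: all annotated genes
n = 150     # genes annotated to the pathway/GO term of interest
N = 300     # genes in our study list (e.g. variant-containing genes)
k = 12      # overlap: study genes that belong to the term

# P(X >= k) under random sampling without replacement
p_value = hypergeom.sf(k - 1, M, n, N)
print(f"enrichment p-value: {p_value:.3g}")
```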
To validate the findings obtained from genomic analyses, statistical tests are conducted. These analyses primarily aim to assess the significance of variants in comparisons between sample groups, using measures such as the p-value, the odds ratio, and the false discovery rate (FDR). The p-value is typically used to test whether a variant occurs more frequently in the disease group than in the healthy control group (9). The odds ratio quantifies the strength of the association between a variant and a particular phenotype. When thousands of variants are tested across the genome, many nominally significant results can arise purely by chance; the false discovery rate is used to control the expected proportion of false-positive findings among the results reported as significant (10).
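For example, the Benjamini–Hochberg procedure from reference (10) can be written in a few lines; the p-values at the bottom are illustrative.

```python
# Benjamini–Hochberg procedure: control the FDR across many variant tests.
# The p-values in the example list are illustrative.

def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k/m) * alpha,
    # then reject the k hypotheses with the smallest p-values.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    return sorted(order[:k_max])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(pvals))  # indices of significant tests: [0, 1]
```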
The deposition and open sharing of genomic data in international databases is of great importance for the reusability of the data and for guiding future research. In the final steps of the study, raw sequencing data are uploaded to repositories such as NCBI SRA (Sequence Read Archive) or the European Nucleotide Archive (ENA). Gene expression profiles are stored in platforms like GEO (Gene Expression Omnibus). Variant data are shared in VCF format and can be submitted to specific projects such as ClinVar.
In addition, reporting and sharing the findings with the scientific community through publications is also an essential part of the process. Scientific articles must transparently present details such as the methods used, the version of the reference genome, the software tools applied, and any parameter settings. Genomic data analysis is a multi-step and systematic process that guides us from a biological question to biological insights derived from raw sequencing data. This process—ranging from quality control of raw data to alignment, variant analysis, and interpretation—requires specialized expertise. Finally, open data sharing and scientific reporting make this process more transparent, accessible, and productive, allowing genomic information to be evaluated by broader scientific communities.
References
Burian, A. N., Zhao, W., Lo, T.-W., & Thurtle-Schmidt, D. M. (2021). Genome sequencing guide: An introductory toolbox to whole-genome analysis methods. Biochemistry and Molecular Biology Education, 49(5), 815-825. https://doi.org/10.1002/bmb.21561
Elshazly, H. M. (2016). Optimizing bioinformatics variant analysis pipeline for clinical use (Master's thesis). https://doi.org/10.13140/RG.2.2.14653.67040
Taşar, O., Çınar, E., & Onay, H. (2018). Diagnostic Next Generation Sequencing Data Analysis for Variant: Requirements and a Proposition. CEUR-WS.org, 2201, 12.
Broad Institute. (2024). VCF (Variant Call Format). GATK. https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format Accessed May 1, 2025.
Cingolani, P. et al. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly, 6(2), 80–92. https://doi.org/10.4161/fly.19695
Wang, K., Li, M., & Hakonarson, H. (2010). ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research, 38(16), e164. https://doi.org/10.1093/nar/gkq603
Jing, L. S., Shah, F. F. M., Mohamad, M. S., Moorthy, K., Deris, S., Zakaria, Z., & Napis, S. (2015). A review on bioinformatics enrichment analysis tools towards functional analysis of high throughput gene set data. Current Bioinformatics, 12(1), 14–27. https://doi.org/10.2174/157016461201150506200927
Yu, G. Chapter 5: enrichplot. In Biomedical Knowledge Mining Book. https://yulab-smu.top/biomedical-knowledge-mining-book/enrichplot.html Accessed May 2, 2025.
Balding, D. J. (2006). A tutorial on statistical methods for population association studies. Nature Reviews Genetics, 7(10), 781–791. https://doi.org/10.1038/nrg1916
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300. https://www.jstor.org/stable/2346101