
An Overview of Genomic Databases

Updated: May 8

As the amount of data generated in genomic research continues to grow rapidly, genomic databases are increasingly used to store and interpret this information. Various databases are utilized to quickly access different types of data such as genome sequences, variants, gene expression profiles, and disease associations. This article highlights the core purposes, data types provided, and key features of several major genomic databases.


NCBI (National Center for Biotechnology Information)

The National Center for Biotechnology Information (NCBI), part of the U.S. National Library of Medicine, gives scientists access to biomedical and genomic data. Rather than being a single database, NCBI is a platform that hosts multiple important databases. For example, GenBank is a fundamental public database of publicly available nucleotide sequences. Similarly, RefSeq offers curated reference records for genomic and protein sequences; the Gene database includes comprehensive gene-level information; and GEO (Gene Expression Omnibus) stores high-dimensional gene expression data such as microarray and RNA-seq datasets. NCBI also includes literature databases like PubMed, providing an integrated access point for biomedical knowledge.

The significance of NCBI lies in its ability to provide various types of data under one roof. Through NCBI, researchers can search for a gene and access its sequence, variants, related literature, and more. For example, a gene's DNA or protein sequence can be accessed via the Nucleotide and Protein databases, while its functional annotations can be found in the Gene database. NCBI features a user-friendly web interface and supports automation through the E-utilities, a set of web services for programmatic access. Additionally, large-scale datasets can be downloaded via FTP. In summary, NCBI is an indispensable starting point and data repository for genomic research (1). The NCBI platform can be accessed at https://www.ncbi.nlm.nih.gov/.
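As a rough illustration of the programmatic access mentioned above, the sketch below builds request URLs for two E-utilities endpoints: `esearch` (find record IDs matching a query) and `efetch` (retrieve records, here as FASTA). It only constructs the URLs and makes no network call. The helper names, the search term, and the accession are illustrative assumptions, not part of the original article.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities web services.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=5):
    """Build an esearch URL that returns matching record IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_fasta_url(db, ids):
    """Build an efetch URL that returns the given records in FASTA format."""
    params = {"db": db, "id": ",".join(ids), "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Illustrative query: nucleotide records annotated with the BRCA1 gene in human.
search = esearch_url("nucleotide", "BRCA1[gene] AND Homo sapiens[orgn]")

# Illustrative accession; in practice the IDs would come from the esearch result.
fetch = efetch_fasta_url("nucleotide", ["NM_007294"])

print(search)
print(fetch)
```

Either URL can then be opened with any HTTP client, for example `urllib.request.urlopen`. For anything beyond occasional queries, NCBI's usage guidelines on request rates and API keys should be consulted.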

Figure 1. NCBI is a bioinformatics portal providing access to genetic data and analysis tools.

ClinVar

ClinVar is a database within NCBI that compiles information on the clinical significance of genomic variants. It aggregates data shared by various laboratories, clinicians, and researchers to provide comprehensive insights into the potential health impact of specific variants. At its core, ClinVar focuses on the relationship between a variant and human phenotypes, particularly diseases.

For instance, a variant may be classified as pathogenic (disease-causing), likely pathogenic, benign, or of uncertain significance. When a mutation is detected in a genetic test, ClinVar can be used to check whether it has been previously reported and how it was interpreted. For example, if a change in the BRCA1 gene is identified in a patient, a search in ClinVar may reveal that it has been previously reported as a pathogenic mutation associated with breast cancer.

In summary, ClinVar provides a reliable platform for researchers and clinicians to rapidly assess the clinical relevance of genomic variants (2). The ClinVar database can be accessed at https://www.ncbi.nlm.nih.gov/clinvar/.
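Because ClinVar aggregates interpretations from many submitters, a single variant can carry several classifications at once. The sketch below is a simplified illustration of that idea: it tallies a list of submitted labels and flags the case where pathogenic-side and benign-side calls coexist. The function name, the grouping of labels, and the sample inputs are assumptions made for illustration; this is not ClinVar's actual aggregation or review-status logic.

```python
from collections import Counter

# Labels from the text above, plus "likely benign", which ClinVar also uses.
PATHOGENIC_SIDE = {"pathogenic", "likely pathogenic"}
BENIGN_SIDE = {"benign", "likely benign"}


def summarize_submissions(labels):
    """Count submitted classifications and flag pathogenic/benign disagreement."""
    counts = Counter(label.strip().lower() for label in labels)
    has_pathogenic = any(label in PATHOGENIC_SIDE for label in counts)
    has_benign = any(label in BENIGN_SIDE for label in counts)
    return {
        "counts": dict(counts),
        # True only when both sides are represented among the submissions.
        "conflicting": has_pathogenic and has_benign,
    }


# Hypothetical submissions for one variant (not real ClinVar data).
consistent = summarize_submissions(["Pathogenic", "Likely pathogenic", "Pathogenic"])
mixed = summarize_submissions(["Pathogenic", "Benign", "Uncertain significance"])

print(consistent)
print(mixed)
```

In this toy model, the first variant is reported consistently on the pathogenic side, while the second would need a closer look at the individual submissions before drawing any conclusion.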

Figure 2. ClinVar search results and clinical interpretations of BRCA1 gene variants.

OMIM

OMIM (Online Mendelian Inheritance in Man) is a continuously updated, catalog-style database that contains extensive information on human genes and genetic disorders. For each gene, OMIM provides detailed summaries about associated diseases or phenotypes. These summaries typically include the gene's function, known pathogenic variants, clinical features, inheritance patterns (such as autosomal dominant or recessive), and key references from the scientific literature. The practical value of OMIM lies in its ability to quickly convey gene-disease relationships. For instance, by searching the name of a hereditary disorder or its MIM number (a unique identifier assigned to each entry by OMIM), users can identify the gene(s) responsible for the disease and understand the underlying mutation mechanisms. Likewise, entering the name of a gene will generate a list of known disorders associated with that gene.


As OMIM is organized in an encyclopedia-like format, entries are presented as readable pages. However, the information is often interconnected with other databases such as NCBI, Ensembl, or ClinVar. For example, a mutation listed in OMIM may link directly to its corresponding ClinVar record.

OMIM provides a fast and user-friendly search experience via its website, and the database is updated daily to incorporate the latest findings from the scientific literature (3). The OMIM database can be accessed at https://omim.org/.

Figure 3. Gene and phenotype entries associated with PKM2 (OMIM).

Ensembl

Ensembl is a comprehensive genome browser and database that began as a joint project of the European Bioinformatics Institute (EMBL-EBI) and the Wellcome Sanger Institute. It is designed to support research in comparative genomics, evolution, sequence variation, and transcriptional regulation. The platform hosts genomes of hundreds of species, including humans, and provides extensive data for each genome, such as gene annotations, variant data, homologous gene information, and regulatory elements. One of Ensembl's key features is the Compara database, which enables the comparison of genes and genomic regions across species.

Ensembl offers both a web interface and powerful tools: the Ensembl Genome Browser allows detailed graphical visualization of genes or chromosomal regions; the BioMart tool enables large-scale queries to extract custom data subsets; and the REST API gives programmers direct access to Ensembl data. Ensembl also offers integrated analysis tools such as BLAST/BLAT for sequence similarity searches and the Variant Effect Predictor (VEP), which predicts the consequences of variants on genes, transcripts, and proteins. In summary, Ensembl provides researchers with a rich platform for comparative genomic analysis across multiple species (4). The Ensembl database can be accessed at https://www.ensembl.org/index.html.
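To give a feel for the REST API mentioned above, here is a minimal sketch that builds a request URL for the symbol lookup endpoint (`/lookup/symbol/:species/:symbol`) on `rest.ensembl.org`. It only constructs the URL and does not contact the server. The helper name and the choice of TP53 as the example gene are illustrative assumptions.

```python
from urllib.parse import quote, urlencode

# Public Ensembl REST server.
ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_symbol_url(symbol, species="homo_sapiens", expand=False):
    """Build a URL that looks up a gene by symbol and asks for JSON output."""
    path = f"/lookup/symbol/{quote(species)}/{quote(symbol)}"
    params = {"content-type": "application/json"}
    if expand:
        # Ask the server to include the gene's transcripts in the response.
        params["expand"] = 1
    return f"{ENSEMBL_REST}{path}?{urlencode(params)}"


url = lookup_symbol_url("TP53")
print(url)
```

The response to such a request is a JSON record describing the gene (identifier, location, biotype, and so on). The same pattern, a path plus a requested content type, applies to the other endpoints listed in the Ensembl REST documentation.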

Figure 4. Ensembl homepage user interface.

GTEx

The GTEx (Genotype-Tissue Expression) project is a large-scale NIH initiative designed to examine the relationship between genotype and tissue-specific gene expression. Its primary goal is to measure gene expression in non-diseased tissue samples from a large number of postmortem donors, thereby revealing how naturally occurring DNA variants affect gene expression. The v8 data release contains gene expression profiles from 17,382 RNA-seq samples covering 54 tissue sites (two of which are cell lines), along with genotype data from 838 individuals. These publicly available data support the analysis of each gene's baseline expression across tissues as well as eQTL (expression quantitative trait loci) analyses.

One of the most important outputs of GTEx has been the mapping of gene expression differences between tissues. For example, the expression levels of a gene in different organs, such as the liver, brain, or heart, can be explored interactively through the GTEx Portal. Researchers can also use GTEx data to observe the effect of a variant on gene expression through eQTL analysis. For instance, it might be discovered that a specific SNP increases the expression of a gene in the thyroid gland. In conclusion, GTEx is an excellent resource for researchers aiming to understand the genetic basis of gene expression, providing information about how strongly a gene is expressed in different tissues and how that expression relates to DNA variants (5). The GTEx database can be accessed at https://www.gtexportal.org/home/.
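The core idea behind an eQTL test can be sketched as a simple linear regression of expression on genotype dosage (0, 1, or 2 copies of the alternate allele): a non-zero slope suggests that the variant is associated with expression. The code below is a toy illustration with made-up numbers, not the GTEx pipeline, which additionally normalizes expression, adjusts for covariates, and corrects for multiple testing.

```python
def eqtl_slope(dosages, expression):
    """Least-squares slope and intercept of expression regressed on dosage."""
    n = len(dosages)
    if n != len(expression) or n < 2:
        raise ValueError("need two equally long lists with at least 2 samples")
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    # Sum of squared deviations of genotype; zero means every donor has the same genotype.
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    if sxx == 0:
        raise ValueError("no genotype variation among samples")
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    slope = sxy / sxx
    intercept = mean_e - slope * mean_g
    return slope, intercept


# Hypothetical data: six donors, alternate-allele dosage and expression in one tissue.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]

slope, intercept = eqtl_slope(dosages, expression)
print(f"effect per allele copy: {slope:.2f}, baseline: {intercept:.2f}")
```

In this toy dataset each additional copy of the allele raises expression by about one unit, which is the kind of pattern the thyroid example above describes.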

Figure 5. Graph showing the expression levels of the TP53 gene in different human tissues on GTEx Portal.

UCSC Genome Browser

The UCSC Genome Browser, offered by UC Santa Cruz, is a popular web-based tool for exploring genomes graphically. It lets the user quickly visualize any genomic region at the desired scale and displays multiple annotation layers ("tracks") aligned to the same coordinates. All information about the region is collected on a single screen: gene predictions, alignments of mRNA and EST sequences to the genome, known SNPs, epigenetic marks such as DNA methylation and histone modifications, phenotypic variation, and comparative genomic alignments between species can all be viewed in the same window.

This approach enables researchers to perform multifaceted analyses of specific genomic regions, since the status of a particular gene or region can be compared across different datasets at a glance. For instance, mutations in a specific gene region within a cancer genome can be examined alongside population variant frequencies (from 1000 Genomes or gnomAD), gene structure (exon-intron organization), and conservation level (multi-species comparison). The Table Browser tool provides access to the database tables underlying the tracks shown in the browser; users can query these tables or download the raw data from them.

In terms of access methods, the UCSC Genome Browser is available directly via its web interface. For large datasets it also offers bulk downloads (FTP), direct MySQL queries, and programmatic access through a REST API (6). The UCSC Genome Browser can be accessed at https://genome.ucsc.edu/.
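As a small illustration of the REST API mentioned above, the sketch below builds a URL for the `getData/sequence` endpoint on `api.genome.ucsc.edu`. One detail worth showing is coordinates: positions displayed in the browser are 1-based and inclusive, while the API expects a 0-based start and a 1-based end, so the start must be shifted by one. The helper names and the example interval are illustrative assumptions, and the code only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

# Public UCSC Genome Browser REST API server.
UCSC_API = "https://api.genome.ucsc.edu"


def to_zero_based_half_open(start_1based, end_1based):
    """Convert a 1-based inclusive interval to 0-based start, exclusive end."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    return start_1based - 1, end_1based


def sequence_url(genome, chrom, start_1based, end_1based):
    """Build a getData/sequence URL for an interval given in browser coordinates."""
    start, end = to_zero_based_half_open(start_1based, end_1based)
    params = {"genome": genome, "chrom": chrom, "start": start, "end": end}
    return f"{UCSC_API}/getData/sequence?{urlencode(params)}"


# Arbitrary example interval on chr17 of the hg38 assembly (1,000 bases).
url = sequence_url("hg38", "chr17", 1001, 2000)
print(url)
```

Keeping the coordinate conversion in its own function makes off-by-one mistakes easier to spot when the same interval is passed between the browser, the Table Browser, and the API.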

Figure 6. Genomic information layers for the TP53 gene in the UCSC Genome Browser.

The genomic databases discussed in this article (NCBI, ClinVar, OMIM, Ensembl, GTEx, and the UCSC Genome Browser) form the building blocks of modern biological research. Their data range from genetic sequences and gene expression to clinical variant interpretation and comparative genomic analysis, enabling researchers to interpret genetic data and answer biological questions faster and more effectively. Each database is designed to address a different need, and when used together they support deeper and more comprehensive analyses, highlighting the potential of genomic information in both basic science and medical applications.


Comparative Summary of Genomic Databases

The fundamental features of the genomic databases discussed above are summarized in the table below, which gives a comparative overview of each resource's purpose and the types of data it contains.

| Database | Purpose / Content | Data Types |
| --- | --- | --- |
| NCBI (General) | Integrated repository of genomic data; hosts several sub-databases (GenBank, PubMed, dbSNP, etc.). | DNA/RNA sequences, protein sequences, genome annotations, variants, literature, etc. |
| UCSC Genome Browser | Graphical visualization of genomes; visual integration of multiple data "tracks" in a region. | Gene locations and exon-intron structures, mRNA/EST alignments, SNPs and other variations, epigenetic markers, multi-species comparisons. |
| Ensembl | Annotated genome browser for the genomes of multiple species; multi-species comparison, variation, and gene regulation data. | Gene and transcript annotations, variants (SNP, indel), comparative genomic alignments, regulatory elements. |
| ClinVar | A database of clinically significant genomic variants (NCBI). | Variant records (genomic location), clinical classifications (pathogenic, benign, VUS), associated disease/phenotype information. |
| OMIM | A database providing reliable and detailed information about Mendelian hereditary diseases and the genes responsible for them. | Textual summaries for each gene and disease: phenotype descriptions, associated genes, mutation examples, inheritance patterns, references. |
| GTEx | A project mapping the relationship between genotype and tissue-specific gene expression. | RNA-seq gene expression levels in various tissues; eQTL analysis results (variant-expression effect data); donor genotypes. |

 REFERENCES

  1. National Center for Biotechnology Information (NCBI) [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [1988] – [cited 2025 Apr 12]. Available from: https://www.ncbi.nlm.nih.gov/

  2. ClinVar [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [cited 2025 Apr 12]. Available from: https://www.ncbi.nlm.nih.gov/clinvar/

  3. Online Mendelian Inheritance in Man, OMIM® [Internet]. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD); [cited 2025 Apr 12]. Available from: https://omim.org/

  4. Ensembl [Internet]. European Bioinformatics Institute; [cited 2025 Apr 12]. Available from: https://www.ensembl.org/

  5. Genotype-Tissue Expression (GTEx) Portal [Internet]. Bethesda (MD): National Institutes of Health (US); [cited 2025 Apr 12]. Available from: https://gtexportal.org/

  6. UCSC Genome Browser [Internet]. Santa Cruz (CA): University of California, Santa Cruz; [cited 2025 Apr 12]. Available from: https://genome.ucsc.edu/

 
