Chemical Databases: Digital Guides to Molecular Discovery
- Emre Can Buluz

- Apr 28
Updated: Jun 10
Chemical databases are digital resources that support researchers in fields such as chemistry, biochemistry, drug discovery, and materials science. They systematically present critical information such as the structures, physicochemical properties, biological activities, and synthesis methods of chemical compounds, which improves both the accuracy and the speed of scientific work. With uses ranging from academic research to industrial applications, chemical databases are becoming increasingly important alongside big data analytics and artificial intelligence-supported discovery processes. In this article, we discuss the most widely used chemical databases, the information they provide, and their roles in research.
PubChem
The PubChem database is one of the largest open access chemical databases containing the structures, physical and biological properties, toxicity data, bioactivity information and calculated properties of chemical compounds (1). Users can access chemical representations of compounds in SMILES and InChI formats, get information about biological activity tests (bioassays) and access structure files suitable for molecular dynamics simulations. The PubChem database can be accessed at https://pubchem.ncbi.nlm.nih.gov/ . PubChem was created by the US National Institutes of Health (NIH) and contains more than 100 million molecules. Reaching millions of users each month, PubChem is a popular resource serving a wide audience, including researchers, chemical health and safety officers, patent attorneys, educators, and students.

Although the majority of chemicals included in PubChem are small molecules, it also includes other chemical compounds such as siRNA, miRNA, lipids, carbohydrates, and chemically modified biopolymers. This data is organized into multiple data collections, such as Substance, BioAssay, Compound, Protein, Gene, Pathway, Cell Line, Taxonomy, and Patent (2,3). The Substance collection archives chemical descriptions provided by repositories, while Compound is a dataset of unique chemical structures extracted from the Substance collection (3). BioAssay contains descriptions and test results of biological assay experiments provided by repositories (4). The Protein, Gene, Pathway, Cell Line, and Taxonomy collections provide targeted chemical information for a specific protein, gene, pathway, cell line, and taxon, respectively (2). The Patent collection contains information about the chemicals, proteins, genes, and taxa mentioned in each patent document.
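PubChem data such as the InChI representation mentioned above can also be retrieved programmatically through PubChem's PUG REST service. The following is a minimal sketch that only builds the request URL for a name lookup. The helper name `pubchem_property_url` and the default property list are choices made for this illustration, not part of PubChem itself.

```python
from urllib.parse import quote

# Base address of PubChem's PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_property_url(name, properties=("MolecularFormula", "InChI", "InChIKey")):
    """Build a PUG REST URL that looks up a compound by name and
    requests the given properties as JSON.

    The function name and defaults are hypothetical helpers for this sketch.
    """
    if not properties:
        raise ValueError("at least one property is required")
    # Percent-encode the name so spaces and slashes are safe inside the path.
    encoded_name = quote(name, safe="")
    property_list = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/name/{encoded_name}/property/{property_list}/JSON"


url = pubchem_property_url("aspirin")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client. Building the URL separately from the network call keeps the example testable offline.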
ChEMBL
ChEMBL is a large, open-access database aimed at capturing medicinal chemistry data and information in the drug discovery process (5). Information on small molecules and their biological activities is extracted from full-text articles in a variety of major medicinal chemistry journals and integrated with data on approved drugs and clinical development candidates, such as mechanism of action and therapeutic indications. Bioactivity data is also shared with other databases such as PubChem BioAssay (6) and BindingDB (7) so that users can benefit from a wider range of information.

The ChEMBL database has many practical applications, including identifying chemical tools for a given target, assessing compound selectivity, training machine learning models for target prediction, helping to generate drug repurposing hypotheses, assessing target accessibility, and integrating with other drug discovery resources (8,9). The ChEMBL database is available at https://www.ebi.ac.uk/chembl/.
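To make one of these applications concrete, compound selectivity is often judged by comparing potency on the intended target with potency on the closest off-target. The sketch below assumes potencies are expressed on a negative log10 molar scale (as with pChEMBL-style values), so a difference of 1.0 corresponds to a 10-fold gap. The target names and numbers are hypothetical and are not taken from ChEMBL.

```python
def selectivity_window(potency_by_target, primary_target):
    """Return the most potent off-target and the log-unit gap between
    the primary target and that off-target.

    Values are assumed to be -log10(molar potency), so larger is more potent.
    """
    primary = potency_by_target[primary_target]
    off_targets = {
        target: value
        for target, value in potency_by_target.items()
        if target != primary_target
    }
    if not off_targets:
        raise ValueError("need at least one off-target measurement")
    # The closest off-target is the one with the highest potency.
    closest_target, closest_value = max(off_targets.items(), key=lambda item: item[1])
    return closest_target, primary - closest_value


# Hypothetical measurements for a single compound (illustration only).
activities = {"TARGET_A": 8.5, "TARGET_B": 6.0, "TARGET_C": 5.5}

target, window = selectivity_window(activities, "TARGET_A")
fold = 10 ** window  # convert the log-unit gap into a fold difference
print(f"Closest off-target: {target}, window: {window} log units, about {fold:.0f}-fold")
```

In practice the input values would come from curated bioactivity records, and measurements from different assay types would need to be compared with care.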
DrugBank
DrugBank was initially designed as a comprehensive, up-to-date, and free web resource containing detailed drug, drug-target, drug mechanism of action, and drug interaction information for approved and experimental drugs. By providing high-quality, primary source content free of charge, DrugBank aims to make the discovery of new drugs, the repurposing of old drugs, the understanding of drug mechanisms, and the monitoring of drug interactions easier for academics, medicinal chemists, pharmacists, and pharmaceutical companies. DrugBank has become one of the world’s most widely used reference drug resources: it receives more than 30 million web accesses and more than 5000 citations per year (10). Through successive updates it has grown into a large information resource, with important additions such as pharmacogenomic data, drug metabolism data, and comprehensive ADMET (absorption, distribution, metabolism, excretion, and toxicity) information. DrugBank 5.0, released in 2018, significantly expanded the scope of the database by adding more experimental drug data, new drug-drug interaction data, new pharmaco-omics data types, and a significant amount of drug spectral data (mass spectrometry (MS) and nuclear magnetic resonance (NMR) data) (11).

DrugBank 6.0 includes many improvements over the previous version: a 72% increase in the number of FDA-approved drugs, a 38% increase in the number of experimental drugs, an increase of approximately 300% in cataloged drug-drug interactions, and a 200% increase in monitored drug-food interactions. Beyond this expansion in database size, thousands of new colorful and richly annotated pathways have been added to visualize drug mechanisms and drug metabolism. In addition, thousands of more accurate and newly predicted spectral data for 1D ¹H and ¹³C NMR, LC-MS, GC-MS, and related chromatographic data have been integrated into the system.

DrugBank’s interface has also been updated to include new data fields and different types of search options. The DrugBank database can be accessed at https://go.drugbank.com/ .
ZINC
Identifying and purchasing new small molecules for testing in biological assays is an important step in molecule discovery, but searching this space becomes a major challenge as the purchasable chemical space, driven by inexpensive compounds produced on demand, continues to grow to tens of billions of molecules. ZINC is a public database that brings together commercially available and annotated compounds. ZINC provides downloadable 2D and 3D versions of molecules and a website that allows rapid molecule lookup and similarity searches. ZINC has grown from fewer than 1 million compounds in 2005 to nearly 2 billion. The database design has evolved over time in response to needs in the field, the growth of the purchasable chemical space, and advances in software and hardware. The ZINC website is used by thousands of researchers every month, and terabytes of data are downloaded every week (12).

The focus of ZINC is molecular docking, and an important application is a practical approach based on the principle of molecular similarity called analog-by-catalog (ABC). This method allows the identification of similar compounds for which structure-activity relationships can be discovered. The ZINC database can be accessed at https://zinc.docking.org/.
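The molecular similarity principle behind analog-by-catalog can be illustrated with the Tanimoto coefficient, a common similarity measure for molecular fingerprints. The sketch below is not ZINC's implementation. It assumes each molecule is already represented as a set of fingerprint bit indices, and the catalog entries are made-up placeholders. In practice the fingerprints would come from a cheminformatics toolkit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared bits divided by all bits set in either fingerprint."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: treat as no similarity
    return len(a & b) / len(union)


def find_analogs(query_bits, catalog, threshold=0.5):
    """Return (name, similarity) pairs at or above the threshold, most similar first."""
    scored = [(name, tanimoto(query_bits, bits)) for name, bits in catalog.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints (sets of "on" bit indices), for illustration only.
catalog = {
    "cmpd_1": {1, 2, 3, 4, 5},
    "cmpd_2": {1, 2, 3, 9},
    "cmpd_3": {7, 8, 9},
}
query = {1, 2, 3, 4}

analogs = find_analogs(query, catalog)
for name, score in analogs:
    print(name, round(score, 2))
```

The threshold is a tunable assumption: a higher value returns fewer, closer analogs, while a lower value widens the search for structure-activity relationships.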
Cambridge Structural Database (CSD)
The Cambridge Structural Database (CSD) has been a core activity of the Cambridge Crystallographic Data Centre (CCDC) since its establishment in 1965. The CCDC is dedicated to providing a permanent archive of crystal structures and making them available to the public. The CSD contains all published organic and metal-organic small molecule crystal structures that have been determined using crystallography techniques. It also serves as a publication tool for structure determinations without an accompanying article. In particular, the CSD includes X-ray and neutron diffraction analyses from single-crystal or powder studies, with cell parameters and atomic coordinates reported. Cell parameters and all other available data are included even when coordinates are not available, to ensure a comprehensive presentation of single crystal data. A single archive of all structures allows crystallographers to avoid inadvertent redetermination of structures and also provides a mechanism for archiving the output of their work in a specialist data centre for their own future use. This enables researchers to easily share their work, which significantly broadens their reach (13). The Cambridge Structural Database (CSD) is available at https://www.ccdc.cam.ac.uk/structures/ .

DrugCentral
DrugCentral is a public database that has been collecting drug information since 2016. It continuously monitors three major regulatory agencies: the Food and Drug Administration (FDA) in the United States, the European Medicines Agency (EMA) in Europe, and the Pharmaceuticals and Medical Devices Agency (PMDA) in Japan. The database provides accurate, high-quality data for preclinical research and clinical applications. Chemical structures, molecular physicochemical descriptors, and patent status are linked to bioactivity data and molecular targets. Approved therapeutic drug uses, off-label uses, and contraindications are manually curated from drug labels and scientific literature. Mechanism-of-action targets and bioactivities are described where available. In addition to pharmacodynamic data, DrugCentral also provides a variety of standardized pharmacokinetic descriptors (14).

DrugCentral tracks new drug approvals and standardizes drug information. The 2023 update adds 285 drugs (131 for human use). New additions include the integration of veterinary drugs (154 drugs for animal use only), the addition of 66 documented off-label uses, and the identification of adverse drug reactions for pediatric and elderly patients from pharmacovigilance data. Additional enhancements include chemical substructure searching using SMILES and “Target Cards” based on UniProt accession codes (15). The DrugCentral database is accessible at https://drugcentral.org/ .
SwissSidechain
The SwissSidechain database is a unified and integrated resource that provides access to curated biochemical and structural information for 210 unnatural side chains. Many of these data are unique to SwissSidechain (e.g. rotamers and biomolecular parameters), while the rest are collected from existing sources and integrated for rapid access. The database also includes visualization and molecular modeling tools. In addition to the 20 natural amino acids, it contains molecular and structural data for the 210 unnatural alpha amino acid side chains in both L- and D-configurations (16). These amino acids were selected based on two main criteria: first, the unnatural side chains are present in publicly available protein structures in the Protein Data Bank (PDB) (17), and second, they are commercially available.

Each amino acid in SwissSidechain is given a three- or four-letter code. The existing three-letter code has been retained for all side chains found in the PDB, and new four-letter codes have been created for the others. The latter include many D-amino acids whose L-configuration counterparts are found in the PDB. For these amino acids, a ‘D’ is simply added to the beginning of the three-letter code (e.g. NLE stands for L-norleucine and DNLE for D-norleucine). For the rest, four-letter codes have been chosen to be as reminiscent of their chemical names as possible (e.g. AZDA for L-azido-alanine). For unnatural side chains in the D-configuration, the codes always start with the letter ‘D’ (e.g. DZDA for D-azido-alanine) (16). The SwissSidechain database is accessible at https://www.swisssidechain.ch/ .
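The one naming rule that the text states explicitly, prefixing a PDB three-letter code with ‘D’ to name the D-enantiomer, can be written as a small helper. This is a sketch of that single rule only. The function name is hypothetical, and the four-letter codes are deliberately not handled because they are assigned by hand rather than derived.

```python
def d_code_for_pdb_residue(l_code):
    """Derive the four-letter D-configuration code from a three-letter
    PDB code by prefixing 'D' (e.g. NLE -> DNLE).

    Only the three-letter rule is covered; other codes raise ValueError.
    """
    code = l_code.strip().upper()
    if len(code) != 3 or not code.isalnum():
        raise ValueError(f"expected a three-letter PDB code, got {l_code!r}")
    return "D" + code


print(d_code_for_pdb_residue("NLE"))
```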
SuperNatural
The SuperNatural database was first published as an open access database in 2006 (18), and its second version (SuperNatural II) was released in 2015 (19). It has supported researchers across the broad scientific community by enabling the screening of new molecules and biological activities, so that natural products can be used as an important resource for drug discovery. It has also been successfully used in the identification of lead structures (20, 21).

SuperNatural 3.0 is a database of freely available natural compounds that brings together information collected from various sources and the scientific literature. It also includes information on reliable chemical suppliers offering natural products. In version 3.0, information on 449,058 unique compounds has been collected and presented as downloadable files. The database has been meticulously curated using various curation criteria, and compounds have been assigned a confidence score depending on their level of annotation. All natural compounds with taxonomy or supplier information and linked to at least three open access natural product (NP) databases were assigned a confidence score of 1. A confidence score of 0.5 was given to compounds with no taxonomy information but linked to at least one NP database outside the SuperNatural database. The database also includes information on physicochemical properties, toxicity class, mechanism of action (MoA), therapeutic pathways, targeted library, taste information, and disease indications (22). The SuperNatural database can be accessed via https://bioinf-applied.charite.de/supernatural_3/.
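The two confidence-score rules described above can be expressed as a short function. This is a sketch of the rules as summarized in this article, not the SuperNatural curation code. The function and parameter names are hypothetical, and cases the text does not cover return `None` rather than a guessed score.

```python
from typing import Optional


def confidence_score(has_taxonomy: bool, has_supplier: bool,
                     external_np_db_links: int) -> Optional[float]:
    """Apply the two scoring rules stated in the text.

    external_np_db_links: number of open access NP databases, outside
    SuperNatural itself, that the compound is linked to.
    """
    # Rule 1: taxonomy or supplier information plus links to at least three NP databases.
    if (has_taxonomy or has_supplier) and external_np_db_links >= 3:
        return 1.0
    # Rule 2: no taxonomy information but linked to at least one external NP database.
    if not has_taxonomy and external_np_db_links >= 1:
        return 0.5
    # Any other combination is not described in the text.
    return None


print(confidence_score(True, False, 3))
print(confidence_score(False, False, 1))
```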
COCONUT Database
COCONUT is compiled from a large number of chemical data sources, and the natural products (NPs) obtained from these sources are meticulously extracted, organized, processed, and annotated. The resulting collection of natural products is presented in a fully-fledged chemical database specifically developed for this purpose (23). The COCONUT database is open and free to the public; no login is required for access. The web interface allows simple searches by molecule name, InChI, InChI key, SMILES, drawn structure, and molecular formula. Advanced searches by molecular features, as well as substructure and similarity searches, can also be performed. Users can download the entire dataset or search results in different formats. The database can be queried programmatically via the REST API, allowing COCONUT to be integrated into various workflows. The web interface, backend system, and database are distributed as Docker containers and can be easily ported to other natural product datasets and implemented as local installations. The COCONUT database can be accessed at https://coconut.naturalproducts.net/ .
Chemical databases are one of the fundamental building blocks of modern chemistry and biotechnology research. They guide research by providing comprehensive data on the structures, properties, interactions, and biological activities of molecules. These databases are used in a wide range of areas, from the design of new compounds to drug discovery. In addition, rapid access to chemical information allows researchers to work more efficiently. As a result, chemical databases are indispensable tools for accelerating research, supporting scientific findings, and developing innovative solutions. Therefore, scientists' effective use of these resources will contribute to the acceleration of scientific progress.
References
1. Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J., He, S., Li, Q., Shoemaker, B. A., Thiessen, P. A., Yu, B., Zaslavsky, L., Zhang, J., & Bolton, E. E. (2023). PubChem 2023 update. Nucleic acids research, 51(D1), D1373–D1380. https://doi.org/10.1093/nar/gkac956
2. Kim,S., Cheng,T.J., He,S.Q., Thiessen,P.A., Li,Q.L., Gindulyte,A. and Bolton,E.E. (2022) PubChem Protein, Gene, Pathway, and Taxonomy data collections: bridging biology and chemistry through Target-Centric Views of PubChem data. J. Mol. Biol., 434, 167514.
3. Kim,S., Thiessen,P.A., Bolton,E.E., Chen,J., Fu,G., Gindulyte,A., Han,L.Y., He,J.E., He,S.Q., Shoemaker,B.A. et al. (2016) PubChem Substance and Compound databases. Nucleic Acids Res., 44, D1202–D1213.
4. Wang,Y.L., Bryant,S.H., Cheng,T.J., Wang,J.Y., Gindulyte,A., Shoemaker,B.A., Thiessen,P.A., He,S.Q. and Zhang,J. (2017) PubChem BioAssay: 2017 update. Nucleic Acids Res., 45, D955–D963.
5. Bento,A.P., Gaulton,A., Hersey,A., Bellis,L.J., Chambers,J., Davies,M., Kruger,F.A., Light,Y., Mak,L., McGlinchey,S. et al. (2014) The ChEMBL bioactivity database: an update. Nucleic Acids Res., 42, D1083–D1090.
6. Wang,Y., Bryant,S.H., Cheng,T., Wang,J., Gindulyte,A., Shoemaker,B.A., Thiessen,P.A., He,S. and Zhang,J. (2017) PubChem BioAssay: 2017 update. Nucleic Acids Res., 45, D955–D963.
7. Gilson,M.K., Liu,T., Baitaluk,M., Nicola,G., Hwang,L. and Chong,J. (2016) BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res., 44, D1045–D1053.
8. Koscielny,G., An,P., Carvalho-Silva,D., Cham,J.A., Munoz-Pomer Fuentes,A., Fumis,L., Gasparyan,R., Hasan,S., Karamanis,N., Maguire,M. et al. (2016) Open Targets: a platform for therapeutic target identification and validation. Nucleic Acids Res., 45, D985–D994.
9. Tym,J.E., Mitsopoulos,C., Coker,E.A., Razaz,P., Schierz,A.C., Antolin,A.A. and Al-Lazikani,B. (2016) canSAR: an updated cancer research and drug discovery knowledgebase. Nucleic Acids Res., 44, D938–D943.
10. Knox, C., Wilson, M., Klinger, C. M., Franklin, M., Oler, E., Wilson, A., Pon, A., Cox, J., Chin, N. E. L., Strawbridge, S. A., Garcia-Patino, M., Kruger, R., Sivakumaran, A., Sanford, S., Doshi, R., Khetarpal, N., Fatokun, O., Doucet, D., Zubkowski, A., Rayat, D. Y., … Wishart, D. S. (2024). DrugBank 6.0: the DrugBank Knowledgebase for 2024. Nucleic acids research, 52(D1), D1265–D1275. https://doi.org/10.1093/nar/gkad976
11. Wishart,D.S., Feunang,Y.D., Guo,A.C., Lo,E.J., Marcu,A., Grant,J.R., Sajed,T., Johnson,D., Li,C., Sayeeda,Z., et al. (2018) DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res., 46, D1074–D1082.
12. Irwin, J. J., Tang, K. G., Young, J., Dandarchuluun, C., Wong, B. R., Khurelbaatar, M., Moroz, Y. S., Mayfield, J., & Sayle, R. A. (2020). ZINC20-A Free Ultralarge-Scale Chemical Database for Ligand Discovery. Journal of chemical information and modeling, 60(12), 6065–6073. https://doi.org/10.1021/acs.jcim.0c00675
13. Groom, C. R., Bruno, I. J., Lightfoot, M. P., & Ward, S. C. (2016). The Cambridge Structural Database. Acta crystallographica Section B, Structural science, crystal engineering and materials, 72(Pt 2), 171–179. https://doi.org/10.1107/S2052520616003954
14. Ursu,O., Holmes,J., Knockel,J., Bologa,C.G., Yang,J.J., Mathias,S.L., Nelson,S.J. and Oprea,T.I. (2016) DrugCentral: online drug compendium. Nucleic Acids Res., 45, D932–D939.
15. Sorin Avram, Thomas B Wilson, Ramona Curpan, Liliana Halip, Ana Borota, Alina Bora, Cristian G Bologa, Jayme Holmes, Jeffrey Knockel, Jeremy J Yang, Tudor I Oprea, DrugCentral 2023 extends human clinical data and integrates veterinary drugs, Nucleic Acids Research, Volume 51, Issue D1, 6 January 2023, Pages D1276–D1287, https://doi.org/10.1093/nar/gkac1085
16. Gfeller, D., Michielin, O., & Zoete, V. (2012). SwissSidechain: a molecular and structural database of non-natural sidechains. Nucleic acids research, 41(D1), D327-D332.
17. Rose,P.W., Beran,B., Bi,C., Bluhm,W.F., Dimitropoulos,D., Goodsell,D.S., Prlic,A., Quesada,M., Quinn,G.B., Westbrook,J.D. et al. (2011) The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Res., 39, D392–D401.
18. Dunkel,M., Fullbeck,M., Neumann,S. and Preissner,R. (2006) SuperNatural: a searchable database of available natural compounds. Nucleic Acids Res., 34, D678–D683.
19. Banerjee,P., Erehman,J., Gohlke,B.-O., Wilhelm,T., Preissner,R. and Dunkel,M. (2015) Super Natural II– a database of natural products. Nucleic Acids Res., 43, D935–D939.
20. Abel,R., Paredes Ramos,M., Chen,Q., Pérez-Sánchez,H., Coluzzi,F., Rocco,M., Marchetti,P., Mura,C., Simmaco,M., Bourne,P.E. et al. (2020) Computational prediction of potential inhibitors of the main protease of SARS-CoV-2. Front. Chem., 8, 590263.
21. Attia,Y.A., Alagawany,M.M., Farag,M.R., Alkhatib,F.M., Khafaga,A.F., Abdel-Moneim,A.-M.E., Asiry,K.A., Mesalam,N.M., Shafi,M.E., Al-Harthi,M.A. et al. (2020) Phytogenic products and phytochemicals as a candidate strategy to improve tolerance to coronavirus. Front. Vet. Sci., 7, 573159.
22. Gallo, K., Kemmler, E., Goede, A., Becker, F., Dunkel, M., Preissner, R., & Banerjee, P. (2023). SuperNatural 3.0—a database of natural products and natural product-based derivatives. Nucleic Acids Research, 51(D1), D654-D659.
23. Sorokina, M., Merseburger, P., Rajan, K., Yirik, M. A., & Steinbeck, C. (2021). COCONUT online: collection of open natural products database. Journal of Cheminformatics, 13(1), 2.


