Introduction to Chemoinformatics and Basic Concepts

Emre Can Buluz
Apr 28, 2025

Updated: Jun 10, 2025

Chemoinformatics is a branch of science that combines the fundamental principles of chemistry with computer science to analyze the structural properties, biological effects, and interactions of molecules. It plays an important role in areas such as modern drug design, materials science, and environmental chemistry, and provides powerful tools for managing and analyzing large data sets. Its applications range from the creation of molecular databases to the discovery of new compounds using artificial intelligence and machine learning techniques.

Figure 1. Image represents the role of chemistry and computer science in modern drug design, materials science, and environmental chemistry. (Image generated using AI)

In 1998, Dr. Brown defined chemoinformatics as follows: “The use of information technology and management has become a critical part of the drug discovery process. Chemoinformatics is the mixing of those information resources to transform data into information and information into knowledge for the intended purpose of making better decisions faster in the area of drug lead identification and optimization.” (1).

History of Chemoinformatics

In 1946, King and colleagues published a paper demonstrating that the band spectra of asymmetric rotors could be generated using IBM’s business accounting machines (2). This was accomplished by evaluating the mathematical equations for line positions and line intensities. The work was one of the earliest attempts to apply computing technology to chemistry, and for this reason 1946 can be considered the birth year of chemoinformatics. Early pioneering work in the field focused on converting printed chemical data collections, such as mass spectra and the chemical literature, into electronic formats and on developing the related database search systems. The first method for coding spectral data on punched cards in order to search mass spectra was described by Zemany in 1950 (3). In 1951, Kuentzel developed punched-card systems for storing IR spectra and finding spectral matches using sorters or collators (4).

One of the important algorithms in chemoinformatics, the substructure matching algorithm, was described by Ray and Kirsch in 1957 (5). It is based on a connection-table representation of chemical structures and a backtracking search. This algorithm, which is still widely used, is often referred to as "atom-by-atom searching". In 1959, Opler and Baird described the first graphical display of chemical structures as computer output on a cathode-ray tube (CRT) screen (6).
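The backtracking idea behind substructure matching can be illustrated with a short Python sketch. It is not Ray and Kirsch's original implementation: the data layout (a list of element symbols plus a neighbor map), the function name `find_substructure`, and the decision to ignore bond orders and hydrogens are all simplifying assumptions made here for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Assumed layout: (element symbols by atom index, neighbor map by atom index).
Molecule = Tuple[List[str], Dict[int, List[int]]]


def find_substructure(query: Molecule, target: Molecule) -> Optional[Dict[int, int]]:
    """Return one mapping {query atom -> target atom}, or None if there is no match."""
    q_atoms, q_bonds = query
    t_atoms, t_bonds = target
    mapping: Dict[int, int] = {}

    def compatible(q: int, t: int) -> bool:
        # Elements must agree and the target atom must still be unused.
        if q_atoms[q] != t_atoms[t] or t in mapping.values():
            return False
        # Every already-mapped neighbor of q must be bonded to t in the target.
        return all(mapping[n] in t_bonds[t] for n in q_bonds[q] if n in mapping)

    def extend(q: int) -> bool:
        if q == len(q_atoms):      # every query atom has been placed
            return True
        for t in range(len(t_atoms)):
            if compatible(q, t):
                mapping[q] = t
                if extend(q + 1):
                    return True
                del mapping[q]     # backtrack and try the next candidate
        return False

    return dict(mapping) if extend(0) else None


# Heavy-atom skeleton of ethanol (C-C-O) and a C-O query fragment.
ethanol: Molecule = (["C", "C", "O"], {0: [1], 1: [0, 2], 2: [1]})
c_o: Molecule = (["C", "O"], {0: [1], 1: [0]})

print(find_substructure(c_o, ethanol))
```

The search first tries to place the query carbon on target atom 0, finds that no oxygen is bonded to it, backtracks, and then succeeds with target atom 1. Production systems add bond-order checks and screening steps that this sketch leaves out.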

In the 1960s, significant progress was made in the development of chemical database access systems. In 1964, several reports were published on the use of computer techniques to search the ASTM database, which contained thousands of IR spectra (7). In 1965, the British government's Atomic Weapons Research Establishment (AWRE) initiated a project to create a worldwide database of mass spectra, which led to the establishment of the Mass Spectrometry Data Centre (MSDC) at Aldermaston. A few years later, the Laboratory of Chemistry of the US National Institutes of Health (NIH) began developing a computer-based library search system built on the mass spectral databases of the MSDC and a database belonging to Professor Biemann of MIT (8). Another important development was the establishment in 1965 of a new database of crystal structures, the Cambridge Structural Database (CSD), in Cambridge, England (9). The CSD became the first and most important source of experimental three-dimensional structure data. For example, many 3D structure generators, such as CONCORD and CORINA, derived their templates from this database. The CSD has played an increasingly important role in solving challenging problems in structural chemistry and drug design.

One of the most exciting pioneering areas of research in chemoinformatics in the 1960s was the exploration of artificial intelligence and its applications in chemistry, particularly chemical expert systems. An expert system is an artificial intelligence program that performs a specific task requiring expertise; it attempts to solve narrowly scoped problems by imitating human experts. The first expert system in chemistry was developed in the DENDRAL project, which started in 1965. The project arose from Lederberg’s work at Stanford University on the scope of structural isomerism and the representation of chemical structures by mathematical models (10). Its aim was to develop an expert system that automatically determines the structure of an unknown compound from its mass spectrum.

In 1971, Dr. Walter Hamilton established the Protein Data Bank (PDB) at Brookhaven National Laboratory. The PDB is a database containing experimentally determined crystallographic data (3D structures) of biological macromolecules such as proteins and nucleic acids (11). In 1974, Gund and coworkers described a system for three-dimensional structure searching, which provided the basis for the further development of 3D structure search systems. In 1972, Erni and Clerc reported the first system that allowed searches of combined spectral databases of 1H NMR, IR, and mass spectra. In 1973, Kwok and coworkers described the STIRS system for deriving substructures from mass spectral searches (12). In 1979, Dubois and Bonnet described the DARC Pluridata system for 13C NMR databases (13).

In 1973, Adamson and Bush investigated using the number of common substructure fragments to measure the similarity of two chemical structures (14). The method was later extended to QSAR studies by Carhart and colleagues in 1985 and to database searches by Willett and colleagues in 1986 (15). In the 1980s, Willett's team reported studies of pharmacophoric pattern matching in files of three-dimensional chemical structures. These studies covered the selection of interatomic distance screens (16), the evaluation of search performance (17), and the comparison of geometric search algorithms (18). In 1987, Brint and Willett published work on algorithms for identifying three-dimensional maximal common substructures.

In 1992, Brown and colleagues reported studies on using a hyperstructure model to represent a group of chemical structures in order to improve substructure search performance (19). In the early 1990s, Artymiuk and coworkers published studies examining three-dimensional structural similarities using graph-theoretical techniques (20). These studies included the analysis of structural similarities between leucine aminopeptidase and carboxypeptidase A in 1992 and between the ribonuclease H and connection domains of HIV reverse transcriptase and the ATPase fold in 1993 (21). In 1992, Dalby and coworkers described a series of chemical structure file formats developed over the years by Molecular Design Limited (later Elsevier MDL) to store and transfer chemical structure information (22). In 1993, Martin and coworkers described the first pharmacophore mapping system, called DISCO (23). In the same year, Grindley and colleagues reported a method for identifying tertiary structure similarities in proteins using a maximal common subgraph isomorphism algorithm (24). In 1997, Allen and Hoy reported the first knowledge-base library, IsoStar (25), developed by the Cambridge Crystallographic Data Centre (CCDC). IsoStar contained comprehensive and systematic information on noncovalent interactions derived from the CSD and the Brookhaven Protein Data Bank (PDB), along with selected interaction energies calculated by ab initio molecular orbital methods. IsoStar can be used in rational drug design and crystal engineering applications.

Several freely accessible databases were launched in the 2000s. In 2004, the NIH released a public chemical structure database called PubChem (26). PubChem provides information on the biological activities of small molecules and is part of the NIH's Molecular Libraries Roadmap Initiative. In 2005, Irwin and Shoichet reported a database called ZINC (27). This free database contains 2.7 million commercially available compounds that have been prepared for use in molecular docking programs. ZINC is an important resource for scientists searching for lead compounds in drug discovery. In the same year, Girke and colleagues described a compound mining database called ChemMine (28). ChemMine aims to facilitate drug and agrochemical discovery and chemical genomics screens.

Basic Concepts in Chemoinformatics

Chemical Information Representation: The general name for methods that express chemical compounds in a machine-readable form. These representations encode the atoms, bonds, geometry, and other chemical properties of molecules in digital formats (29).
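As a rough illustration of a machine-readable representation, the sketch below stores a molecule as a connection table, meaning a list of atoms plus a list of bonds. The class name `ConnectionTable` and its fields are hypothetical and chosen only for this example; real file formats carry much more information (coordinates, charges, stereochemistry).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ConnectionTable:
    atoms: List[str]                    # element symbol for each atom index
    bonds: List[Tuple[int, int, int]]   # (atom i, atom j, bond order)

    def neighbors(self) -> Dict[int, List[int]]:
        """Build an adjacency map from the bond list."""
        adjacency: Dict[int, List[int]] = {i: [] for i in range(len(self.atoms))}
        for i, j, _order in self.bonds:
            adjacency[i].append(j)
            adjacency[j].append(i)
        return adjacency


# Acetic acid, heavy atoms only. The same structure in SMILES line notation is CC(=O)O.
acetic_acid = ConnectionTable(
    atoms=["C", "C", "O", "O"],
    bonds=[(0, 1, 1), (1, 2, 2), (1, 3, 1)],
)

print(acetic_acid.neighbors())  # {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
```

A line notation such as SMILES is compact and convenient for storage, while a connection table makes the graph structure explicit for algorithms that walk atoms and bonds.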

Molecular Descriptors: Mathematical quantities that summarize the chemical structure, properties, and behavior of a molecule in numerical or symbolic form. Descriptors are used to model the relationships between chemical structure and physicochemical properties, biological activities, or other molecular properties (30).
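Two of the simplest descriptors, molecular weight and heavy-atom count, can be computed directly from a molecular formula. The following is a minimal sketch under stated assumptions: the formula is passed as an element-to-count dictionary, the function names are invented for this example, and the small table of atomic weights covers only four elements.

```python
from typing import Dict

# Standard atomic weights rounded to three decimals (only a few elements included).
ATOMIC_WEIGHT: Dict[str, float] = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(formula: Dict[str, int]) -> float:
    """Sum of atomic weights over all atoms in the formula."""
    return sum(ATOMIC_WEIGHT[element] * count for element, count in formula.items())


def heavy_atom_count(formula: Dict[str, int]) -> int:
    """Number of non-hydrogen atoms."""
    return sum(count for element, count in formula.items() if element != "H")


ethanol = {"C": 2, "H": 6, "O": 1}  # C2H6O

print(round(molecular_weight(ethanol), 3))
print(heavy_atom_count(ethanol))
```

Descriptors used in practice range from such simple counts to topological indices and 3D properties, but they all follow the same pattern of mapping a structure to numbers that a model can consume.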

Chemical Databases: Digital platforms where chemical compounds, biomolecules or related information are stored, organized and presented to researchers. These databases provide access to chemical structures, biological activities, physicochemical properties and other information (31).

Chemical Space: An abstract representation of all possible chemical compounds, that is, the set of all molecular structures that could in principle be formed (32).

Chemical Diversity: The degree of structural and property variation within a set of compounds, and a feature that researchers target when discovering new compounds. In drug design especially, a wide variety of compounds increases the chance of finding candidates that act on new biological targets, which opens up more discovery opportunities and potential therapies (33).

Chemical Similarity: Measures how similar two or more molecules are in terms of structural or physicochemical properties. It is calculated with mathematical metrics such as the Tanimoto coefficient (34).
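For two molecules described by sets of structural features, the Tanimoto coefficient is the number of shared features divided by the number of distinct features in either molecule: T = c / (a + b - c), where a and b are the feature counts of each molecule and c is the number in common. The sketch below assumes fingerprints are given as sets of "on" bit positions. The bit values are made up for illustration, and returning 0.0 for two empty fingerprints is a convention chosen here, not a universal rule.

```python
from typing import Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


# Hypothetical fingerprints: each integer marks a structural feature that is present.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8, 9, 12}

print(tanimoto(fp_a, fp_b))  # 3 shared bits / 6 distinct bits = 0.5
```

The value ranges from 0 (no features in common) to 1 (identical fingerprints), which makes it convenient for ranking database compounds against a query.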

Chemical Fragments: The smaller, structurally meaningful parts into which molecules can be separated (35).

ADMET Properties: Parameters used to evaluate the pharmaceutical potential of a molecule. A stands for absorption, D for distribution, M for metabolism, E for elimination/excretion, and T for toxicity (36).

Pharmacophore: The set of steric and electronic features that enable a molecule to interact with its biological target (37).

QSAR/QSPR (Quantitative Structure-Activity/Property Relationships): Methods that model the mathematical relationship between the chemical structure of a molecule and its biological activity (QSAR) or physicochemical properties (QSPR) (38).
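In its simplest form, a QSAR model is a linear relationship between one descriptor and an activity value, activity ≈ intercept + slope × descriptor, fitted by ordinary least squares. The sketch below shows only that basic case. The helper name `fit_line` and the four data points are hypothetical and do not come from any real assay; practical QSAR work uses many descriptors, more data, and proper validation.

```python
from typing import List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical training set: one descriptor value and one activity value per compound.
descriptor = [1.0, 2.0, 3.0, 4.0]
activity = [5.1, 5.9, 7.2, 7.8]

intercept, slope = fit_line(descriptor, activity)

# Predict the activity of a new compound whose descriptor value is 2.5.
predicted = intercept + slope * 2.5
print(round(intercept, 2), round(slope, 2), round(predicted, 2))
```

A QSPR model has the same shape; only the target changes from a biological activity to a physicochemical property such as solubility or boiling point.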

References

1. Chen, W. L. Chemoinformatics: Past, Present, and Future. J. Chem. Inf. Model. 2006, 46 (6), 2230-2255. https://doi.org/10.1021/ci060016u

2. King, G. W.; Cross, P. C.; Thomas, G. B. The Asymmetric Rotor. III. Punched-Card Methods of Constructing Band Spectra. J. Chem. Phys. 1946, 14, 35-42. https://doi.org/10.1063/1.1724059

3. Zemany, P. D. Punched Card Catalog of Mass Spectra Useful in Qualitative Analysis. Anal. Chem. 1950, 22, 920-922. https://doi.org/10.1021/ac60043a021

4. Kuentzel, L. E. New Codes for Hollerith-Type Punched Cards. Anal. Chem. 1951, 23, 1413-1418. https://doi.org/10.1021/ac60058a016

5. Ray, L. C.; Kirsch, R. A. Finding Chemical Records by Digital Computers. Science 1957, 126, 814-819. https://doi.org/10.1126/science.126.3278.814

6. Opler, A.; Baird, N. Display of Chemical Structural Formulas as Digital Computer Output. Am. Doc. 1959, 10, 59-63.

7. Sparks, R. A. Storage and Retrieval of Wyandotte-ASTM Infrared Spectral Data Using an IBM 1401 Computer; ASTM: Philadelphia, PA, 1964.

8. Heller, S. R. Mass Spectrometry Databases and Search Systems. In Computer-Supported Spectroscopic Databases; Zupan, J., Ed.; Ellis Horwood Limited: New York, 1986; Chapter 6, pp 118-132.

9. Allen, F. H.; Hoy, V. J. Cambridge Structural Database. In The Encyclopedia of Computational Chemistry; Schleyer, P. v. R., Allinger, N. L., Clark, T., Gasteiger, J., Kollman, P. A., Schaefer, H. F., Schreiner, P. R., Eds.; J. Wiley & Sons: Chichester, 1998; pp 155-167.

10. Lederberg, J. Topological Mapping of Organic Molecules. Proc. Natl. Acad. Sci. U.S.A. 1965, 53, 134-139. https://doi.org/10.1073/pnas.53.1.134

11. Sussman, J. L., Lin, D., Jiang, J., Manning, N. O., Prilusky, J., Ritter, O., & Abola, E. E. (1998). Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta crystallographica. Section D, Biological crystallography, 54(Pt 6 Pt 1), 1078–1084. https://doi.org/10.1107/s0907444998009378

12. Kwok, K.-S.; Venkataraghavan, R.; McLafferty, F. W. Computer-Aided Interpretation of Mass Spectra. III. Self-Training Interpretive and Retrieval System. J. Am. Chem. Soc. 1973, 95, 4185-4194. https://doi.org/10.1021/ja00794a014

13. Dubois, J. E.; Bonnet, J. C. The DARC Pluridata System: The 13C NMR Data Bank. Anal. Chim. Acta 1979, 112, 245-252. https://doi.org/10.1016/S0003-2670(01)83553-4

14. Adamson, G. W.; Bush, J. A. A Method for the Automatic Classification of Chemical Structures. Inf. Storage Retriev. 1973, 9, 561-568. https://doi.org/10.1016/0020-0271(73)90059-4

15. Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom Pairs as Molecular Features in Structure-Activity Studies: Definition and Applications. J. Chem. Inf. Comput. Sci. 1985, 25, 64-73. https://doi.org/10.1021/ci00046a002

16. Jakes, S. E.; Willett, P. Pharmacophoric Pattern Matching in Files of Three-Dimensional Chemical Structures. Selection of Interatomic Distance Screens. J. Mol. Graphics 1986, 4, 12-20. https://doi.org/10.1016/0263-7855(86)80088-1

17. Jakes, S. E.; Watts, N.; Willett, P.; Barden, D.; Fisher, J. D. Pharmacophoric Pattern Matching in Files of 3D Chemical Structures: Evaluation of Search Performance. J. Mol. Graphics 1987, 5, 41-48. https://doi.org/10.1016/0263-7855(87)80044-9

18. Brint, A. T.; Willett, P. Pharmacophoric Pattern Matching in Files of Three-Dimensional Chemical Structures: Comparison of Geometric Searching Algorithms. J. Mol. Graphics. 1987, 5, 49-56. https://doi.org/10.1016/0263-7855(87)80045-0

19. Brown, R. D.; Downs, G. M.; Willett, P.; Cook, A. P. F. A Hyperstructure Model for Chemical Structure Handling: Generation and Atom-by-atom Searching of Hyperstructures. J. Chem. Inf. Comput. Sci. 1992, 32, 522-531. https://doi.org/10.1021/ci00009a020

20. Artymiuk, P. J., Grindley, H. M., Park, J. E., Rice, D. W., & Willett, P. (1992). Three-dimensional structural resemblance between leucine aminopeptidase and carboxypeptidase A revealed by graph-theoretical techniques. FEBS letters, 303(1), 48–52. https://doi.org/10.1016/0014-5793(92)80475-v

21. Artymiuk, P. J., Grindley, H. M., Kumar, K., Rice, D. W., & Willett, P. (1993). Three-dimensional structural resemblance between the ribonuclease H and connection domains of HIV reverse transcriptase and the ATPase fold revealed using graph theoretical techniques. FEBS letters, 324(1), 15–21. https://doi.org/10.1016/0014-5793(93)81523-3

22. Dalby, A.; Nourse, J. G.; Hounshell, W. D.; Gushurst, A. K. I.; Grier, D. L.; Leland, B. A.; Laufer, J. Description of Several Chemical Structure File Formats Used by Computer Programs Developed at Molecular Design Limited. J. Chem. Inf. Comput. Sci. 1992, 32, 244-255. https://doi.org/10.1021/ci00007a012

23. Martin, Y. C., Bures, M. G., Danaher, E. A., DeLazzer, J., Lico, I., & Pavlik, P. A. (1993). A fast new approach to pharmacophore mapping and its application to dopaminergic and benzodiazepine agonists. Journal of computer-aided molecular design, 7(1), 83–102. https://doi.org/10.1007/BF00141577

24. Grindley, H. M., Artymiuk, P. J., Rice, D. W., & Willett, P. (1993). Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. Journal of molecular biology, 229(3), 707–721. https://doi.org/10.1006/jmbi.1993.1074

25. Allen, F. H.; Hoy, V. J. Cambridge Structural Database. In The Encyclopedia of Computational Chemistry; Schleyer, P. v. R., Allinger, N. L., Clark, T., Gasteiger, J., Kollman, P. A., Schaefer, H. F., Schreiner, P. R., Eds.; J. Wiley & Sons: Chichester, 1998; pp 155-167.

26. Bolton, E. E., Wang, Y., Thiessen, P. A., & Bryant, S. H. (2008). PubChem: integrated platform of small molecules and biological activities. In Wheeler, R. A., & Spellmeyer, D. C. (Eds.), Annual Reports in Computational Chemistry, Vol. 4, pp. 217–241. Elsevier, Amsterdam. https://doi.org/10.1016/S1574-1400(08)00012-1

27. Irwin, J. J., & Shoichet, B. K. (2005). ZINC--a free database of commercially available compounds for virtual screening. Journal of chemical information and modeling, 45(1), 177–182. https://doi.org/10.1021/ci049714+

28. Girke, T., Cheng, L. C., & Raikhel, N. (2005). ChemMine. A compound mining database for chemical genomics. Plant physiology, 138(2), 573–577. https://doi.org/10.1104/pp.105.062687

29. Warr, W. A. (2011). Representation of chemical structures. Wiley Interdisciplinary Reviews: Computational Molecular Science, 1(4), 557-579. https://doi.org/10.1002/wcms.36

30. Consonni, V., & Todeschini, R. (2010). Molecular descriptors. Recent advances in QSAR studies: methods and applications, 29-102. https://doi.org/10.1007/978-1-4020-9783-6_3

31. Wishart, D. S. (2007). Introduction to cheminformatics. Current protocols in bioinformatics, Chapter 14. https://doi.org/10.1002/0471250953.bi1401s18

32. Reymond, J. L., Van Deursen, R., Blum, L. C., & Ruddigkeit, L. (2010). Chemical space as a source for new drugs. MedChemComm, 1(1), 30-38. https://doi.org/10.1039/C0MD00020E

33. Pearlman, R. S., & Smith, K. M. (2002). Novel software tools for chemical diversity. In 3D QSAR in Drug Design: Ligand-Protein Interactions and Molecular Similarity (pp. 339-353). Dordrecht: Springer Netherlands. https://doi.org/10.1007/0-306-46857-3_18

34. Willett, P., Barnard, J. M., & Downs, G. M. (1998). Chemical similarity searching. Journal of chemical information and computer sciences, 38(6), 983-996. https://doi.org/10.1021/ci9800211

35. Chakravarti S. K. (2018). Distributed Representation of Chemical Fragments. ACS omega, 3(3), 2825–2836. https://doi.org/10.1021/acsomega.7b02045

36. Norinder, U., & Bergström, C. A. (2006). Prediction of ADMET Properties. ChemMedChem, 1(9), 920–937. https://doi.org/10.1002/cmdc.200600155

37. Yang S. Y. (2010). Pharmacophore modeling and applications in drug discovery: challenges and recent advances. Drug discovery today, 15(11-12), 444–450. https://doi.org/10.1016/j.drudis.2010.03.013

38. Liu, P., & Long, W. (2009). Current mathematical methods used in QSAR/QSPR studies. International journal of molecular sciences, 10(5), 1978–1998. https://doi.org/10.3390/ijms10051978