Machine learning in chemoinformatics and drug discovery

Chemoinformatics is an established discipline focusing on extracting, processing, and extrapolating meaningful data from chemical structures. With the rapid explosion of chemical ‘big’ data from high-throughput screening (HTS) and combinatorial synthesis, machine learning has become an indispensable tool for drug designers to mine chemical information from large compound databases and to design drugs with important biological properties. This review first covers the multiple processing layers of the chemoinformatics pipeline and then introduces the machine learning models commonly used in drug discovery and QSAR analysis.

Machine learning is currently one of the most important and rapidly evolving topics in computer-aided drug discovery. In contrast to physical models that rely on explicit physical equations, such as quantum chemistry or molecular dynamics simulations, machine learning approaches use pattern recognition algorithms to discern mathematical relationships between empirical observations of small molecules and extrapolate them to predict chemical, biological, and physical properties of novel compounds. In comparison to physical models, machine learning techniques are also more efficient and can easily be scaled to big datasets without the need for extensive computational resources. One of the primary application areas for machine learning in drug discovery is helping researchers understand and exploit structure-activity relationships (SAR), the relationships between chemical structures and their biological activities.

For instance, given a hit compound from a drug screening campaign, researchers might wish to know how its chemical structure can be optimized to improve its binding affinity, biological responses, or physicochemical properties. Fifty years ago, this type of problem could only be addressed through numerous costly, time-consuming, labor-intensive cycles of medicinal chemistry synthesis and analysis. Today, modern machine learning techniques can be used to model quantitative structure-activity relationships (QSAR) or quantitative structure-property relationships (QSPR), and to develop artificial intelligence programs that accurately predict in silico how chemical modifications might influence biological behavior.

Many physicochemical properties of drugs, such as toxicity, metabolism, drug-drug interactions, and carcinogenesis, have been effectively modeled by QSAR techniques. Early QSAR models, such as Hansch and Free–Wilson analysis, used simple multivariate regression models to correlate potency (logIC50) with substructure motifs and chemical properties such as solubility, hydrophobicity (logP), substituent pattern, and electronic factors. Although groundbreaking and successful, these approaches were ultimately limited by the scarcity of experimental data and by the linearity assumption made in modeling. Therefore, advanced chemoinformatics and machine learning techniques capable of modeling nonlinear datasets, as well as big data of increasing depth and complexity, are needed.

Chemoinformatics is a broad field that encompasses computer science and chemistry with the goal of utilizing computer information technology to solve problems in the field of chemistry, such as chemical information retrieval and extraction, compound database searching, and molecular graph mining. Other areas of chemoinformatics related to drug discovery include computer-aided drug synthesis (a very broad field with >50 years’ history), chemical space exploration, pharmacophore and scaffold analysis, and library design, among others. Converting a compound structure into chemical information applicable to machine learning tasks requires multilayer computational processing, from chemical graph retrieval, descriptor generation, and fingerprint construction to similarity analysis. Each layer builds on the successful development of the previous layers and often has a substantial impact on the quality of the chemical data for machine learning.

Chemical graph theory

To understand how the structures of chemicals influence their biological activities, it is imperative to review the foundations of chemical graph theory.

A chemical graph, also known as a ‘molecular graph’ or ‘structural graph’, is a mathematical construct comprising an ordered pair G = (V,E), where V is a set of vertices (atoms) connected by a set of edges (bonds) E. Chemical graph theory maintains that, because chemical structures are fully specified by their graph representations, they contain the information necessary to model and provide insight into a wide range of biological phenomena. Several variations of chemical graphs have been proposed. Weighted chemical graphs assign values to edges and vertices to indicate bond lengths and other atomic properties. Chemical pseudographs or reduced graphs use multiple edges and self-loops to capture detailed bond valence information. Regardless of flavor, chemical graphs represent atomic connectivity using a bond adjacency matrix or a topological distance matrix, which supports the computation of several topological indices useful for chemoinformatics modeling. Garcia-Domenech et al. demonstrated the application of chemical graphs for chemometric analysis. In their study, they proposed an equation that combined the pseudograph vertex degree derived from the adjacency matrix with two key parameters from the complete graph to model the electronegativity of 30 elements from the main group of the periodic table. More recently, Fourches and Tropsha developed the advanced dataset graph analysis (ADDAGRA) approach. In this work, they combined multiple graph indices from bond connectivity matrices to compare and quantify chemical diversity for large compound sets using chemical space networks in high-dimensional space.
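As a minimal sketch of the matrices mentioned above, the following Python builds an adjacency matrix and a topological distance matrix for a small hydrogen-suppressed graph and derives one classic topological index (the Wiener index) from it. The function names and the n-butane example are illustrative assumptions, not taken from any of the cited studies, and the graph is assumed to be connected.

```python
from collections import deque

def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from a list of (i, j) bonds."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i][j] = adj[j][i] = 1
    return adj

def distance_matrix(adj):
    """Topological distances (number of bonds on the shortest path) via BFS.

    Assumes a connected graph; unreachable pairs would stay None.
    """
    n = len(adj)
    dist = [[None] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and dist[start][v] is None:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist

def wiener_index(dist):
    """Sum of shortest-path distances over all unordered atom pairs."""
    n = len(dist)
    return sum(dist[i][j] for i in range(n) for j in range(i + 1, n))

# Heavy-atom graph of n-butane: C0-C1-C2-C3 (hydrogens omitted).
butane_adj = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])
butane_dist = distance_matrix(butane_adj)
butane_wiener = wiener_index(butane_dist)
```

A weighted chemical graph could be sketched the same way by storing bond lengths or bond orders in the matrix instead of 0/1 entries.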

The study showed that the ADDAGRA approach could uncover shared chemical space between chemical databases to improve SAR analysis.

Chemical descriptors are numerical features extracted from chemical structures for molecular data mining, compound diversity analysis, and compound activity prediction. Chemical descriptors can be one-dimensional (0D or 1D), 2D, 3D, or 4D. One-dimensional descriptors are scalars that describe aggregate information such as atom counts, bond counts, molecular weight, sums of atomic properties, or fragment counts. Although simple to compute, 1D descriptors suffer from degeneracy, where distinct compounds are mapped to identical values of a given descriptor. Thus, 1D descriptors are usually used in concert with higher-dimensional descriptors or expressed as a vector of multiple 1D descriptors.

2D chemical descriptors are the most frequently reported descriptor type in the literature and include topological indices, molecular profiles, and 2D autocorrelation descriptors. An important feature of 2D descriptors, which makes them useful for structure differentiation, is graph invariance: descriptor values are unaffected by the renumbering of graph nodes (vertices). To facilitate analysis of the large space of 2D descriptors, Hong et al. reported the Mold2 system, which rapidly generates up to 200 types of 2D descriptors for large compound datasets. Other commercial software packages commonly used in descriptor generation include the DRAGON system, which can generate up to 5000 types of descriptors and has been used in several QSAR studies [20,21].

3D chemical descriptors extract chemical features from 3D coordinate representations and are considered the most sensitive to structural variations. Well-known 3D descriptors include autocorrelation descriptors, substituent constants, surface/volume descriptors, and quantum-chemical descriptors. 3D chemical descriptors are useful for identifying ‘scaffold hops’, that is, distinct chemical scaffolds with similar binding activities.
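The degeneracy and graph-invariance points above can be illustrated with a small sketch. It compares a 1D descriptor (heavy-atom count) with a simple 2D topological index, the first Zagreb index (the sum of squared vertex degrees), for two C4H10 isomers. The choice of index, the helper names, and the example molecules are assumptions made for illustration only.

```python
def degrees(n_atoms, bonds):
    """Vertex degree of every atom in a hydrogen-suppressed graph."""
    deg = [0] * n_atoms
    for i, j in bonds:
        deg[i] += 1
        deg[j] += 1
    return deg

def first_zagreb_index(n_atoms, bonds):
    """2D topological index: sum of squared vertex degrees."""
    return sum(d * d for d in degrees(n_atoms, bonds))

def renumber(bonds, permutation):
    """Relabel atoms according to a mapping old_index -> new_index."""
    return [(permutation[i], permutation[j]) for i, j in bonds]

# Two isomers with the same heavy-atom count (a degenerate 1D descriptor).
n_butane = [(0, 1), (1, 2), (2, 3)]    # linear chain
isobutane = [(0, 1), (0, 2), (0, 3)]   # branched, central carbon is atom 0
atom_counts = (4, 4)

# The 2D index separates the two isomers.
zagreb_n_butane = first_zagreb_index(4, n_butane)
zagreb_isobutane = first_zagreb_index(4, isobutane)

# Graph invariance: renumbering the atoms leaves the index unchanged.
shuffled = renumber(isobutane, {0: 2, 1: 0, 2: 3, 3: 1})
zagreb_shuffled = first_zagreb_index(4, shuffled)
```

Here both isomers share the atom count 4, while the index evaluates to 10 for n-butane and 12 for isobutane, before and after renumbering.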

A key limitation of 3D chemical descriptors in QSAR analysis is the computational complexity of conformer generation and structure alignment, neither of which guarantees that the predicted conformations correspond to relevant bioactive conformations. 4D chemical descriptors are an extension of 3D chemical descriptors that simultaneously consider multiple structural conformations. Ash and Fourches applied molecular dynamics simulation to ERK2 kinase to compute 3D descriptors over a grid box based on a 20 ns trajectory and showed that such 4D chemical descriptors can effectively differentiate the most active ERK2 inhibitors from the inactive ones with superior enrichment rates.

Chemical fingerprints are high-dimensional vectors, commonly used in chemometric analysis and similarity-based virtual screening applications, the elements of which are chemical descriptor values. MACCS substructure fingerprints are 2D binary fingerprints (0 and 1), with each of 166 bits indicating the presence or absence of a particular substructure key. Daylight fingerprints and extended connectivity fingerprints (ECFP) extract chemical patterns of up to a specified length or diameter from a chemical graph. In comparison to the predefined substructure keys of MACCS, these fingerprints index features dynamically using hash functions and often yield higher specificity when searching complex structures.
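The contrast between predefined keys and hashed features can be sketched as follows. This is a toy illustration only: the feature strings and the four-key dictionary are hypothetical stand-ins for real substructure perception, and `zlib.crc32` is used merely as a convenient deterministic hash, not as the hash used by Daylight or ECFP implementations.

```python
import zlib

def keyed_fingerprint(features, keys):
    """Key-based fingerprint: one bit per predefined substructure key."""
    present = set(features)
    return [1 if key in present else 0 for key in keys]

def hashed_fingerprint(features, n_bits=64):
    """Hashed fingerprint: each feature is folded onto a bit by a hash.

    No key dictionary is needed, but different features may collide
    on the same bit.
    """
    bits = [0] * n_bits
    for feature in features:
        index = zlib.crc32(feature.encode("utf-8")) % n_bits
        bits[index] = 1
    return bits

# Hypothetical key dictionary and perceived features for one compound.
HYPOTHETICAL_KEYS = ["aromatic_ring", "carboxylic_acid", "amide", "halogen"]
compound_features = ["aromatic_ring", "carboxylic_acid", "ester"]

keyed = keyed_fingerprint(compound_features, HYPOTHETICAL_KEYS)
hashed = hashed_fingerprint(compound_features)
```

In this sketch the keyed fingerprint silently drops the "ester" feature because no key exists for it, whereas the hashed fingerprint still assigns it a bit.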

The latest developments in 2D fingerprints are continuous kernel and neural embedded fingerprints, which are internal representations learned by support vector machines (SVMs) and neural networks. Duvenaud et al. extended the convolution concept to molecules represented as 2D molecular graphs in order to extract molecular representations. The architecture generalizes the fingerprint computation such that the representation can be learned via backpropagation in a data-driven manner, and it improves predictions of solubility, drug efficacy, and organic photovoltaic efficiency.

3D fingerprints commonly used in 3D-QSAR studies include chemical features based on pharmacophoric patterns, surface properties, molecular volumes, or molecular interaction fields. One of the best-known 3D fingerprints is the molecular interaction field (MIF), as implemented in the GRID program by Goodford. The MIF-based fingerprint places the ligand in a rectangular grid with a fixed interval and calculates the electronic, steric, and hydrophobic contributions independently at each grid point. The resulting MIF-based fingerprints can then be used in comparative molecular field analysis (CoMFA) by deriving relationships between 3D grid points and compound activities. The dependency on the relative orientation of the molecules within the grid box is a major limitation of 3D-QSAR techniques such as CoMFA. To remove the dependency on ligand orientation in 3D-QSAR analysis, Baskin and Zhokhova recently introduced the continuous molecular field (CMF) approach, which replaces grid points with continuous functions to represent molecular fields, and showed that its simplest form provides comparable or enhanced predictive performance relative to state-of-the-art CoMFA methods.

Chemical similarity analysis

Similarity search is a fundamental technique for ligand-based drug discovery. Its objective is to identify and return database compounds with structures and bioactivities similar to those of query compounds. The chemical similarity principle, which states that compounds with similar structures will probably have similar bioactivities, is an underlying assumption of similarity-based virtual screening. However, this assumption might not always be valid. For example, ‘activity cliffs’, where a minor modification of functional groups causes an abrupt change in activity, violate this principle and can cause QSAR models to fail. The structural similarity of two molecules is most commonly evaluated by computing the Tanimoto coefficient (Tc) of their chemical fingerprints. The Tc, also known as the Jaccard index, is a measure of similarity between sets: for binary fingerprints, it is the number of bits set in both fingerprints divided by the number of bits set in at least one of them. High Tc values indicate that two compounds are similar but do not reveal along which dimensions they are similar, such as which specific chemical groups they share.
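A minimal sketch of the Tanimoto coefficient on binary fingerprints is given below. The two 8-bit fingerprints are made-up examples, and returning 0.0 for two all-zero fingerprints is a convention chosen here for illustration; toolkits may handle that edge case differently.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) coefficient of two equal-length bit vectors."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(bits_a, bits_b) if a and b)
    either = sum(1 for a, b in zip(bits_a, bits_b) if a or b)
    # Convention assumed here: two empty fingerprints have similarity 0.
    return both / either if either else 0.0

# Toy fingerprints: 3 bits are set in both, 5 bits are set in at least one.
fp_query = [1, 1, 0, 1, 0, 0, 1, 0]
fp_hit = [1, 0, 0, 1, 0, 1, 1, 0]
score = tanimoto(fp_query, fp_hit)
```

The score here is 3/5 = 0.6. Note that the single number does not say which bits, and therefore which chemical groups, are shared.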

Chemical similarity can also be evaluated based on 3D structural features of compounds. The 3D Tanimoto index is a common 3D similarity metric that computes the fraction of molecular volume shared by the two ligands being compared. Examples of volume-based similarity implementations include the rapid overlay of chemical structures (ROCS) program, the most popular shape similarity approach in drug discovery, which is based on a Gaussian representation of molecular shape. An alternative 3D similarity metric is pharmacophoric similarity, which considers only the volume overlap between crucial functional groups. Lo et al. developed the ShapeAlign program, which combines 2D and 3D metrics based on the Open Babel FP2 fingerprint, shapes, and pharmacophoric points for unsupervised 3D chemical similarity clustering. The validation study, using 20 known drug classes retrieved from the Directory of Useful Decoys (DUD), showed that the combined metrics outperformed either 2D or 3D metrics alone and successfully detected shared 3D features between several structurally distinct HIV reverse transcriptase (HIVRT) inhibitors. A concept related to pharmacophoric similarity is molecular field similarity, as implemented in the FieldAlign tool (from Cresset), which uses energetic probes to identify similar ligands that might not have explicit structural overlap. Recently, Ferreira and Couto developed a new similarity measure called chemical semantic similarity to classify chemical compounds based on their semantic characterization, such as drug annotation in the ChEMBL database.

The study showed that comparing compounds by their functional roles improved predictions of several drug properties by complementing existing compound classification systems.

Analog analysis seeks to characterize chemical transformations, which are defined over pairs of molecules. Recently, the matched molecular pairs (MMP) formalism has emerged as a way to define a specific type of transformation or relationship, namely non-ring single-bond substitutions, and to facilitate the development of methods for indexing and searching analog relationships. The fragment indexing algorithm developed by Hussain and Rea is currently the most widely used MMP search method but does not support similarity search. Rensi and Altman developed a method for computing the similarity of chemical transformations using Tanimoto kernel embedded fingerprints and extended a fuzzy search capability to the MMP framework. They demonstrated the capability to query MMP relationships at multiple levels of contextual abstraction, with stable results over dataset sizes spanning more than four orders of magnitude from 103 high-impact pharmacological targets.

Machine learning models in QSAR

Machine learning techniques can be broadly classified as supervised or unsupervised learning. For supervised learning, labels are assigned to the training data and, once trained, the model can predict labels for new data inputs. Supervised machine learning models include regression analysis, k-nearest neighbor (kNN), Bayesian probabilistic learning, SVMs, random forests, and neural networks. A special case of supervised learning is semisupervised or transductive learning, in which a small amount of labeled data is mixed with unlabeled data during training to improve learning accuracy on small and unbalanced datasets. Unsupervised machine learning techniques learn underlying patterns of molecular features directly from unlabeled data. Unsupervised methods include dimensionality reduction techniques such as principal components analysis (PCA) and independent components analysis (ICA), as well as several supervised methods that can also support unsupervised learning, such as SVMs, probabilistic graphical models, and neural networks. Clustering algorithms represent another family of unsupervised algorithms, in which the dataset is first partitioned using a predefined distance metric in high-dimensional space and labels are later assigned based on the number of observed categories. Modern machine learning offers a powerful suite of techniques for exploring nonlinear SAR relationships with high accuracy and precision.

Naive Bayes

Naive Bayes classifiers are probabilistic models based on Bayes’ rule [56–58]. They estimate the probability that a given item of data belongs to a certain label by combining the prior probability distribution (the priors), which represents the relative proportions of labels in the training set, with the likelihood of the observed features under each label. The ‘naive’ assumption is that the features are conditionally independent given the label, so their individual likelihoods can simply be multiplied. A well-known example of this approach is the PASS program for predicting drug activities. In the PASS program, the priors are first established for a set of biologically active compounds based on the proportions of chemical substructures in the active and inactive classes. Then, a variant of naive Bayes is used to estimate the activity of query structures using the prior probability distribution. Chen et al. demonstrated the efficiency of naive Bayes classifiers in large-scale virtual screens for important pharmacological properties such as cytochrome P450 inhibition, human plasma protein binding, and bioavailability in animal models (Rattus norvegicus).
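To make the idea concrete, here is a minimal Bernoulli naive Bayes sketch over binary fingerprints, written in plain Python. It is a generic textbook formulation with Laplace smoothing, not the PASS algorithm or the method of Chen et al.; the three-bit fingerprints, the labels, and the `alpha` smoothing parameter are illustrative assumptions.

```python
import math

def train_naive_bayes(fingerprints, labels, alpha=1.0):
    """Estimate class priors and per-bit 'on' probabilities per class.

    alpha is a Laplace smoothing constant that avoids zero probabilities.
    """
    n_bits = len(fingerprints[0])
    model = {}
    for c in sorted(set(labels)):
        rows = [fp for fp, y in zip(fingerprints, labels) if y == c]
        prior = len(rows) / len(labels)
        bit_probs = [
            (sum(fp[k] for fp in rows) + alpha) / (len(rows) + 2 * alpha)
            for k in range(n_bits)
        ]
        model[c] = (prior, bit_probs)
    return model

def predict(model, fingerprint):
    """Return the label with the highest log posterior (up to a constant)."""
    scores = {}
    for c, (prior, bit_probs) in model.items():
        log_score = math.log(prior)
        # Conditional independence: per-bit log likelihoods simply add up.
        for bit, p in zip(fingerprint, bit_probs):
            log_score += math.log(p if bit else 1.0 - p)
        scores[c] = log_score
    return max(scores, key=scores.get)

# Toy training set: bit 0 is enriched in actives, bit 2 in inactives.
train_fps = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
train_labels = ["active", "active", "inactive", "inactive"]
model = train_naive_bayes(train_fps, train_labels)
prediction = predict(model, [1, 1, 0])
```

Working in log space keeps the product of many small per-bit probabilities numerically stable, which matters for fingerprints with hundreds or thousands of bits.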

Regression analysis

Regression analysis can refer to linear regression modeling for continuous data or logistic regression modeling for categorical data [61]. Given a set of training data points, the goal of linear regression analysis is to find a linear function of a set of predictor variables such that the fitted line minimizes the distances to the data points along the dimensions of a set of outcome variables. Early QSAR techniques like Hansch and Free–Wilson analysis make extensive use of multivariate linear regression. However, correlations between features and high-dimensional feature spaces present challenges for the application of linear regression models in QSAR. Several techniques, such as regularization, dimensionality reduction, and genetic algorithms, are available to combat the twin curses of dimensionality and collinearity, which cause model overfitting and coefficient coupling and thereby confound accuracy and interpretability. L1 regularization methods and evolutionary algorithms shrink the number of variables explicitly by selecting small subsets that are most relevant to the outcome being predicted by the QSAR model. By contrast, L2 regularization methods like Gaussian processes and ridge regression reduce the ‘effective’ number of variables (VC-dimensionality) without changing their actual number. Recently, Algamal et al. demonstrated the utility of an adaptive least absolute shrinkage and selection operator (LASSO) variable selection approach for predicting the anticancer potency of imidazopyridine derivatives. In another study, Helguera et al. used evolutionary variable selection to model the activity and selectivity of monoamine oxidase inhibitors. By contrast, dimensionality reduction techniques such as PCA transform large sets of correlated variables into smaller sets of uncorrelated features.
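The difference between L1 and L2 regularization is easiest to see in the simplest possible setting: a single predictor with no intercept, where both penalized fits have closed forms. The sketch below is an illustration under that simplifying assumption (real QSAR models have many predictors and are fit iteratively); the function names, data, and penalty scaling are choices made here, not taken from the cited studies.

```python
import math

def ols_coefficient(x, y):
    """Unpenalized least-squares slope for y ~ w * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def ridge_coefficient(x, y, lam):
    """Minimizes 0.5 * sum((y - w*x)^2) + 0.5 * lam * w^2 (L2 penalty)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

def lasso_coefficient(x, y, lam):
    """Minimizes 0.5 * sum((y - w*x)^2) + lam * |w| (L1 penalty).

    The solution is the soft-thresholded least-squares numerator.
    """
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    shrunk = max(abs(sxy) - lam, 0.0)
    return math.copysign(shrunk, sxy) / sxx

# Toy centered data with an exact slope of 2.
x = [-1.0, 0.0, 1.0]
y = [-2.0, 0.0, 2.0]

w_ols = ols_coefficient(x, y)
w_ridge = ridge_coefficient(x, y, lam=2.0)
w_lasso_small = lasso_coefficient(x, y, lam=1.0)
w_lasso_large = lasso_coefficient(x, y, lam=5.0)
```

With these numbers the ridge penalty shrinks the slope from 2 to 1 but never to exactly zero, whereas a sufficiently large L1 penalty sets it to exactly zero, which is why LASSO doubles as a variable selection method.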

In a seminal study on QSAR classification, Gao et al. used PCA to decorrelate features for the prediction of estrogen receptor binding. More recently, Rensi and Altman demonstrated performance improvements over LASSO regression for predicting activity against a broad set of pharmacological protein targets using kernel principal components analysis, a nonlinear variant of PCA. Another popular regression method is partial least squares (PLS), which couples dimensionality reduction with multivariate regression to transform predictors into uncorrelated variables that are maximally correlated with the activity or property of interest. Eriksson et al. recommend PLS as a first-line approach to QSAR modeling for its superior efficiency and accuracy relative to explicitly combining unsupervised dimensionality reduction with multivariate regression, and PLS is used extensively in 3D-QSAR. However, the tight coupling of dimensionality reduction and model fitting can limit its utility in unsupervised or semi-supervised problems where knowledge of the outcome variable is missing or incomplete. Although linear regression analysis has been successfully applied to many drug optimization problems, its underlying linearity and vector space assumptions, which are not valid for most QSAR problems, are a significant limitation. Thus, careful selection of the features and model system, although crucial, is sometimes insufficient to ensure the success of linear regression models.

k-Nearest neighbors

In kNN, the data containing labeled and unlabeled nodes are represented in a high-dimensional feature space, and the labels from the closest nodes are transferred to the query using a majority-voting rule. Here, the value k specifies the number of closest neighbors participating in the vote. kNN in ligand-based virtual screening can be thought of as an extension of chemical similarity search to supervised learning, where a chemical similarity metric such as Tc is used as a measure of distance between compounds, and bioactivities are predicted from the top search results. However, there is no principled way of choosing the number of nearest neighbors to use, and values of k that are too high or too low can yield unfavorable false-positive or false-negative rates. This was addressed by the similarity ensemble approach (SEA), which compares chemical similarity values to a randomized background score similar to that used in a BLAST sequence similarity search. Lo et al. proposed another approach for large-scale compound drug-target profiling called chemical similarity network analysis pull-down (CSNAP). Instead of defining nearest neighbor values, the CSNAP approach used a threshold network to cluster compounds based on a predefined Tc cut-off. After an initial clustering step, query compounds were assigned the most probable drug targets by ranking the shared targets among the first-order neighbors. Recently, Huang et al. developed the most-similar ligand-based target inference (MOST) approach, which utilizes the explicit bioactivity of the most-similar ligands to predict targets of the query compound. They showed that the MOST approach could alleviate false-positive predictions associated with the common nearest neighbor similarity search.
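The basic kNN vote with Tanimoto similarity can be sketched in a few lines. The example below is a plain majority-vote classifier, not SEA, CSNAP, or MOST; fingerprints are represented as sets of on-bit indices, and the small labeled library is invented purely to show how the prediction depends on k.

```python
from collections import Counter

def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient for fingerprints stored as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0

def knn_predict(query, library, k=3):
    """Majority vote over the k library compounds most similar to query."""
    ranked = sorted(
        library, key=lambda item: tanimoto(query, item[0]), reverse=True
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled library: (fingerprint on-bits, activity label).
library = [
    ({1, 2, 3, 4}, "active"),
    ({1, 2, 3, 9}, "active"),
    ({1, 2, 7, 8}, "inactive"),
    ({5, 6, 7, 8}, "inactive"),
    ({5, 6, 9, 10}, "inactive"),
]
query = {1, 2, 3, 5}

prediction_k3 = knn_predict(query, library, k=3)
prediction_k5 = knn_predict(query, library, k=5)
```

In this toy library the two closest neighbors are active, so k = 3 predicts "active", while k = 5 lets the three dissimilar inactives outvote them. This is the sensitivity to k described above.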

Random forest is an ensemble learning method in which multiple decision trees are built from the training data and a majority-voting scheme similar to that of kNN is used to make classification or regression predictions for new inputs. Svetnik et al. demonstrated the utility of random forest models in QSAR classification and regression for a number of important pharmacological transporters, targets, and properties such as P-glycoprotein (PGP), cyclooxygenase-2 (COX2), and blood-brain barrier permeability. They achieved accuracy comparable to SVMs and neural networks with superior interpretability.

Support vector machines

SVMs solve the classification problem by using nonlinear kernel functions to map data into a high-dimensional space and then finding an optimally separating hyperplane. The hyperplane is fit to maximize the margin to the support vectors, the points nearest to the decision boundary, and is expressed as a linear combination of data points. Liu et al. used SVMs in a QSAR study of the inhibition of the transcription factors activator protein (AP)-1 and nuclear factor (NF)-κB by ethyl 2-[(3-methyl-2,5-dioxo(3-pyrrolinyl))amino]-4-(trifluoromethyl) pyrimidine-5-carboxylate derivatives.

More recently, Nekoei et al. combined a genetic variable selection approach with SVMs to identify a number of structural features of aminopyrimidine-5-carbaldehyde oxime derivatives that are responsible for strong vascular endothelial growth factor (VEGF)-2 inhibition activity.

Neural networks and deep learning

Artificial neural networks (ANNs) are a family of machine learning algorithms inspired by the operations of neurons in the brain. Each neuron in an ANN receives numerous input signals (analogous to dendrites), performs a weighted sum of the inputs, generates an activation response through a nonlinear activation function (analogous to the cell body), and passes the output signals to subsequently connected neurons (analogous to axons). Multilayer ANNs can be constructed by organizing neurons into different layers and connecting neurons in consecutive layers. The combination of nonlinear units enables ANNs to learn highly complex functions of the inputs. ANNs have been widely applied to all branches of chemoinformatics, including modeling QSAR/QSPR properties of small molecules as well as performing pharmacokinetic and pharmacodynamic analysis. Readers are referred to the work by Baskin et al. for the latest comprehensive review of ANN-based methods in chemoinformatics.
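The neuron computation described above (weighted sum followed by a nonlinear activation) and its stacking into layers can be sketched as a forward pass. The sigmoid activation, the weights, and the three input descriptors are arbitrary illustrative choices; training by backpropagation is deliberately left out.

```python
import math

def sigmoid(z):
    """Logistic activation, squashing any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs followed by a nonlinear activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

def layer(inputs, weight_rows, biases):
    """A layer is a list of neurons that all see the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical input: three scaled molecular descriptors for one compound.
descriptors = [0.5, -1.0, 2.0]

# Hidden layer with two neurons, then a single output neuron.
hidden = layer(descriptors, [[0.2, -0.4, 0.1], [-0.3, 0.8, 0.5]], [0.0, 0.1])
output = neuron(hidden, [1.5, -2.0], 0.2)
```

The output can be read as a predicted probability of activity; in practice the weights would be learned from labeled data rather than fixed by hand.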

Deep learning networks are a recent extension of ANNs, which utilize deep and specialized architectures to learn useful features from raw data. The recent success of deep learning provides an opportunity to develop tools for automatically extracting task-specific representations of chemical structures. Deep convolutional neural networks (CNNs) comprise a subclass of deep learning networks. In CNNs, local filters scan through the input space to search for recurring local spatial patterns that are useful for the classification performance. Owing to the unique local spatial properties of images, CNNs have achieved great success in the computer vision community, and have recently been applied to the biomedical field. For example, Torng and Altman viewed protein structures as ‘3D images’ with four different atom type channels and used 3D-CNNs to analyze amino acid microenvironment similarities and to predict the effects of mutations in proteins.

Graph convolutional networks (GCNs) are variants of CNNs that have been commonly applied to 2D molecular graph analysis. GCNs employ similar concepts of local spatial filters, but operate on graphs, to learn features from graph neighborhoods. Following the first application of GCNs in QSAR analysis by Baskin et al., different graph convolutional architectures for learning small molecule representations have been proposed, each defining local graph neighborhoods and convolution operations in different ways. For example, Duvenaud et al. used different ‘degree filters’ to learn features for nodes with different degrees. Kearnes et al. employed ‘Weave modules’, which integrate information from all atoms and atom pairs to learn molecular features.
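As a rough illustration of what "learning features from graph neighborhoods" means, the sketch below implements one generic graph convolution step with sum aggregation over bonded neighbors, followed by a sum readout over atoms. It is not the architecture of Duvenaud et al., Kearnes et al., or Baskin et al.; the weight matrices are fixed to the identity rather than learned, and the two-element atom features are an assumption made for the example.

```python
def relu(x):
    """Rectified linear activation."""
    return x if x > 0 else 0.0

def mat_vec(matrix, vector):
    """Multiply a small dense matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def graph_conv_layer(features, neighbors, w_self, w_neigh):
    """Update each atom from its own features plus the sum of its neighbors'."""
    updated = []
    for atom, h in enumerate(features):
        aggregated = [0.0] * len(h)
        for nb in neighbors[atom]:
            aggregated = [a + b for a, b in zip(aggregated, features[nb])]
        own = mat_vec(w_self, h)
        nbr = mat_vec(w_neigh, aggregated)
        updated.append([relu(s + n) for s, n in zip(own, nbr)])
    return updated

def readout(features):
    """Order-independent molecule-level vector: sum over all atoms."""
    return [sum(column) for column in zip(*features)]

# Ethanol heavy atoms C0-C1-O2 with one-hot features [is_carbon, is_oxygen].
atom_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned weights

molecule_vector = readout(
    graph_conv_layer(atom_features, neighbors, identity, identity)
)

# Same molecule with atoms renumbered (O0-C1-C2) gives the same vector.
renumbered_vector = readout(
    graph_conv_layer(
        [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]],
        {0: [1], 1: [0, 2], 2: [1]},
        identity,
        identity,
    )
)
```

Because both the neighbor aggregation and the readout are sums, the molecule-level vector does not depend on how the atoms are numbered, mirroring the graph invariance expected of hand-crafted 2D descriptors.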

More recently, Hechtlinger et al. used random walks to define the local neighborhood of each node in a graph.

Recurrent neural networks (RNNs) are another major family of deep neural networks that have been widely used in natural language processing. Long short-term memory (LSTM) networks are a subclass of RNNs that use gated units and memory cells to capture long- and short-term temporal dependencies within input sequences. LSTM networks have been applied to de novo drug design, where the LSTM model is trained to learn ‘grammatical structures’ within SMILES strings and to output novel molecules following the learned rules. Variational autoencoders (VAEs), generative adversarial networks (GANs), and deep reinforcement learning have also been applied to learning latent representations of molecules and to generating new compounds with desired molecular properties.

QSAR modeling

The general protocol for constructing QSAR models for drug discovery has been systematized and consists of several modular steps involving the chemoinformatics and machine learning techniques previously discussed. The first step is ‘molecular encoding’, where the chemical features and properties are derived from chemical structures or from lookup of experimental results. Second, a feature selection step is performed, in which unsupervised learning techniques are used to identify the most relevant properties and reduce the dimensionality of the feature vector. Finally, in the learning phase, a supervised machine learning model is applied to discover an empirical function (either explicitly or implicitly) that achieves an optimal mapping between the input feature vectors and the biological responses. Building an accurate QSAR model also requires careful consideration and selection of the SAR datasets used for training and model validation. This includes strict separation of the training and validation sets used for initial model creation from the test sets used for final model performance evaluation. The performance of QSAR models is commonly evaluated by standard metrics such as sensitivity, specificity, precision, and recall. For unbalanced datasets, the area under the curve (AUC) derived from receiver operating characteristic (ROC) curves can be used. Although 3D-QSAR methods like CoMFA consider structural conformation, the approach necessitates substantial computational resources and is subject to uncertainty arising from conformation prediction, ligand orientation, and structural alignment. Thus, 2D-QSAR models can be competitive with, and sometimes even superior to, 3D-QSAR approaches.
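The evaluation metrics named above can be computed directly from predictions on a held-out test set. The sketch below assumes binary labels (1 = active, 0 = inactive), at least one compound of each class, and at least one positive prediction; the labels, scores, and the 0.5 decision threshold are invented for illustration. The AUC is computed through its rank interpretation, the probability that a randomly chosen active is scored above a randomly chosen inactive.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity (recall), specificity, and precision from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),  # identical to recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

def roc_auc(y_true, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    Tied scores count as half a correct ranking.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical unbalanced test set: 3 actives, 5 inactives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Unlike the thresholded metrics, the AUC does not depend on the 0.5 cut-off, which is one reason it is preferred when actives are rare.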

Concluding remarks and future directions

Machine learning techniques have been widely applied in the field of chemoinformatics to discover and design new drugs with superior biological activities. Mathematical mining of chemical graphs enables the derivation of a constellation of 2D or 3D chemical descriptors, which are packaged as chemical fingerprints for use in a diverse array of machine learning models and predictive tasks. A key area of innovation in the field is the marriage of big data and machine learning to predict wider ranges of biological phenomena. Traditional drug design methods based on simple ligand-protein interactions are no longer sufficient for meeting clinical drug safety criteria. High drug attrition rates from severe side effects often involve biological pathways and systemic responses at higher levels. Consequently, incorporating multiple data types and sources, also known as ‘data fusion’ techniques, which aggregate structural, genetic, and pharmacological data from the molecular to the organism level, will be crucial for the discovery of safer and more effective drugs. Likewise, novel machine learning models capable of processing big data at high volume, velocity, and veracity with great versatility are also needed. Deep learning networks have recently proven to be a promising architecture for efficient learning from massive datasets in modern drug discovery campaigns. Other aspects of machine learning, such as increased interpretability to test mechanistic hypotheses and methods for preventing overfitting, are also important topics that warrant further development in the field of machine-learning-based drug discovery [1].

  1. Lo, Y.-C., Rensi, S. E., Torng, W., & Altman, R. B. (2018). Machine learning in chemoinformatics and drug discovery. Drug Discovery Today, 23(8), 1538–1546. doi:10.1016/j.drudis.2018.05.010

