Among the issues to be considered when applying AI techniques are: the complexity of the model, as roughly defined by the type of its discriminatory function (e.g., linear vs. non-linear) and the number of free parameters to be optimized; the design of appropriate (representative and non-redundant) training and control sets; and careful validation of the results. Examples of parameters to be optimized are the weights of connections between the nodes (“neurons”) in NNs, or the coefficients defining a separating hyperplane in the case of SVMs. Typically, free parameters are optimized with the goal of minimizing the misclassification error on the training set. Other criteria may involve estimates of generalization capabilities in order to avoid over-fitting. The importance of choosing an appropriate model and representation for the problem at hand is illustrated here using several well-established problems and examples of applications of AI methods in structural bioinformatics. Structural bioinformatics deals primarily with protein and other macromolecular structures and involves, for example, protein structure and function prediction. Prediction methods developed in structural bioinformatics are therefore relevant for drug design and discovery and are being integrated with modeling and docking techniques.

AI Techniques for Structural Bioinformatics

Finding representations capable of capturing the underlying principles and correlations is critical for the success of any AI technique. To illustrate this point, let us revisit some classical problems in structural bioinformatics, such as the prediction of the secondary structure of an amino acid residue in a protein. In fact, secondary structure prediction was one of the first successful applications of machine learning techniques in the field of protein structure prediction. In their pioneering work, Rost and Sander demonstrated the importance of the multiple alignment representation and used a neural network to train a successful classifier capable of assigning each residue to one of three classes (helix, beta strand or coil) with over 70% classification accuracy. Subsequent developments pushed the accuracy of secondary structure prediction to about 80%, which proved to be sufficient for many important applications, e.g., protein folding simulations and prediction of protein 3D structure. At the same time, however, these studies showed clearly that the multiple alignment representation is far more important than the type of classifier used. In particular, differences in accuracy between top-performing methods, based on NNs, SVMs or HMMs, are not statistically significant. Since then, many different pattern recognition and machine learning techniques have been devised to improve protein structure and function prediction. For example, AI-based protein annotation protocols are being used to provide proteome-wide prediction and annotation of membrane domains, solvent accessibility, protein-protein interactions, post-translational modifications and other attributes that can be used to facilitate drug design and discovery.
As a further illustration of some of the issues arising when applying machine learning and AI techniques, let us consider the problem of predicting which amino acid residues can undergo phosphorylation due to the enzymatic activity of protein kinases. Kinases are targeted by many rational drug design efforts, since phosphorylation plays an important role in cancer and other diseases by modulating the structure and function of specific proteins, e.g., by affecting their interactions with co-factors. The computational prediction of phosphorylation and other post-translational modification sites, both from the structure and from the primary amino acid sequence, is an active field of research.

Examples of methods for phosphorylation site prediction are NetPhos, an NN-based predictor; Scansite, a sequence-motif-based predictor; and DISPHOS, which uses indicators of intrinsic disorder in addition to sequence information. It should also be noted that some sites labeled as “negative” may simply not (yet) have been reported as phosphorylated, suggesting the importance of the so-called “one-class” machine learning protocols (using positive examples only) for prediction of phosphorylation. Next, we specifically address some of the above considerations in the context of the drug design field.

Representation and Model Selection for Drug Design

Learning from examples may be simple if relations between attributes and outcomes are almost linear. However, in drug design the number of parameters that may influence biological activity is typically high and the relations may be strongly non-linear. Moreover, with a limited amount of experimental data, the predictive ability of statistical learning systems may suffer from the “curse of dimensionality”: if m points are sufficient to obtain a reasonable approximation in each dimension, then m^N points are required in N dimensions. For example, if m = 10 points are sufficient to approximate the relation for each parameter, then for just N = 20 parameters the number of examples required is 10^20. With a small number of examples, a very large number of N-parameter functions may fit the data perfectly. This problem is addressed in computational learning theory, which provides methods for selecting appropriate data models.
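The growth of the required sample size with dimension can be checked with a few lines of Python. This is only an arithmetic illustration of the m^N estimate given above; the function name `required_points` is a hypothetical helper, not part of any library.

```python
def required_points(m: int, n_dims: int) -> int:
    """Examples needed if m points per dimension suffice: m ** N."""
    if m < 1 or n_dims < 0:
        raise ValueError("m must be >= 1 and n_dims must be >= 0")
    return m ** n_dims


# With m = 10 points per parameter, the requirement grows tenfold
# with every additional parameter.
for n in (1, 2, 5, 10, 20):
    print(f"N = {n:2d}: {required_points(10, n):.0e} examples")
```

Even at N = 5 the requirement already exceeds the size of most experimental activity datasets, which is why model selection and regularization matter in this setting.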

The choice of an appropriate model is also related to the learning algorithm used and the representation of data. For example, model selection is very important for QSAR approaches that rely on explicitly solving the underlying (multiple) regression. As mentioned before, information about chemical compounds and other structured objects may be better represented in the form of labeled graphs rather than vectors. One common approach to representing molecular structures is to use “fingerprints”, i.e., long bit strings (100-1000 bits) encoding yes/no answers about the presence or absence of various features, including substructures within the molecular structure of a chemical compound. For each atom of the chemical compound, a depth-first search for substructures may be used. Each bit set to one in the string represents the presence of a particular substructure in the search tree; alternatively, several bits are assigned to substructures using hashing techniques. “Molecular holograms” are a variant of fingerprinting techniques that use integers to denote the number of fragments of a particular type. Some chemical databases include parameters to generate useful fingerprint strings.
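The hashed variant of fingerprinting described above can be sketched as follows. This is a toy illustration under stated assumptions, not the scheme of any particular database or toolkit: molecules are given as a list of element symbols plus a list of bonded index pairs, bond orders and hydrogens are ignored, substructures are limited to linear paths of at most `max_len` atoms, and CRC32 is used as the hash. The names `path_fingerprint` and `to_bit_string` are hypothetical.

```python
import zlib


def path_fingerprint(atoms, bonds, n_bits=64, max_len=3):
    """Return the set of 'on' bit positions of a hashed path fingerprint.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) atom-index pairs (bond orders ignored here)
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    on_bits = set()

    def dfs(path):
        labels = [atoms[i] for i in path]
        # A path and its reverse describe the same substructure.
        key = min("-".join(labels), "-".join(reversed(labels)))
        # Hash the substructure label into one of n_bits positions.
        on_bits.add(zlib.crc32(key.encode()) % n_bits)
        if len(path) < max_len:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    dfs(path + [nxt])

    # Depth-first search for paths starting from every atom.
    for start in range(len(atoms)):
        dfs([start])
    return on_bits


def to_bit_string(on_bits, n_bits=64):
    return "".join("1" if i in on_bits else "0" for i in range(n_bits))


ethanol = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
dimethyl_ether = path_fingerprint(["C", "O", "C"], [(0, 1), (1, 2)])
print(to_bit_string(ethanol))
print(to_bit_string(dimethyl_ether))
```

Replacing the set of bit positions with a dictionary of counts per position would give a hologram-style variant, since each entry would then record how many fragments hashed to that position.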

The Tanimoto distance between pairs of bit strings generated by molecular fingerprinting is widely used to measure similarity in database searches. However, recent calculations on five biological activity classes showed a strong influence of compound class-specific effects on the results. Although molecular fingerprinting captures some structural similarities, representations of chemical structures based on it lack information about the geometry and global properties of the molecule. On the other hand, coupling the universal encoding with classification tasks should increase the efficacy in this context. Other approaches to measuring chemical similarity have been reviewed elsewhere. Chemical information may also be encoded into kernel functions K(R,S) in SVMs. These kernels effectively measure the similarity between structures R and S. Several such kernels have been proposed recently, based on adjacency matrices, sequences of labels in subgraphs, or the number of shared walks in subgraphs; they may also be combined with molecular fingerprinting. Many variants of such kernels have been constructed and have proved to be quite useful in screening for drugs that inhibit cancer cell growth. Trees and directed acyclic graphs may also be used to represent chemical entities, often in conjunction with sequential processing, fragment after fragment, by neural nodes with recurrent connections.
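For reference, the Tanimoto coefficient between two fingerprints is the number of bits set in both divided by the number of bits set in either, and the corresponding distance is one minus that value. A minimal sketch, assuming fingerprints are stored as collections of “on” bit positions (the function names and example fingerprints are illustrative only):

```python
def tanimoto_similarity(a, b):
    """Tanimoto coefficient between two collections of 'on' bit positions."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        # Convention assumed here: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / union


def tanimoto_distance(a, b):
    return 1.0 - tanimoto_similarity(a, b)


fp_r = {1, 4, 7, 9}   # made-up fingerprint of structure R
fp_s = {1, 4, 8}      # made-up fingerprint of structure S
print(tanimoto_similarity(fp_r, fp_s))  # 2 shared bits / 5 bits in total
```

The same quantity is sometimes used directly as a similarity kernel K(R,S) on fingerprints, which is one simple way of combining fingerprinting with SVMs.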

Recurrent connections are needed to preserve information about graph fragments that have already been processed. Each node s of the graph has some features X associated with it, and the function implemented by a neuron depends on W·X + h(X), where the extra term h(X) represents the sum of weighted outputs from nodes that are connected to s. Such generalized recurrent neurons are used to transform input graphs G into vectors X, and conventional NNs are used in this space for the classification. To use this approach for drug design, all molecular structures must be decomposed into fragments represented as directed acyclic graphs (DAGs). The number of recurrent connections is equal to the maximum number of bonds for a given atom represented by the node. However, the decomposition process is not unique and depends on the starting point and orientation (for covalent bonds) chosen initially.

Inductive Logic Programming (ILP) is another approach that may be used to address some of the issues in model selection, especially in the context of symbolic attributes of molecular systems. ILP uses an expressive language and enables construction of new, interesting features. However, due to computational costs and other difficulties in learning, ILP-based models did not show a clear advantage over QSAR in applications to mutagenicity, carcinogenicity, toxicology and biodegradability. Summary and bibliography of ILP applications to 3D protein structure, discovery of neuropeptides, chemical compound structure elucidation, diagnosis, mutagenesis, pharmacophore discovery for ACE inhibition, carcinogenicity, toxicology and other areas may be found at
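Returning to the recurrent encoding of graphs described above, the following sketch shows how a node state of the form W·X + h(X) can be computed recursively over a DAG. It is an illustration under assumptions that are not taken from the source: the activation is tanh, h(X) is implemented as one weight matrix per child position, and the weights are arbitrary untrained numbers. All names (`encode_node`, `W`, `V`, `features`, `children`) are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode_node(node, features, children, W, V, cache):
    """State of a node: tanh(W·x + sum over k of V[k]·state(child_k))."""
    if node in cache:          # shared sub-fragments are encoded only once
        return cache[node]
    total = matvec(W, features[node])
    for k, child in enumerate(children.get(node, [])):
        child_state = encode_node(child, features, children, W, V, cache)
        contribution = matvec(V[k], child_state)
        total = [t + c for t, c in zip(total, contribution)]
    cache[node] = [math.tanh(t) for t in total]
    return cache[node]


# Toy DAG: node 0 is the root with two children (1 and 2).
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
children = {0: [1, 2]}
W = [[0.5, -0.2], [0.1, 0.3]]                 # input weights (arbitrary)
V = [[[0.4, 0.0], [0.0, 0.4]],                # weights for child slot 0
     [[0.2, 0.1], [-0.1, 0.2]]]               # weights for child slot 1

graph_vector = encode_node(0, features, children, W, V, cache={})
print(graph_vector)  # fixed-length vector that a conventional NN could classify
```

The list `V` has one matrix per child slot, mirroring the statement that the number of recurrent connections equals the maximum number of bonds per atom.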


QSAR Approach to Drug Design

In the past, drug identification was done largely through random experimentation, which was clearly not very effective. Furthermore, the mechanism of action of a successful drug would typically remain obscure. In the 1960s, alternative approaches to drug design began to be explored, with a focus on Quantitative Structure-Activity Relationships (QSAR). The idea behind QSAR approaches is to use the known responses (activities) of simple compounds (structures) to predict the responses of complex compounds, made from different combinations of basic modules. Only compounds predicted to have the desired properties would then be tested. QSAR typically relies on electronic, hydrophobic and steric attributes of a molecule, as well as structural, quantum mechanical and other descriptors. A classical example is Hammett’s study of the effect that different substituents have on the ionization of the carboxylic group in benzoic acid. The effect on ionization was measured in terms of the equilibrium (dissociation) constant for the reaction. Hammett discovered that equilibrium constants were higher for more electron-withdrawing substituents. This led to the formulation of linear free energy relationships (LFERs) that motivated the subsequent development of many similar approaches for quantitatively describing structure-activity relationships.
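In its usual textbook form, the Hammett relationship can be written as below. The notation is the conventional one and is not taken from the text above: K_X and K_H are the dissociation constants of the substituted and unsubstituted acid, σ_X is the substituent constant (positive for electron-withdrawing groups) and ρ is the reaction constant, which equals 1 by definition for the ionization of benzoic acids in water at 25 °C.

```latex
\log_{10}\frac{K_X}{K_H} = \rho\,\sigma_X ,
\qquad
\Delta G^{\circ}_X - \Delta G^{\circ}_H = -2.303\,R\,T\,\rho\,\sigma_X
```

The second expression follows from ΔG° = −RT ln K and shows why the relationship is called a linear free energy relationship: the change in free energy is linear in the substituent constant.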

Hydrophobicity and lipophilicity play a vital role in many biological processes, especially in receptor-ligand interactions. Therefore, QSAR methods often employ hydrophobicity-related descriptors. Another important set of QSAR descriptors are those related to steric effects, such as the molar refraction (MR) index, various parameters accounting for the shape of a compound, and descriptors indicating the presence or absence of certain structural features. Another QSAR approach that has gained a lot of momentum is the use of quantum-chemical descriptors, which can account, at least in principle, for other properties, both electronic and geometric. The increase in the number of parameters required the use of AI approaches to obtain correlations between the molecular and other attributes and the observed activities. A typical QSAR study would involve Hammett’s constants, partition coefficients, molar refractivity and many other descriptors. Statistical and machine learning techniques, such as multiple linear regression (MLR), principal component analysis (PCA) or partial least squares (PLS), would then be used to solve the problem. It should be mentioned that MLR is still one of the most widely used AI techniques in QSAR studies.
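As a concrete illustration of the MLR step, the sketch below fits activity as a linear function of three descriptors by solving the normal equations in plain Python. The descriptor values are made up for the example, and the activities are generated from a known linear formula so that the fit can be verified; none of these numbers come from a real study. The function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("descriptors are collinear; system is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_mlr(descriptors, activities):
    """Least-squares coefficients [intercept, b1, b2, ...] (normal equations)."""
    rows = [[1.0] + list(d) for d in descriptors]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, activities)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Synthetic toy data: columns stand for (sigma, logP, MR); values are invented.
descriptors = [(0.0, 1.0, 1.0), (0.2, 1.5, 1.2), (0.7, 0.5, 0.8),
               (-0.2, 2.0, 1.5), (0.4, 2.5, 1.0), (0.1, 0.8, 2.0)]
true_coeffs = [1.0, 0.8, 0.5, -0.1]  # assumed "ground truth" for the demo
activities = [true_coeffs[0] + sum(c * d for c, d in zip(true_coeffs[1:], row))
              for row in descriptors]

coefficients = fit_mlr(descriptors, activities)
print([round(c, 6) for c in coefficients])
```

With real data the number of candidate descriptors often approaches or exceeds the number of compounds, in which case the normal equations become ill-conditioned; this is one reason PCA and PLS are used alongside MLR.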

Oftentimes, the structure of the target protein is not known. In such cases, the potential drugs may be analyzed using experimental techniques, and common structural features called pharmacophores may be identified. Such pharmacophore-based (or ligand-based) methods include, for instance, the 3D-QSAR techniques. Examples of popular 3D-QSAR methods are the comparative molecular field analysis (CoMFA), the comparative molecular similarity indices analysis (CoMSIA) and GRID. More comprehensive lists of representative methods and programs within this category are available in the literature. The basic idea behind CoMFA is that the biological activity of molecules is related to their electrostatic and steric interactions. The molecules (ligands) that are being studied are aligned structurally on a 3D grid. Using a probe atom, electrostatic and steric fields are determined at every point in the grid. CoMSIA, on the other hand, also takes into account hydrophobic parameters. The descriptors obtained in this way are then analyzed using statistical methods, such as partial least squares (PLS), to obtain correlations between activity and the fields, leading to a 3D-QSAR model of the ligand.
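The field calculation at the heart of a CoMFA-style analysis can be sketched as follows. This is a simplified illustration, not the CoMFA implementation: it assumes a single set of Lennard-Jones parameters (`epsilon`, `r_min`) for all atoms, a probe with unit positive charge, Coulomb's law with a dielectric constant of 1, and an energy cutoff of 30 kcal/mol. The ligand coordinates and partial charges are invented, and the function name `probe_fields` is hypothetical.

```python
import itertools
import math

COULOMB_CONSTANT = 332.06  # kcal·Å/(mol·e²), electrostatic conversion factor


def probe_fields(atoms, grid_points, probe_charge=1.0,
                 epsilon=0.1, r_min=3.4, cutoff=30.0):
    """Return a (steric, electrostatic) energy pair for every grid point.

    atoms: list of (x, y, z, partial_charge) for the aligned ligand
    grid_points: list of (x, y, z) probe positions
    """
    fields = []
    for point in grid_points:
        steric = 0.0
        electrostatic = 0.0
        for x, y, z, charge in atoms:
            # Clamp the distance to avoid division by zero on top of an atom.
            r = max(math.dist(point, (x, y, z)), 1e-3)
            ratio6 = (r_min / r) ** 6
            steric += epsilon * (ratio6 * ratio6 - 2.0 * ratio6)  # 12-6 term
            electrostatic += COULOMB_CONSTANT * probe_charge * charge / r
        # Truncate large values so grid points inside atoms do not dominate.
        fields.append((min(steric, cutoff),
                       max(-cutoff, min(electrostatic, cutoff))))
    return fields


# Toy two-atom "ligand" on a 5 x 5 x 5 grid with 2 Å spacing.
ligand = [(0.0, 0.0, 0.0, -0.4), (1.2, 0.0, 0.0, 0.4)]
axis = [-4.0, -2.0, 0.0, 2.0, 4.0]
grid = list(itertools.product(axis, axis, axis))
fields = probe_fields(ligand, grid)
print(len(fields), "grid points; first point:", fields[0])
```

Flattening the per-point energies of each aligned ligand into one long row gives the descriptor matrix that PLS then correlates with activity.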

GRID is similar to CoMFA and may also be used to determine the interaction energies between the probe and the ligand. In addition, GRID can also be used to calculate hydrogen bonding energies. 3D-QSAR methods have been employed to design anti-HIV-1 drugs, matrix metalloproteinase inhibitors, therapeutic agents for Alzheimer’s and Parkinson’s diseases and anti-tuberculosis agents, to name a few. Furthermore, 3D-QSAR has been applied along with molecular modeling and molecular dynamics in the design of pteridine-derived therapeutic agents, indolomorphinan derivatives, and in vaccinology studies. A detailed review on the applications of CoMFA and CoMSIA has been presented by Bordas et al. In 4D-QSAR, the fourth dimension represents an ensemble of conformations, orientations, or protonation states for each molecule. This reduces the bias that may come from the ligand alignment, but requires identification of the most likely bioactive conformation and orientation (or protonation state), frequently obtained using evolutionary algorithms.

The 5D-QSAR approach carries this one step further, allowing for changes in the receptor binding pocket and ligand topology. Adding solvation effects leads to 6D-QSAR, which allowed, in combination with flexible docking, for relatively accurate identification of the endocrine-disrupting potential associated with a drug candidate.

AI Methods in QSAR

Ligands may be represented by multiple structural and other descriptors. Thus, selection of key descriptors is an important step in any QSAR study. Another important step is the identification of patterns (predictive fingerprints or combinations of features) that correlate with activity. Furthermore, compounds exhibiting promising properties may be compared with other candidates in order to identify other potential drugs that share critical features. It is therefore evident that AI approaches to feature selection, pattern recognition, classification and clustering can be applied to the problems posed above. In fact, various clustering methods (e.g., hierarchical divisive clustering, hierarchical agglomerative clustering, nonhierarchical clustering and multi-domain clustering) have been applied to such problems. For example, clustering receptor proteins based on their structural similarity has been shown to improve docking studies and drug design.

Applications of clustering techniques and genetic algorithms to the prediction of molecular interactions have been reviewed, as has the role of feature selection in QSAR. Furthermore, a method called consensus k-nearest neighbor (kNN) QSAR has been developed for predicting estrogenic activity. The concept behind this approach is that the activity of a compound can be estimated by averaging the activities of its k nearest neighbors. Multiple models, each making use of a different set of descriptors, are then used to make the consensus prediction. Many other studies have made use of machine learning techniques to address similar problems. In particular, NNs have been widely used to solve many problems in drug design. A comprehensive review of the applications of NNs in a variety of QSAR problems has been presented by Winkler. The review discusses how NNs can be applied to the prediction of physicochemical, toxicological and pharmacokinetic parameters. In another study, the NN methods were compared with statistical approaches. Self-organizing maps (SOMs) have been used for studying molecular diversity and employed in drug design.
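The consensus kNN idea mentioned above can be sketched in a few lines. This is a generic illustration of the concept (average over k nearest neighbors, then average over models built on different descriptor subsets), not a reproduction of the published method; Euclidean distance, unweighted averaging, the toy data and the function names are all assumptions made for the example.

```python
import math


def knn_predict(train_x, train_y, query, k, columns):
    """Mean activity of the k training compounds nearest to `query`,
    measured only on the descriptor columns listed in `columns`."""
    def distance(a, b):
        return math.sqrt(sum((a[c] - b[c]) ** 2 for c in columns))

    ranked = sorted(range(len(train_x)),
                    key=lambda i: distance(train_x[i], query))
    nearest = ranked[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def consensus_knn(train_x, train_y, query, k, descriptor_subsets):
    """Average the predictions of several kNN models, one per descriptor subset."""
    predictions = [knn_predict(train_x, train_y, query, k, cols)
                   for cols in descriptor_subsets]
    return sum(predictions) / len(predictions)


# Invented toy data: three descriptors per compound, one activity value each.
train_x = [(0.0, 0.0, 1.0), (0.1, 0.2, 0.9), (1.0, 1.0, 0.0), (0.9, 1.1, 0.1)]
train_y = [1.0, 1.2, 3.0, 3.2]
query = (0.05, 0.1, 0.95)
subsets = [(0, 1), (1, 2), (0, 2)]   # each model sees a different descriptor pair

prediction = consensus_knn(train_x, train_y, query, k=2, descriptor_subsets=subsets)
print(prediction)
```

In practice the descriptor subsets and the value of k would themselves be chosen by cross-validation rather than fixed by hand as they are here.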

In particular, the SOM-based method of comparative molecular surface analysis (CoMSA) has been presented in detail. In recent years, SVMs have become relatively widely used. For example, Zhao et al. made use of SVMs for predicting toxicity and found that this method yielded improved performance compared to multiple linear regression and radial basis function NNs. A method called least squares support vector machine (LSSVM) was employed to screen calcium channel antagonists in a QSAR study. SVMs were also used (providing accuracies competitive with those of other QSAR approaches) to predict oral absorption in humans from molecular structure descriptors and to calculate the activity of certain enzyme inhibitors, as well as in many other investigations of a similar type. Furthermore, evolutionary QSAR techniques that employ genetic algorithms are being developed for docking and related studies. One example is the Multi-objective genetic QSAR (MoQSAR) that has been used to study neuronal nicotinic acetylcholine ion channel receptors (nAChRs).

Other examples involve the use of genetic algorithms for prediction of binding affinities of receptor ligands and in classification-based SAR. Another technique similar to evolutionary methods and inspired by biological phenomena is Particle Swarm Optimization (PSO). This technique has been employed in many QSAR studies for biomarker selection. Bayesian networks have also been used to solve different problems in the context of drug design.

Duch, W., Swaminathan, K., & Meller, J. (2007). Artificial intelligence approaches for rational drug design and discovery. Current pharmaceutical design, 13(14), 1497–1508.