
Improving Structure-Based Virtual Screening with Ensemble Docking and Machine Learning

One of the main challenges of structure-based virtual screening (SBVS) is incorporating the receptor’s flexibility, because representing it explicitly in every docking run is computationally expensive. A common alternative is the approach known as ensemble docking, which consists of taking a set of receptor conformations and docking the ligands against each of them. However, there is still no agreement on how to combine the ensemble docking results to obtain the final ligand ranking. A common choice is to use consensus strategies to aggregate the ensemble docking scores, but these strategies yield only a slight improvement over the single-structure approach. Here, the authors hypothesize that applying machine learning (ML) methodologies to the ensemble docking results could improve the predictive power of SBVS. To test this hypothesis, four proteins were selected as study cases: CDK2, FXa, EGFR, and HSP90. Protein conformational ensembles were built from crystallographic structures, whereas the evaluated compound library comprised up to three benchmarking data sets (DUD, DEKOIS 2.0, and CSAR-2012) and cocrystallized molecules. Ensemble docking results were processed through 30 repetitions of 4-fold cross-validation to train and validate two ML classifiers: logistic regression and gradient boosting trees. Their results indicate that the ML classifiers significantly outperform traditional consensus strategies and even the best-performing case achieved with single-structure docking. They provide statistical evidence that supports the effectiveness of ML in improving ensemble docking performance.



Structure-based virtual screening (SBVS) is a key component in the early stage of drug discovery. SBVS comprises different in silico methodologies to filter a chemical compound library using the three-dimensional structure of the molecular target of interest (receptor). Commonly, the process consists of docking each compound (candidate ligand) to the receptor, predicting their binding mode, and estimating their binding affinity through a scoring function. Subsequently, ligands are sorted by their binding score to the receptor, and the top-ranked ones are selected as the best candidates for further experimental analysis.


Overview of the methodology workflow (per target protein): data collection, ensemble docking, and 30 × 4cv to implement target-specific ML models over ensemble docking scores and compare their performance against traditional consensus strategies and single-conformation docking. *The CSAR-2012 data set was available only for the CDK2 protein.


To be computationally efficient, traditional docking tends to oversimplify the binding process, which negatively affects the predictive power of SBVS. One of the main drawbacks is that docking simulations usually rely on a single rigid conformation of the receptor, resulting in a poor representation of the receptor’s flexibility. Consequently, multiple approaches have been proposed to mitigate this deficiency. Among these alternatives, ensemble docking is one of the most popular and has been successfully applied in various prospective studies, which supports its use to improve SBVS. Ensemble docking aims to represent the flexibility of the receptor’s binding site through a conformational ensemble. The assumption is that this structural diversity can lead to more accurate binding modes and allow a broader chemical space to be explored. Thus, the most common implementation of ensemble docking consists of docking each ligand to multiple rigid conformations of the receptor.


Results from the 30 × 4cv analysis. (a) AUC-ROC values. (b) NEF values. Violin boxplots showing the distribution of the performance values of each SBVS method across the 30 repetitions. The white points within the boxes indicate the value of the median, and the notches represent the 95% confidence interval around it. Outliers are shown as black points. The max SCP and the avg SCP dashed lines indicate the maximum and the average performance, respectively, achieved by a single conformation using the raw docking scores from the 120 × n validation sets generated during the 30 × 4cv analysis. The translucent gray area surrounding the avg SCP value represents one standard deviation from the average SCP. The dotted black lines indicate the expected performance of a random classifier.


As a result, multiple docking scores are computed for each ligand, and these must finally be aggregated to obtain a single compound score or ranking. Nevertheless, it is still unclear which methodology is best for performing this aggregation. Consensus strategies are a common way to combine ensemble docking results. These strategies are data fusion rules similar to the consensus scoring schemes used to combine the scores from different docking tools. However, their effectiveness is target-dependent and varies from one docking study to another. Moreover, consensus strategies provide only a modest improvement over the average performance of single-conformation docking and rarely exceed the performance of the best single conformation. Previous studies also observed that adding more conformations to the ensemble tends to favor false-positive ligands, decreasing the SBVS predictive power.
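As a rough illustration of what such data fusion rules look like, the sketch below fuses the per-conformation scores of each ligand with a minimum, an arithmetic mean, and a geometric mean, and then ranks the ligands. The function names, the toy scores, and the ligand labels are hypothetical. It also assumes that lower (more negative) scores are better and that all scores are negative, so the geometric mean is taken over the score magnitudes and the sign is restored afterwards; the paper may handle this differently.

```python
import math
from statistics import mean


def cs_min(scores):
    """Keep the best (lowest) score across conformations."""
    return min(scores)


def cs_avg(scores):
    """Arithmetic mean of the scores across conformations."""
    return mean(scores)


def cs_geo(scores):
    """Geometric mean of the scores across conformations.

    ASSUMPTION: every score is negative, so the geometric mean is computed
    on the magnitudes and the negative sign is restored afterwards.
    """
    if any(s >= 0 for s in scores):
        raise ValueError("cs_geo in this sketch expects strictly negative scores")
    log_mean = sum(math.log(-s) for s in scores) / len(scores)
    return -math.exp(log_mean)


def rank_ligands(score_table, rule):
    """Fuse each ligand's scores with `rule` and sort best (lowest) first."""
    fused = {ligand: rule(scores) for ligand, scores in score_table.items()}
    return sorted(fused, key=fused.get)


# Hypothetical docking scores: one value per receptor conformation.
docking_scores = {
    "lig_A": [-9.1, -7.4, -8.0],
    "lig_B": [-6.2, -6.0, -6.5],
    "lig_C": [-7.0, -10.2, -5.1],
}

for rule in (cs_min, cs_avg, cs_geo):
    print(rule.__name__, rank_ligands(docking_scores, rule))
```

In this toy table the minimum rule rewards lig_C for its single very favorable score, whereas the two averaging rules place lig_A first, which shows how the choice of rule can change the final ranking.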


Results of the y-randomization test showing the SBVS methods’ average AUC-ROC values at different percentages of active/inactive shuffled labels. Error bars indicate standard deviations. The csAVG and csGEO strategies had practically the same average and standard deviation values.
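The label-shuffling step behind a y-randomization test of this kind can be sketched as follows. This is only a minimal illustration: the helper name `shuffle_label_fraction` is hypothetical, and the exact shuffling protocol used in the paper is not described here, so permuting the labels within a randomly chosen subset of compounds is an assumption.

```python
import random


def shuffle_label_fraction(labels, fraction, seed=0):
    """Return a copy of `labels` in which a given fraction has been permuted.

    ASSUMPTION: a random subset of positions is chosen and the labels at
    those positions are permuted among themselves, so the overall
    active/inactive ratio is preserved.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_pick = round(fraction * len(labels))
    picked = rng.sample(range(len(labels)), n_pick)
    permuted = [labels[i] for i in picked]
    rng.shuffle(permuted)
    shuffled = list(labels)
    for index, value in zip(picked, permuted):
        shuffled[index] = value
    return shuffled


# Hypothetical labels: 1 = active, 0 = inactive.
true_labels = [1] * 10 + [0] * 40
for fraction in (0.0, 0.25, 0.5, 1.0):
    noisy = shuffle_label_fraction(true_labels, fraction, seed=42)
    changed = sum(a != b for a, b in zip(true_labels, noisy))
    print(f"{fraction:.0%} shuffled: {changed} labels differ from the originals")
```

A model retrained on increasingly shuffled labels should drift toward random performance; if it does not, its apparent accuracy is suspect.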


Here, they ask whether machine learning (ML) could be a better alternative than traditional consensus strategies to exploit ensemble docking results and improve the SBVS performance. ML has been widely and successfully employed in SBVS. Most of these models were developed using intermolecular features computed from experimental protein–ligand complexes and single-conformation docking results. Other studies have used ML to integrate the predictions of different scoring functions computed from a single binding mode.

Therefore, to assess the suitability of combining ML and ensemble docking, they evaluated four proteins as study cases: cyclin-dependent kinase 2 (CDK2), factor Xa (FXa), epidermal growth factor receptor (EGFR), and heat shock protein 90 (HSP90). These four proteins have been extensively used to test and validate SBVS methodologies and have several crystallographic structures available.


Top eight protein conformations (points colored from red-orange to black) selected by the RFE procedure using the GBT as a base estimator. (a) CDK2 protein: 402 conformations. (b) FXa protein: 136 conformations. (c) EGFR protein: 64 conformations. (d) HSP90 protein: 64 conformations. Left panels: cMDS over the protein pocket’s shape. Each point represents a protein conformation. The point’s size is proportional to the protein pocket’s volume, computed by POVME3. Right panels: Swarm plots showing the SCP (AUC-ROC) value obtained by each protein conformation using the raw docking scores from the whole data set. The top eight RFE-ranked conformations are colored from red to black according to their rank position.
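The selection described in this caption can be approximated with recursive feature elimination (RFE), treating each conformation as one feature column of docking scores. The sketch below is an assumption-laden illustration: it uses scikit-learn's `RFE` with `GradientBoostingClassifier` as a stand-in for the paper's GBT implementation, and the data are synthetic, with only the first five hypothetical conformations made informative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data: rows are ligands, columns are conformations.
rng = np.random.default_rng(1)
n_ligands, n_confs = 150, 20
y = (rng.random(n_ligands) < 0.25).astype(int)        # 1 = active, 0 = inactive
X = rng.normal(-7.0, 1.0, size=(n_ligands, n_confs))  # fake docking scores
X[y == 1, :5] -= 1.0  # ASSUMPTION: only the first five conformations separate actives

# Drop one conformation per iteration until eight remain.
base_estimator = GradientBoostingClassifier(n_estimators=25, random_state=0)
selector = RFE(base_estimator, n_features_to_select=8, step=1).fit(X, y)

top_eight = np.flatnonzero(selector.support_)
print("Selected conformation indices:", top_eight)
print("RFE ranking (1 = kept):", selector.ranking_)
```

Note that `ranking_` assigns rank 1 to every retained feature, so ordering the top eight among themselves, as the figure does, would require continuing the elimination or inspecting the fitted estimator's feature importances.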


They selected the CDK2 and FXa proteins because their binding sites are quite different: the CDK2 binding pocket is a deep hydrophobic cavity, whereas the FXa active site is a shallow, solvent-exposed hydrophobic groove. The EGFR and HSP90 proteins, in turn, were suggested by one of the paper's anonymous reviewers as a blind test. They started by building the conformational ensemble of each of the four proteins. Next, ensemble docking simulations were carried out using ligands from three molecular libraries (DUD, DEKOIS 2.0, and CSAR-2012) and cocrystallized compounds. They then used the raw docking scores to perform 30 repetitions of 4-fold cross-validation (30 × 4cv) to train and evaluate two ML algorithms: logistic regression (LR) and gradient boosting trees (GBT). The former is a basic linear classifier, whereas the latter is a relatively new nonlinear ensemble algorithm that has been successfully employed in SBVS. They also considered three consensus strategies: the lowest score (csMIN), the average score (csAVG), and the geometric mean score (csGEO). csMIN and csAVG are among the most commonly used consensus strategies, whereas csGEO was recently recommended as a better alternative to other strategies. Finally, they statistically compared the ML classifiers’ predictive power with that of the consensus strategies. They also explored different conformational selection criteria to evaluate how some associated properties and single-conformation predictions affected the ML performance.
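The 30 × 4cv protocol described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the score matrix is synthetic, the folds are stratified (an assumption), scikit-learn's `GradientBoostingClassifier` stands in for whatever GBT implementation the paper used, and all hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold

# Synthetic stand-in data: rows are ligands, columns are conformations.
rng = np.random.default_rng(0)
n_ligands, n_confs = 200, 8
y = (rng.random(n_ligands) < 0.2).astype(int)         # 1 = active, 0 = inactive
X = rng.normal(-7.0, 1.0, size=(n_ligands, n_confs))  # fake raw docking scores
X[y == 1] -= 0.8  # ASSUMPTION: actives score slightly better (more negative)

# 30 repetitions of 4-fold cross-validation -> 120 train/validation splits.
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=30, random_state=0)
models = {
    "LR": LogisticRegression(max_iter=1000),
    "GBT": GradientBoostingClassifier(n_estimators=25, random_state=0),
}

auc = {name: [] for name in models}
for train_idx, valid_idx in cv.split(X, y):
    for name, model in models.items():
        model.fit(X[train_idx], y[train_idx])
        prob_active = model.predict_proba(X[valid_idx])[:, 1]
        auc[name].append(roc_auc_score(y[valid_idx], prob_active))

for name, values in auc.items():
    print(f"{name}: mean AUC-ROC = {np.mean(values):.3f} over {len(values)} splits")
```

Each classifier therefore yields 120 validation AUC-ROC values, and it is these distributions, rather than a single number, that can be compared statistically against the consensus strategies.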


  1. Joel Ricci-Lopez, Sergio A. Aguila, Michael K. Gilson, and Carlos A. Brizuela. Improving Structure-Based Virtual Screening with Ensemble Docking and Machine Learning. Journal of Chemical Information and Modeling, Article ASAP. DOI: 10.1021/acs.jcim.1c00511