Large scale comparison of QSAR and conformal prediction methods

Large scale comparison of QSAR and conformal prediction methods and their applications in drug discovery

Structure–activity relationship modelling is frequently used in the early stage of drug discovery to assess the activity of a compound on one or several targets, and can also be used to assess the interaction of compounds with liability targets. QSAR models have been used for these and related applications over many years, with good success. Conformal prediction is a relatively new QSAR approach that provides information on the certainty of a prediction, and so helps in decision-making. However, it is not always clear how best to make use of this additional information. In this article, the authors describe a case study that directly compares conformal prediction with traditional QSAR methods for large-scale predictions of target-ligand binding. The ChEMBL database was used to extract a data set comprising data from 550 human protein targets with different bioactivity profiles. For each target, a QSAR model and a conformal predictor were trained and their results compared. The models were then evaluated on new data published since the original models were built, to simulate a “real world” application. The comparative study highlights the similarities between the two techniques, but also some differences that are important to bear in mind when the methods are used in practical drug discovery applications.

Schema of the data collection from ChEMBL

Public databases of bioactivity data play a critical role in modern translational science. They provide a central place to access the ever-increasing amounts of data that would otherwise have to be extracted from tens of thousands of different journal articles. They make the data easier to use by automated and/or manual classification, annotation and standardization approaches. Finally, by making their content freely accessible, the entire scientific community can query, extract and download information of interest. As a result, such public resources have been instrumental in the evolution of disciplines such as data mining and machine learning. PubChem and ChEMBL represent the two largest public domain databases of molecular activity data. The latest release (version 24) of ChEMBL contains more than 6 million curated data points for around 7500 protein targets and 1.2 million distinct compounds. This represents a gold mine for chemists, biologists, toxicologists and modellers alike.

Contemporary experimental approaches and publication norms mean that the ChEMBL database is inherently sparsely populated with regard to the compound/target matrix. Therefore, in silico models are particularly useful, as they can in principle be used to predict activities for protein-molecule pairs that are absent from the public experimental record and the compound/target data matrix. Quantitative structure–activity relationship (QSAR) models have been used for decades to predict the activities of compounds on a given protein. These models are then frequently used for selecting compound subsets for screening and to identify compounds for synthesis, but also have other applications ranging from prediction of blood–brain barrier permeation to toxicity prediction. These many applications of QSAR not only differ in their scope but also in terms of the level of confidence required for the results to be practically useful. For example, it could be considered that compound selection for screening may tolerate a lower level of confidence than synthesis suggestions due to the inherently higher cost of the latter.

Traditional QSAR and machine learning methods suffer from the lack of a formal confidence score associated with each prediction. The concept of a model’s applicability domain (AD) aims to address this by representing the chemical space outside which the predictions cannot be considered reliable. However, the concept of chemical space can be fuzzy, and it is not always straightforward to represent its boundaries. Recently, some new techniques have been introduced which aim to address this issue of confidence associated with machine learning results. In this article the authors focus on conformal prediction (CP), but recognize that there are also alternatives, such as Venn–ABERS predictors, which have also been applied in drug discovery. As with QSAR, these approaches rely on a training set of compounds characterized by a set of molecular descriptors that is used to build a model using a machine learning algorithm. However, their mathematical frameworks differ: QSAR predictions are the direct outputs of the model, whereas CP and Venn–ABERS rely on experience provided by a calibration set to assign a confidence level to each prediction.
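The role of the calibration set can be illustrated with a minimal sketch of how an inductive conformal predictor might turn a nonconformity score into a p-value. This is not the authors' implementation: the function name, the choice of nonconformity score and the numbers are assumptions made purely for illustration.

```python
from typing import Sequence


def conformal_p_value(calibration_scores: Sequence[float], test_score: float) -> float:
    """Fraction of calibration examples at least as nonconforming as the test example."""
    at_least_as_strange = sum(1 for s in calibration_scores if s >= test_score)
    # The +1 terms count the test example itself, so the p-value is never zero.
    return (at_least_as_strange + 1) / (len(calibration_scores) + 1)


# Hypothetical nonconformity scores (higher = less typical), for example
# 1 - predicted probability of the true class on held-out calibration compounds.
calibration = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.70, 0.80, 0.90]

p_typical = conformal_p_value(calibration, 0.15)  # resembles the calibration data
p_unusual = conformal_p_value(calibration, 0.95)  # stranger than every calibration example

print(p_typical, p_unusual)
```

A compound whose score looks like the calibration data receives a high p-value, while one that is stranger than everything seen during calibration receives a low one. That p-value is what a conformal predictor uses to attach a confidence level to a prediction.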

The mathematical concepts behind CP have been published by Vovk et al., and the method has been described in the context of protein-compound interaction prediction by Norinder et al. Several applications of CP in drug discovery and toxicity prediction have also been reported. In practice, it is common to observe the results using different confidence levels and to decide, a posteriori, with what confidence a CP model can be trusted.
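To make the effect of the confidence level concrete, here is a small sketch of how per-class p-values might be converted into a prediction set at several significance levels (significance = 1 - confidence). The function name, labels and p-values are hypothetical and not taken from the paper.

```python
from typing import Dict, Set


def prediction_set(p_values: Dict[str, float], significance: float) -> Set[str]:
    """Keep every label whose p-value exceeds the significance level."""
    return {label for label, p in p_values.items() if p > significance}


# Hypothetical p-values for a single compound.
p_values = {"active": 0.60, "inactive": 0.15}

# Significance levels 0.10, 0.20 and 0.70 correspond to
# confidence levels of 90%, 80% and 30%.
for significance in (0.10, 0.20, 0.70):
    labels = sorted(prediction_set(p_values, significance))
    print(f"significance {significance:.2f}: {labels}")
```

With these illustrative numbers, the compound is assigned both labels at 90% confidence, only "active" at 80% confidence, and no label at all at 30% confidence. Inspecting how the sets change across levels is one way to judge, a posteriori, which confidence level is practically useful.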

In this study, the development of QSAR and CP models for many protein targets is described and the differences in their predictions are examined. The authors used the data available in the ChEMBL database for this purpose. As described later in the paper, the general challenges with such an application are that sometimes only a limited number of data points is available and that the activity classes are imbalanced. This requires a compromise between the number of models that can be built, the number of data points used to build each model, and model performance. Unfortunately, this situation is very common in drug discovery, where predictive models can have the biggest impact early in a project when (by definition) there may be relatively little data available. To cope with these limitations, the authors used class weighting for QSAR and Mondrian conformal prediction (MCP). Finally, they aim to compare QSAR and MCP as objectively as possible, making full use of all the data, subject to the constraints inherent in each method [1].
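The two strategies for handling class imbalance can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the inverse-frequency weighting shown is one common heuristic (the paper's exact weighting scheme is not given here), and all function names, labels and scores are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def balanced_class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class inversely to its frequency (assumed heuristic)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def mondrian_p_values(
    calibration: Sequence[Tuple[str, float]],
    test_scores: Dict[str, float],
) -> Dict[str, float]:
    """Class-conditional p-values: each candidate label is compared only
    with calibration examples that truly belong to that class."""
    p_values = {}
    for label, test_score in test_scores.items():
        same_class = [score for y, score in calibration if y == label]
        at_least_as_strange = sum(1 for s in same_class if s >= test_score)
        p_values[label] = (at_least_as_strange + 1) / (len(same_class) + 1)
    return p_values


# Hypothetical imbalanced training labels: 8 inactives, 2 actives.
train_labels = ["inactive"] * 8 + ["active"] * 2
weights = balanced_class_weights(train_labels)

# Hypothetical calibration set of (true label, nonconformity score) pairs.
calibration = [("inactive", s) for s in (0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50)]
calibration += [("active", s) for s in (0.30, 0.70, 0.90)]

# Nonconformity of one test compound under each candidate label.
p = mondrian_p_values(calibration, {"active": 0.60, "inactive": 0.45})

print(weights)
print(p)
```

Class weighting makes errors on the minority class count more during QSAR model training, whereas the Mondrian approach calibrates each class separately, so that the error rate guarantee holds per class rather than only on average across an imbalanced data set.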

  1. Bosc, N., Atkinson, F., Felix, E. et al. Large scale comparison of QSAR and conformal prediction methods and their applications in drug discovery. J Cheminform 11, 4 (2019).

