Evaluation of deep and shallow learning methods in chemogenomics

Evaluation of deep and shallow learning methods in chemogenomics for the prediction of drugs specificity

Chemogenomics, also called proteochemometrics, covers a range of computational methods that can be used to predict protein–ligand interactions at large scales in the protein and chemical spaces. These methods differ from more classical ligand-based methods (also called QSAR), which predict ligands for a given protein receptor. In the context of the drug discovery process, chemogenomics makes it possible to predict off-target proteins for drug candidates, one of the main causes of undesirable side effects and of failure within drug development processes. The present study compares shallow and deep machine-learning approaches for chemogenomics, and explores data augmentation techniques for deep learning algorithms in chemogenomics. Shallow machine-learning algorithms rely on expert-based chemical and protein descriptors, while recent developments in deep learning make it possible to learn abstract numerical representations of molecular graphs and protein sequences, in order to optimise performance on the prediction task. The authors first propose a formulation of chemogenomics with deep learning, called the chemogenomic neural network (CN), as a feed-forward neural network taking as input the combination of molecule and protein representations learnt by molecular graph and protein sequence encoders. They show that, on large datasets, the deep learning CN model outperforms state-of-the-art shallow methods, and competes with deep methods that use expert-based descriptors. However, on small datasets, shallow methods achieve better prediction performance than deep learning methods. Then, they evaluate data augmentation techniques, namely multi-view and transfer learning, to improve the prediction performance of the chemogenomic neural network. They conclude that a promising research direction is to integrate heterogeneous sources of data, such as auxiliary tasks for which large datasets are available or, independently, multiple molecule and protein attribute views.

The chemogenomic neural network (CN)

The current paradigm in rationalized drug design usually associates a disease with one or several protein targets, called therapeutic targets, belonging to biological pathways that are involved in the disease development. The goal of the drug discovery process is then to identify a drug molecule that binds to the protein target and alters disease evolution. However, the drug discovery process has limited success, and only a few tens of new drugs reach the market every year. Indeed, most of the hits identified by high-throughput screening (HTS) fail to become approved drugs because of unwanted side effects and toxicity. This results in part from unexpected interactions with so-called “off-target” proteins, due to the drugs' lack of specificity.

Sketch of the graph neural network iterative process. (a) The $\mathrm{AGGREGATE}^{(l)}_{\mathrm{node}}$ function updates node representation vectors by aggregating information coming from the node itself and its neighbours. (b) As the process iterates, nodes receive information from further nodes in the graph. (c) The $\mathrm{AGGREGATE}^{(l)}_{\mathrm{graph}}$ function builds a graph-level representation vector by aggregating information from all nodes. (d) A graph-level representation is learned at each iteration.
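The iterative process described in this caption can be illustrated with a minimal sketch. It assumes that both aggregation functions are plain element-wise sums with no learned weights, which is a simplification chosen for readability and not the aggregation used in the paper; the function names `aggregate_node`, `aggregate_graph` and `encode_graph` are hypothetical.

```python
def aggregate_node(features, adjacency):
    """One iteration: each node sums its own vector with its neighbours' vectors.

    Assumption: a plain sum stands in for the learned node aggregation.
    """
    updated = []
    for node, own in enumerate(features):
        total = list(own)
        for neighbour in adjacency[node]:
            total = [t + x for t, x in zip(total, features[neighbour])]
        updated.append(total)
    return updated


def aggregate_graph(features):
    """Graph-level vector: element-wise sum over all node vectors (assumption)."""
    return [sum(column) for column in zip(*features)]


def encode_graph(features, adjacency, n_iterations):
    """Run the iterations and keep one graph-level vector per iteration."""
    graph_vectors = []
    for _ in range(n_iterations):
        features = aggregate_node(features, adjacency)
        graph_vectors.append(aggregate_graph(features))
    return graph_vectors


# Toy path graph 0 - 1 - 2 with one-hot node features.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
features = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

graph_vectors = encode_graph(features, adjacency, n_iterations=2)
print(graph_vectors)  # one graph-level vector per iteration
```

In this toy graph, node 0 only sees node 1 after the first iteration, and information from node 2 reaches it at the second iteration, which mirrors point (b) of the caption.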

However, it is impossible to conduct bio-assays at the human proteome scale to discard molecules with unacceptable off-targets, and chemoinformatics algorithms provide useful in silico screening methods. Chemogenomics is a generalisation of Quantitative Structure-Activity Relationship (QSAR) methods. QSAR methods can predict interactions for a given protein (or a few, in the case of multi-task QSAR), while chemogenomic models are trained to simultaneously predict interactions for several proteins, with the underlying idea that the prediction of a drug–target interaction may benefit from interactions known between other targets and other molecules. Chemogenomics makes it possible to predict unexpected “off-targets” of drugs, and to guide experiments so that only interactions predicted with high probability scores are tested. Note that some authors use the word “proteochemometrics” as a synonym for chemogenomics, and various studies report the diversity of applications of proteochemometrics. The word “proteochemometrics” is often preferred in papers that study the impact of protein and molecule descriptors on prediction performance.

Performance on DBHuman for the S1 (random split), S2 (orphan proteins in test set), S3 (orphan molecules in test set), and S4 (double orphan in test set) settings, with a positive:negative sample ratio set to 1:5. For each setting, the results are displayed from left to right in the order given in the legend.
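The four settings named in this caption can be sketched as follows. This is an illustrative split under stated assumptions, not the authors' protocol: the function `split_pairs`, the `test_fraction` parameter, the choice of held-out entities by random sampling, and the decision to drop pairs that share only one held-out entity in S4 are all hypothetical.

```python
import random


def split_pairs(pairs, setting, test_fraction=0.2, seed=0):
    """Split (protein, molecule, label) triples into train and test lists.

    S1: random split of pairs.
    S2: test proteins never appear in train (orphan proteins).
    S3: test molecules never appear in train (orphan molecules).
    S4: neither test proteins nor test molecules appear in train.
    """
    if setting not in ("S1", "S2", "S3", "S4"):
        raise ValueError(f"unknown setting: {setting}")
    rng = random.Random(seed)

    if setting == "S1":
        shuffled = list(pairs)
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_fraction))
        return shuffled[n_test:], shuffled[:n_test]

    proteins = sorted({p for p, _, _ in pairs})
    molecules = sorted({m for _, m, _ in pairs})
    held_proteins, held_molecules = set(), set()
    if setting in ("S2", "S4"):
        n = max(1, int(len(proteins) * test_fraction))
        held_proteins = set(rng.sample(proteins, n))
    if setting in ("S3", "S4"):
        n = max(1, int(len(molecules) * test_fraction))
        held_molecules = set(rng.sample(molecules, n))

    train, test = [], []
    for pair in pairs:
        protein, molecule, _ = pair
        orphan_p = protein in held_proteins
        orphan_m = molecule in held_molecules
        if setting == "S2":
            (test if orphan_p else train).append(pair)
        elif setting == "S3":
            (test if orphan_m else train).append(pair)
        elif orphan_p and orphan_m:
            test.append(pair)
        elif not orphan_p and not orphan_m:
            train.append(pair)
        # S4 (assumption): pairs with exactly one held-out entity are dropped.
    return train, test


# Toy data: 5 proteins x 5 molecules with arbitrary labels.
pairs = [(f"P{i}", f"M{j}", int((i + j) % 3 == 0))
         for i in range(5) for j in range(5)]

for setting in ("S1", "S2", "S3", "S4"):
    train, test = split_pairs(pairs, setting)
    print(setting, len(train), len(test))
```

The 1:5 positive:negative ratio mentioned in the caption is not enforced here; it would require a separate negative-sampling step.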

In the present paper, they tackle the question of predicting drug specificity at large scale in the space of proteins, and therefore the word “chemogenomics” appeared more adequate. They explore chemogenomics with deep learning approaches, including transfer learning strategies, and compare their prediction performance to that of state-of-the-art shallow methods. They formulate the problem as a Drug–Target Interaction (DTI) prediction task, that is, a binary classification of (protein, molecule) pairs that interact or not.

Outline and contributions

The “Reference shallow methods in machine-learning for chemogenomics” section recalls key reference shallow machine-learning methods for chemogenomics. In the “Proposed chemogenomic neural network” section, they present a formal scheme, named the chemogenomic neural network (CN), for molecule and protein representation learning in the context of deep learning for chemogenomics. The “Related works” section briefly reviews related works in chemogenomics and QSAR with deep learning approaches. Then, they first compare the prediction performance of state-of-the-art shallow and deep machine-learning methods, a comparison that was never discussed in previous chemogenomics studies. They show that, on small datasets, the simpler and less computationally demanding shallow methods perform better than deep learning methods. On large datasets, the proposed chemogenomic neural network (CN) with representation learning competes with state-of-the-art shallow and deep methods that use expert-based descriptors, but is not ultimately superior. These conclusions differ from those of previous works, which considered specific datasets and compared deep learning approaches with baseline (i.e. not state-of-the-art) shallow methods.
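The DTI formulation used in this paper, a binary classification of (protein, molecule) pairs by a feed-forward network over combined representations, can be illustrated with a minimal forward pass. The sketch assumes that the two representation vectors are already available (the graph and sequence encoders are not shown), that they are combined by concatenation, and that a single ReLU hidden layer is used; these choices and all names (`predict_interaction`, `hidden_w`, `out_w`, ...) are hypothetical and are not taken from the paper.

```python
import math


def dense(vector, weights, biases):
    """Affine layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def relu(vector):
    return [max(0.0, x) for x in vector]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_interaction(molecule_vec, protein_vec,
                        hidden_w, hidden_b, out_w, out_b):
    """Return an interaction probability for one (protein, molecule) pair."""
    # Assumption: the pair representation is a simple concatenation.
    pair_vec = list(molecule_vec) + list(protein_vec)
    hidden = relu(dense(pair_vec, hidden_w, hidden_b))
    logit = dense(hidden, [out_w], [out_b])[0]
    return sigmoid(logit)


# Toy representations and weights (arbitrary values, for illustration only).
molecule_vec = [0.5, -1.0]
protein_vec = [1.0, 0.0, 2.0]
hidden_w = [[0.1, 0.2, -0.3, 0.0, 0.4],
            [-0.2, 0.1, 0.5, 0.3, -0.1]]
hidden_b = [0.0, 0.1]
out_w = [0.7, -0.6]
out_b = 0.05

probability = predict_interaction(molecule_vec, protein_vec,
                                  hidden_w, hidden_b, out_w, out_b)
print(probability)  # a value strictly between 0 and 1
```

In a trained model, the weights and the encoders producing the two input vectors would be learned jointly by minimising a binary classification loss over known interacting and non-interacting pairs.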

Furthermore, they consider data augmentation techniques, namely multi-view and transfer learning. In particular, they tested the benefit of combining expert-based and learnt features. They also tested various implementations of transfer learning, using additional larger datasets to pre-train the encoders before their use in the chemogenomic task of interest. They show that multi-view and transfer learning can improve the prediction performance of deep learning methods on small datasets, although without reaching the performance of shallow methods. However, for large datasets, they also show that a less sophisticated model, in which the deep neural network simply uses expert-based molecule and protein descriptors (i.e. descriptors that are not learnt), outperforms state-of-the-art shallow models.

They finally present the proposed methods for chemogenomics with deep learning.

Reference shallow methods in machine-learning for chemogenomics

Various chemogenomics methods have been proposed in the last decade. They differ by (i) the descriptors used to encode proteins and ligands, (ii) how similarities are measured between these objects, and (iii) the machine-learning algorithm that is used to learn the model and make the predictions. Jacob and Vert used the Kronecker product of protein and ligand kernels to define the kernel associated with the chemogenomic space, i.e. the space of (protein, molecule) pairs. This approach has been successfully applied to DTI prediction. In this paper, this method, called “kronSVM”, is used as a reference to benchmark the considered deep learning algorithms for chemogenomics.

As another reference shallow model, they also used matrix factorisation (MF) approaches, which decompose the (protein, molecule) interaction matrix into the product of two matrices of lower rank that operate in the two corresponding latent spaces of proteins and molecules. More precisely, they used the NRLMF method, because it was shown to outperform other shallow state-of-the-art methods on various datasets.
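The Kronecker product kernel described for kronSVM can be illustrated with a short sketch: the similarity between two (protein, molecule) pairs is the product of the protein kernel value and the molecule kernel value. The function name `kronecker_pair_kernel` and the toy kernel matrices are hypothetical; the protein and molecule kernels actually used in the paper are not reproduced here.

```python
def kronecker_pair_kernel(k_protein, k_molecule, pairs):
    """Gram matrix over (protein index, molecule index) pairs.

    Entry (i, j) is k_protein[p_i][p_j] * k_molecule[m_i][m_j], i.e. the
    corresponding entry of the Kronecker product of the two kernel matrices.
    """
    n = len(pairs)
    gram = [[0.0] * n for _ in range(n)]
    for i, (p_i, m_i) in enumerate(pairs):
        for j, (p_j, m_j) in enumerate(pairs):
            gram[i][j] = k_protein[p_i][p_j] * k_molecule[m_i][m_j]
    return gram


# Toy kernels for 2 proteins and 2 molecules (arbitrary similarity values).
k_protein = [[1.0, 0.5],
             [0.5, 1.0]]
k_molecule = [[1.0, 0.2],
              [0.2, 1.0]]

# All four (protein, molecule) pairs, identified by their indices.
pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]

gram = kronecker_pair_kernel(k_protein, k_molecule, pairs)
for row in gram:
    print(row)
```

Only the entries for pairs present in the dataset need to be computed, so the full Kronecker product over all proteins and molecules never has to be stored. A Gram matrix of this form could then be passed to any SVM implementation that accepts a precomputed kernel.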

Playe, B., & Stoven, V. (2020). Evaluation of deep and shallow learning methods in chemogenomics for the prediction of drugs specificity. Journal of cheminformatics, 12(1), 11.