How can SHAP values help to shape metabolic stability of chemical compounds?

It is no secret that drug design and development is a complex process that consumes a huge amount of time and money. Although it now differs significantly from the drug design strategies of the past (when new medicines often emerged through serendipity and fortunate accidents), it is still subject to a relatively high risk of failure. Nevertheless, current strategies for finding new drugs are much more structured, and several steps can be distinguished within them: target identification, finding the lead structure, its optimization, preclinical studies, and three phases of clinical tests.

Finding a new compound that is active towards a particular target is just the first step on the long path of its possible transformation into a drug. Meeting the affinity requirements is not sufficient, as a compound also needs to possess favourable physicochemical and pharmacokinetic properties, and it should not display any toxic effects. Among the parameters considered, it is also important to pay attention to metabolic stability, because if a compound is transformed in the organism too quickly, it does not have enough time to induce the desired biological response.

Figure: Global prediction power of the ML algorithms in (a) classification and (b) regression studies. The figure presents global prediction accuracy, expressed as AUC for classification studies and RMSE for regression experiments, for MACCSFP and KRFP used for compound representation, for human and rat data.

Metabolic stability is one of the most difficult parameters to predict with computational tools because of the extreme complexity of the processes involved in xenobiotic transformations in living organisms. The main role in xenobiotic metabolism is played by cytochrome P450, a group of haemoprotein enzymes with monooxygenase activity. Almost sixty CYP isoforms occur in humans; however, it is CYP3A4 that is responsible for the metabolism of the majority of drugs.

Figure: The 20 features that contribute the most to the outcome of classification models for (a) Naïve Bayes, (b) SVM, and (c) trees, constructed on the human dataset with the use of KRFP.

The large number of processes that contribute to metabolic stability makes the correct prediction of this parameter a challenging task. As a result, publications on in silico tools for evaluating the rate of compound metabolism are scarce. Here, they mention a few examples of such studies. Schwaighofer et al. analyzed compounds examined by Bayer Schering Pharma in terms of the percentage of compound remaining after incubation with liver microsomes for 30 min. The human, mouse, and rat datasets were used, with approximately 1000–2200 datapoints each. The compounds were represented by molecular descriptors generated with Dragon software, and both classification and regression probabilistic models were developed, with the AUC on the test set ranging from 0.690 to 0.835. Lee et al. used MOE descriptors, E-State descriptors, ADME keys, and ECFP6 fingerprints to prepare Random Forest and Naïve Bayes predictive models for the evaluation of compound apparent intrinsic clearance, with the most effective method reaching 75% accuracy on the validation set. A Bayesian approach was also used by Hu et al., with the accuracy of compound assignment to the stable or unstable class ranging from 75 to 78%. Jensen et al. focused on a more structurally consistent group of ligands (calcitriol analogues) and developed a predictive model based on Partial Least-Squares (PLS) regression, which was found to be 85% effective in the stable/unstable class assignment.

Figure: The 20 features that contribute the most to the outcome of regression models for (a) SVM and (b) trees, constructed on the human dataset with the use of KRFP.

In silico evaluation of a particular compound property is a great support for drug design campaigns. However, explaining a predictive model's answers and obtaining guidance on the most advantageous compound modifications is even more helpful. Searching for such structure-activity and structure-property relationships is the subject of Quantitative Structure-Activity Relationship (QSAR) and Quantitative Structure-Property Relationship (QSPR) studies. Such models can be interpreted, for example, via Multiple Linear Regression (MLR) or PLS approaches. Descriptor importance can also be derived relatively easily from tree models. Recently, researchers' attention has also been drawn to deep neural nets (DNNs) and various visualization methods, such as the ‘SAR Matrix’ technique developed by Gupta-Ostermann and Bajorath. The ‘SAR Matrix’ is based on the matched molecular pair (MMP) formalism, which is also widely used for interpreting QSAR/QSPR models. The work of Sasahara et al. is one of the most recent examples of the development of interpretable models for studies on metabolic stability.

Figure: Overlap of important keys. Analysis of the overlap of the 20 most important keys indicated by SHAP values for (a) classification studies and (b) regression studies; (c) legend for SMARTS visualization.

In the present study, the authors focus on the ligand-based approach to metabolic stability prediction. They use datasets of compounds for which the half-lifetime (T1/2) was determined in human- and rat-based in vitro experiments. After representing the compounds with two key-based fingerprints, namely the MACCS keys fingerprint (MACCSFP) and the Klekota & Roth fingerprint (KRFP), they develop classification and regression models (separately for human and rat data) using three machine learning (ML) approaches: Naïve Bayes classifiers, trees, and SVM.
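
To make the idea of a classifier trained on key-based fingerprints more concrete, here is a minimal sketch of a Bernoulli Naïve Bayes model in plain Python. It is only an illustration under stated assumptions: the source does not say which Naïve Bayes variant was used, the four-key fingerprints and the stable/unstable labels below are invented toy data, and the function names `train_bernoulli_nb` and `predict_class` are hypothetical.

```python
from math import log


def train_bernoulli_nb(fingerprints, labels, alpha=1.0):
    """Fit a Bernoulli Naive Bayes model on binary key fingerprints."""
    classes = sorted(set(labels))
    n_keys = len(fingerprints[0])
    model = {}
    for c in classes:
        rows = [fp for fp, y in zip(fingerprints, labels) if y == c]
        prior = len(rows) / len(fingerprints)
        # Laplace-smoothed probability that each key is set within class c.
        p_on = [
            (sum(fp[k] for fp in rows) + alpha) / (len(rows) + 2 * alpha)
            for k in range(n_keys)
        ]
        model[c] = (log(prior), p_on)
    return model


def predict_class(model, fp):
    """Return the class with the highest log-posterior for one fingerprint."""
    scores = {}
    for c, (log_prior, p_on) in model.items():
        score = log_prior
        for bit, p in zip(fp, p_on):
            score += log(p) if bit else log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


# Toy data (assumption): each row is a 4-key fingerprint, 1 = key present.
train_fps = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
train_labels = ["stable", "stable", "stable", "unstable", "unstable", "unstable"]

model = train_bernoulli_nb(train_fps, train_labels)
print(predict_class(model, [1, 0, 1, 0]))
print(predict_class(model, [0, 1, 0, 1]))
```

In practice such models would be trained with an established ML library on real fingerprints; the sketch only shows how the presence or absence of each key feeds into the class decision.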

Finally, they use Shapley Additive exPlanations (SHAP) to examine the influence of particular chemical substructures on the model’s outcome. This is in line with the most recent recommendations for constructing explainable predictive models, as the knowledge they provide can relatively easily be transferred into medicinal chemistry projects and help in optimizing a compound towards its desired activity or physicochemical and pharmacokinetic profile. SHAP assigns to each feature in a given prediction a value that can be seen as its importance. These values are calculated for each prediction separately and do not describe the model as a whole. High absolute SHAP values indicate high importance, whereas values close to zero indicate low importance of a feature.
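
The per-prediction nature of these values can be illustrated with a small brute-force computation of Shapley values. The sketch below is not the authors' pipeline and does not use the SHAP library: it assumes a hypothetical scoring function `toy_model` over four binary keys, and it treats "missing" features by averaging over an invented background set. The enumeration is exponential in the number of features, so it is feasible only for toy inputs like this one.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values for a single prediction.

    predict:    function mapping a fingerprint (list of 0/1) to a number
    x:          fingerprint of the compound being explained
    background: reference fingerprints; keys outside a coalition take the
                value from each background row in turn, and results are averaged
    """
    n = len(x)

    def value(coalition):
        # Average prediction when only the keys in `coalition` come from x.
        total = 0.0
        for row in background:
            mixed = [x[i] if i in coalition else row[i] for i in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(fp):
    # Hypothetical score standing in for a trained model; key 3 is unused.
    return 0.2 + 0.5 * fp[0] - 0.3 * fp[1] + 0.1 * fp[0] * fp[2]


# Invented background set and query compound (assumptions for illustration).
background = [[0, 0, 0, 0], [1, 0, 1, 0], [0, 1, 1, 1], [1, 1, 0, 1]]
compound = [1, 1, 1, 0]

phi = shapley_values(toy_model, compound, background)
baseline = sum(toy_model(row) for row in background) / len(background)

print("per-key contributions:", [round(v, 4) for v in phi])
print("sum of contributions:", round(sum(phi), 4))
print("prediction - baseline:", round(toy_model(compound) - baseline, 4))
```

Two properties are worth noting in the output: the contributions add up to the difference between this compound's prediction and the average background prediction, and a key that the model never uses receives a contribution of zero. A different compound would generally receive different values from the same model.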

Figure: Analysis of the metabolic stability prediction for CHEMBL2207577 with the use of SHAP values for the human/KRFP/trees predictive model, with indication of the features influencing its assignment to the class of stable compounds; the SMARTS visualization was generated with the use of SMARTS plus.

The authors also provide a web service that enables users to analyze new compounds they submit in terms of the contribution of particular structural features to the outcome of half-lifetime predictions. It returns not only a SHAP-based analysis for the submitted compound, but also an analogous evaluation for the most similar compound from the ChEMBL dataset. Thanks to these functionalities, the service can be of great help to medicinal chemists designing new ligands with improved metabolic stability.
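
Looking up the most similar reference compound can be sketched as a nearest-neighbour search over fingerprints. The source does not state which similarity measure the service uses; the example below assumes Tanimoto similarity, a common choice for binary fingerprints, and the compound identifiers and five-key fingerprints are invented placeholders.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints."""
    on_a = {i for i, bit in enumerate(fp_a) if bit}
    on_b = {i for i, bit in enumerate(fp_b) if bit}
    union = on_a | on_b
    if not union:
        return 0.0  # convention chosen here for two all-zero fingerprints
    return len(on_a & on_b) / len(union)


def most_similar(query_fp, reference):
    """Return (identifier, similarity) of the closest reference compound."""
    return max(
        ((cid, tanimoto(query_fp, fp)) for cid, fp in reference.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical reference set; real identifiers and fingerprints would differ.
reference = {
    "ref-A": [1, 1, 0, 0, 1],
    "ref-B": [0, 1, 1, 0, 0],
    "ref-C": [1, 0, 0, 1, 1],
}
query = [1, 1, 0, 0, 0]

best_id, best_sim = most_similar(query, reference)
print(best_id, round(best_sim, 3))
```

Showing the explanation for a close, already measured neighbour next to the one for the submitted compound gives a point of comparison when judging how a structural change might affect the prediction.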

Wojtuch, A., Jankowski, R. & Podlewska, S. How can SHAP values help to shape metabolic stability of chemical compounds? J Cheminform 13, 74 (2021).