Nonadditivity in public and inhouse data: implications for drug design

Numerous ligand-based drug discovery projects are based on structure-activity relationship (SAR) analysis, such as Free-Wilson (FW) or matched molecular pair (MMP) analysis. Intrinsically, these techniques assume linearity and additivity of substituent contributions. They are challenged by nonadditivity (NA) in protein-ligand binding, where the change of two functional groups in one molecule results in much higher or lower activity than expected from the respective single changes. Identifying nonlinear cases and possible underlying explanations is crucial for a drug design project since it might influence which lead to follow. By systematically analyzing all AstraZeneca (AZ) inhouse compound data and publicly available ChEMBL25 bioactivity data, we show significant NA events in almost every second assay in the inhouse data and in every third assay in the public data sets. Furthermore, 9.4% of all compounds of the AZ database and 5.1% from public sources display significant additivity shifts, indicating important SAR features or fundamental measurement errors. Using NA data in combination with machine learning showed that nonadditive data is challenging to predict, and even the addition of nonadditive data into training did not result in an increase in predictivity. Overall, NA analysis should be applied on a regular basis in many areas of computational chemistry and can further improve rational drug design.

Schematic depiction of a double-transformation cycle consisting of four molecules. These four molecules are linked by two transformations: first, changing the yellow square to a magenta square and, second, changing the light blue circle to a dark blue circle. Nonadditivity is calculated using the compounds’ activities pAct1-4.

The similarity and additivity principles represent the basis of various well-established areas in computer-aided drug design (CADD) such as Free-Wilson (FW) analysis, two-dimensional (2D)/three-dimensional (3D) quantitative structure-activity relationship (QSAR), matched molecular pair (MMP) analysis, and computational scoring functions. Similarity and additivity are often implicitly assumed in CADD approaches in order to identify favorable molecular descriptors and predict the activity of new molecules. Otherwise, chemists would have to synthesize and biologically evaluate every single molecule.

The data curation process of public ChEMBL25 data representing number of measurements after each cleaning step

Yet, both of these principles are subject to frequent disruptions. The exceptions to the similarity principle often complicate SAR analysis. So-called ‘activity cliffs’ refer to structurally very similar compound pairs with large differences in potency. Exceptions to linearity and additivity occur when the combination of substituents significantly boosts or decreases the biological activity of a ligand. Nonadditivity (NA) may have several underlying reasons, including inconsistency in the binding pose of the central scaffold inside the pocket and steric clashes. Conformational changes in the binding pocket, as well as a complete reorientation of the ligand, alter the free energy of binding. Furthermore, many nonadditive ‘magic methyl’ cases, i.e. cases in which attaching a simple alkyl fragment to a ligand greatly increases the biological activity, can be explained by conformational changes such as the so-called ‘ortho-effect’.

Theoretical NA distribution expected from an experimental uncertainty of (a) 0.3 and (b) 0.5 log units (grey lines), and observed NA distribution for all (a) AZ (yellow) and (b) ChEMBL (blue) assays. Density = normalized count so that the area sums to 1.

Additivity and NA of ligand binding have been studied for many years and can be perceived as a specific kind of interaction between functional groups. By analyzing public SAR data sets for strong NA (ΔΔpActivity > 2.0 log units) and the respective X-ray structures, Kramer et al. showed that cases of strong NA are accompanied by changes in binding mode. Babaoglu and Shoichet applied an inverse, deconstructive logic to structure-based drug design (SBDD) and, by studying β-lactamase inhibitors, demonstrated that fragments often do not recapitulate the binding affinity of the parent molecule. The study of Miller and Wolfenden on substrate recognition demonstrated that the combination of distinct functional groups shows strong nonadditive behavior. The work of Hajduk et al. on stromelysin inhibitors and of Congreve et al. on CDK inhibitors showed that the molecular affinity after combining a certain number of functional groups is much higher than expected. Patel et al. examined various combinatorial libraries assayed on several biological responses and concluded that only half of the data is additive. McClure and colleagues developed a method to determine FW additivity in a combinatorial matrix of compounds (when multiple R groups are altered simultaneously; combinatorial analoging), and they intuitively explained the occurring NA by changes in binding mode without any structural validation.

NA distribution among all curated assays from AZ inhouse (a) and public ChEMBL (b) data sets

Water molecules play a major role in ligand-protein interactions by participating in extended hydrogen-bond networks. Baum, Muley, and co-workers thoroughly analyzed the structural data and the reasons behind NA at the molecular level, showing that NA can be the result of changes in the entropy and enthalpy profiles caused by hydrophobic interactions, hydrogen bonding, and a loss of residual mobility of the bound ligands. In another study, Kuhn et al. proposed that internal hydrogen bonding gives rise to NA during compound optimization. Gomez et al. explained NA as caused by protein structural changes upon ligand binding.

NA distribution for all DTCs among curated assays from AZ (a) and ChEMBL (b) data sets. NA distribution of DTCs showing a significant NA score (from 0.6 to 2 log units) in AZ (c) and (from 1 to 2 log units) in ChEMBL (d) bioactivity data

According to these studies, instead of seeing NA as a problem, it should be interpreted as a hint towards key SAR features and variations in the binding modes. Identifying NA and understanding the reasons behind it is crucial for rational drug design, since it provides valuable information about ligand-protein contacts and molecular recognition. NA analysis helps us to identify potential SAR outliers in a data set, ultimately suggesting interesting structural properties that might change the course of small molecule optimization. Importantly, NA might also be caused by experimental noise.

NA distribution among all unique compounds from AZ (a) and ChEMBL (b) data sets

NA is calculated from so-called double-mutant or double-transformation cycles (DTC). These cycles consist of four molecules, which give rise to four MMPs, and are linked by two transformations, each of which occurs twice in the cycle. The nonadditivity of the DTC is calculated from the molecules’ individual activities. If the transformations were perfectly additive, the difference between the corresponding activity changes would be zero. However, a non-zero value does not necessarily indicate nonadditivity. Since each of the four measurements in a cycle carries experimental uncertainty, the experimental noise might add up and result in false nonadditive cases. Therefore, it is critical to distinguish real NA from assay noise.
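The calculation described above can be illustrated with a minimal Python sketch. The compound labeling used here (parent, transformation A only, transformation B only, both A and B) is an assumption made for illustration and may differ from the pAct1-4 numbering in the schematic figure; the function names and the cutoff parameter `n_sd` are likewise hypothetical. The noise estimate follows from standard error propagation: if the four activities are measured independently, each with standard deviation sigma, the NA value has a standard deviation of sqrt(4) * sigma = 2 * sigma.

```python
import math


def nonadditivity(p_parent, p_a, p_b, p_ab):
    """NA of one double-transformation cycle, in log units.

    Assumed labeling (illustrative only):
      p_parent: activity of the unmodified compound
      p_a:      activity after transformation A only
      p_b:      activity after transformation B only
      p_ab:     activity after both transformations
    The effect of A measured in the presence of B is compared with
    the effect of A measured in the absence of B.
    """
    return (p_ab - p_b) - (p_a - p_parent)


def noise_sd(sigma):
    """Standard deviation of NA expected from measurement noise alone,
    assuming four independent measurements with equal uncertainty sigma."""
    return math.sqrt(4.0) * sigma


def exceeds_noise(na, sigma, n_sd=2.0):
    """True if |NA| is larger than n_sd noise standard deviations.
    The default n_sd is an arbitrary illustrative cutoff."""
    return abs(na) > n_sd * noise_sd(sigma)


# Additive cycle: both transformations contribute independently.
print(nonadditivity(6.0, 7.0, 6.5, 7.5))  # 0.0

# Nonadditive cycle: the double-substituted compound is far more active.
na = nonadditivity(6.0, 7.0, 6.5, 9.0)
print(na)                            # 1.5
print(exceeds_noise(na, sigma=0.3))  # True
```

With an assumed experimental uncertainty of 0.3 log units, noise alone would give NA values with a standard deviation of about 0.6 log units, so small non-zero NA values are not evidence of real nonadditivity on their own.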

(a) Distribution of the compounds in DTC. (b) Distribution of the compounds showing a significant NA shift per assay. Density = normalized count so that the area sums to 1.

Density distribution of the assays showing significant NA from AZ (a) and ChEMBL (b) based on the average NA and the number of compounds in each assay

1. Gogishvili, D., Nittinger, E., Margreitter, C. et al. Nonadditivity in public and inhouse data: implications for drug design. J Cheminform 13, 47 (2021).