Analyzing Learned Molecular Representations for Property Prediction

Advancements in neural machinery have led to a wide range of algorithmic solutions for molecular property prediction. Two classes of models in particular have yielded promising results: neural networks applied to computed molecular fingerprints or expert-crafted descriptors and graph convolutional neural networks that construct a learned molecular representation by operating on the graph structure of the molecule. However, recent literature has yet to clearly determine which of these two methods is superior when generalizing to new chemical space. Furthermore, prior research has rarely examined these new models in industry research settings in comparison to existing employed models. In this paper, they benchmark models extensively on 19 public and 16 proprietary industrial data sets spanning a wide variety of chemical end points. In addition, they introduce a graph convolutional model that consistently matches or outperforms models using fixed molecular descriptors as well as previous graph neural architectures on both public and proprietary data sets. Their empirical findings indicate that while approaches based on these representations have yet to reach the level of experimental reproducibility, their proposed model nevertheless offers significant improvements over models currently used in industrial workflows.

Molecular property prediction, one of the oldest cheminformatics tasks, has received new attention in light of recent advancements in deep neural networks. These architectures either operate over fixed molecular fingerprints common in traditional QSAR models, or they learn their own task-specific representations using graph convolutions. Both approaches are reported to yield substantial performance gains, raising state-of-the-art accuracy in property prediction.

Despite these successes, many questions remain unanswered.

The first question concerns the comparison between learned molecular representations and fingerprints or descriptors. Unfortunately, current published results on this topic do not provide a clear answer. Wu et al. demonstrate that convolution-based models typically outperform fingerprint-based models, while Mayr et al. report the opposite. Part of this discrepancy can be attributed to differences in evaluation setup, including the way data sets are constructed. This leads to a broader question concerning current evaluation protocols and their capacity to measure the generalization power of a method when applied to a new chemical space, as is common in drug discovery. Unless special care is taken to replicate this distributional shift in evaluation, neural models may overfit the training data but still score highly on the test data. This is particularly true for convolutional models, which can learn a poor molecular representation by memorizing the molecular scaffolds in the training data and thereby fail to generalize to new ones. Therefore, a meaningful evaluation of property prediction models needs to account explicitly for scaffold overlap between train and test data in light of generalization requirements.
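To make the idea of controlling scaffold overlap concrete, the following is a minimal sketch of a scaffold-based split. It assumes each molecule already carries a precomputed scaffold key (for instance a Bemis-Murcko scaffold string); the function name `scaffold_split`, the greedy largest-group-first assignment, and the toy scaffold labels are illustrative assumptions, not the authors' exact procedure.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split molecule indices so that no scaffold appears in both sets.

    scaffolds: list of scaffold keys, one per molecule (assumed precomputed).
    Returns (train_indices, test_indices).
    """
    # Group molecule indices by their scaffold key.
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Visit the largest scaffold groups first (ties broken by first index).
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    # Maximum number of molecules allowed in the training set.
    train_cap = len(scaffolds) - int(round(test_fraction * len(scaffolds)))

    train, test = [], []
    for group in ordered:
        # A whole scaffold group goes to exactly one side of the split.
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        else:
            test.extend(group)
    return train, test


# Hypothetical scaffold labels for ten molecules.
example_scaffolds = ["benzene"] * 4 + ["pyridine"] * 3 + ["indole"] * 2 + ["furan"]
train_idx, test_idx = scaffold_split(example_scaffolds, test_fraction=0.2)
train_keys = {example_scaffolds[i] for i in train_idx}
test_keys = {example_scaffolds[i] for i in test_idx}
```

Because whole scaffold groups are assigned to one side, a model cannot score well on the test set merely by memorizing scaffolds seen during training.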

Four example distributions fit to a random sample of 100,000 compounds used for biological screening at Novartis. Note that the distributions for some discrete descriptors, such as fr_pyridine, are not fit especially well. This is an active area for improvement.

In this paper, they aim to answer both of these questions by designing a comprehensive evaluation setup for assessing neural architectures. They also introduce an algorithm for property prediction that outperforms existing strong baselines across a range of data sets. The model has two distinctive features: (1) It operates over a hybrid representation that combines convolutions and descriptors. This design gives it flexibility in learning a task-specific encoding, while providing a strong prior with fixed descriptors. (2) It learns to construct molecular encodings by using convolutions centered on bonds instead of atoms, thereby avoiding unnecessary loops during the message passing phase of the algorithm.
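The two features can be illustrated with a deliberately simplified sketch. It uses scalar hidden states, a single fixed weight, and a ReLU in place of learned weight matrices; the function names, the residual-style update, and the summed readout are assumptions made for illustration and should not be read as the authors' implementation. The key point is that the message into directed bond v->w excludes the reverse bond w->v, so information does not bounce straight back along the bond it just traversed.

```python
from collections import defaultdict


def relu(x):
    return max(0.0, x)


def directed_bond_message_passing(init_hidden, weight=0.5, steps=2):
    """Toy bond-centered message passing.

    init_hidden: dict mapping each directed bond (v, w) to an initial scalar.
    Both directions of every bond must be present as keys.
    """
    # incoming[a] lists the directed bonds (k, a) that point into atom a.
    incoming = defaultdict(list)
    for (v, w) in init_hidden:
        incoming[w].append((v, w))

    hidden = dict(init_hidden)
    for _ in range(steps):
        new_hidden = {}
        for (v, w) in hidden:
            # Sum hidden states of bonds k->v, skipping the reverse bond w->v.
            message = sum(hidden[(k, a)] for (k, a) in incoming[v] if k != w)
            new_hidden[(v, w)] = relu(init_hidden[(v, w)] + weight * message)
        hidden = new_hidden
    return hidden


def hybrid_representation(hidden, descriptors):
    """Concatenate a (toy, scalar) learned encoding with fixed descriptors."""
    learned = sum(hidden.values())
    return [learned] + list(descriptors)


# Hypothetical three-atom chain 1-2-3, with both directions of each bond.
init = {(1, 2): 1.0, (2, 1): 1.0, (2, 3): 1.0, (3, 2): 1.0}
final_hidden = directed_bond_message_passing(init, weight=0.5, steps=2)
# Two made-up descriptor values stand in for computed molecular descriptors.
molecule_vector = hybrid_representation(final_hidden, [0.2, 3.0])
```

In this toy chain, the bonds leaving the terminal atoms (1->2 and 3->2) receive no messages, since their only incoming neighbor is their own reverse bond, which is excluded.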

Comparison of their D-MPNN with features to the best models

They extensively evaluate their model and other recently published neural architectures with over 850 experiments on 19 publicly available benchmarks from Wu et al. and Mayr et al. and on 16 proprietary data sets from Amgen, Novartis, and BASF (Badische Anilin und Soda Fabrik). Their goal is to assess whether the models’ performance on the public data sets and their relative ranking are representative of their ranking on the proprietary data sets. They demonstrate that under a scaffold split of training and testing data, the relative ranking of the models is consistent across the two classes of data sets. They also show that a scaffold-based split of the training and testing data is a good approximation of the temporal split commonly used in industry in terms of the relevant metrics. By contrast, a purely random split is a poor approximation to a temporal split, confirming the findings of Sheridan. To put the performance of current models in perspective, they report bounds on experimental error and show that there is still room for improving deep learning models to match the accuracy and reproducibility of screening results.
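For contrast with a random split, a temporal split can be sketched as follows. It assumes each compound has a hypothetical timestamp (for example the year it was measured); the function name and the toy years are illustrative and not taken from the paper.

```python
def chronological_split(timestamps, test_fraction=0.2):
    """Train on the earliest compounds and test on the most recent ones.

    timestamps: one sortable value per molecule (assumed available).
    Returns (train_indices, test_indices).
    """
    # Order molecule indices from oldest to newest.
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n_test = int(round(test_fraction * len(order)))
    cut = len(order) - n_test
    return order[:cut], order[cut:]


# Hypothetical measurement years for five compounds.
years = [2015, 2012, 2018, 2013, 2017]
early_idx, late_idx = chronological_split(years, test_fraction=0.4)
```

Every test compound here is newer than every training compound, which mimics how a deployed model is applied to compounds made after it was trained.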

Comparison of their D-MPNN against baseline models on BASF internal regression data sets on a scaffold data split (higher = better). Their D-MPNN outperforms all baselines.

Building on the diversity of their benchmark data sets, they explore the impact of molecular representation with respect to the data set characteristics. They find that a hybrid representation yields higher performance and generalizes better than either convolution-based or fingerprint-based models. They also note that on small data sets (up to 1000 training molecules) fingerprint models can outperform learned representations, which are negatively impacted by data sparsity. Beyond molecular representation issues, they observe that hyperparameter selection plays a crucial role in model performance, consistent with prior work. They show that Bayesian optimization yields a robust, automatic solution to this issue. The addition of ensembling further improves accuracy, again consistent with the literature.
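Ensembling can be sketched in its simplest form as averaging the predictions of several independently trained models. The sketch below treats each model as a plain callable; the name `ensemble_predict` and the constant toy models are assumptions for illustration, and the text above does not specify how the ensemble members are combined.

```python
def ensemble_predict(models, molecule):
    """Average the predictions of several models for one molecule.

    models: non-empty list of callables, each mapping a molecule to a float.
    """
    predictions = [model(molecule) for model in models]
    return sum(predictions) / len(predictions)


# Three stand-in "models" that return fixed values regardless of input.
toy_models = [lambda m: 1.0, lambda m: 2.0, lambda m: 3.0]
averaged = ensemble_predict(toy_models, "CCO")
```

Averaging tends to reduce the variance that comes from differences between individual training runs.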

Comparison of their D-MPNN against baseline models on the Novartis internal regression data set on a chronological data split (lower = better). Their D-MPNN outperforms all baseline models.

Their experiments show that their model achieves consistently strong out-of-the-box performance and even stronger optimized performance across a wide variety of public and proprietary data sets. Compared with all baseline models, their model achieves comparable or better performance on 12 of the 19 public data sets and on all 16 proprietary data sets. Furthermore, no single baseline model is clearly superior across the remaining 7 public data sets, and the relative performance of the baseline models often varies from data set to data set, whereas their model is consistently strong across data sets. These results indicate that their model, and learned molecular fingerprints in general, are ready to be used as a powerful tool by chemists actively working on drug discovery.

Illustration of bond-level message passing in their proposed D-MPNN. (a) Messages from the orange directed bonds are used to inform the update to the hidden state of the red directed bond. By contrast, in a traditional MPNN, messages are passed from atoms to atoms (for example, atoms 1, 3, and 4 to atom 2) rather than from bonds to bonds. (b) Similarly, a message from the green bond informs the update to the hidden state of the purple directed bond. (c) Illustration of the update function to the hidden representation of the red directed bond from diagram (a).

Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, Andrew Palmer, Volker Settels, Tommi Jaakkola, Klavs Jensen, and Regina Barzilay. Analyzing Learned Molecular Representations for Property Prediction. Journal of Chemical Information and Modeling 2019, 59 (8), 3370-3388. DOI: 10.1021/acs.jcim.9b00237