Deep Dive into Machine Learning Models for Protein Engineering

Protein redesign and engineering has become an important task in pharmaceutical research and development. Recent advances in technology have enabled efficient protein redesign by mimicking natural evolutionary mutation, selection, and amplification steps in the laboratory environment. For any given protein, the number of possible mutations is astronomical. It is impractical to synthesize all sequences or even to investigate all functionally interesting variants. Recently, there has been an increased interest in using machine learning to assist protein redesign, since prediction models can be used to virtually screen a large number of novel sequences. However, many state-of-the-art machine learning models, especially deep learning models, have not been extensively explored. Moreover, only a small selection of protein sequence descriptors has been considered. In this work, the performance of prediction models built using an array of machine learning methods and protein descriptor types, including two novel, single amino acid descriptors and one structure-based three-dimensional descriptor, is benchmarked. The predictions were evaluated on a diverse collection of public and proprietary data sets, using a variety of evaluation metrics. The results of this comparison suggest that Convolution Neural Network models built with amino acid property descriptors are the most widely applicable to the types of protein redesign problems faced in the pharmaceutical industry.

Protein redesign by directed evolution experiments has been widely used in the pharmaceutical industry to develop proteins with improved properties. For example, enzymes are often used for the large-scale synthesis of drugs and drug precursors and must be good catalysts, often under conditions different from those that would be optimal for natural enzymes. The directed evolution of proteins consists of multiple rounds of experiments that mimic natural selection. In each round, various mutated sequences are created from the best available parent sequence, and desired properties (for example, percentage substrate conversion) are measured in vitro. Desirable variants, for example those that synthesize the most product at a given set of conditions, are carried forward to future rounds of evolution. The process of directed evolution is essentially an optimization problem in the ultrahigh-dimensional protein sequence space. Identifying desirable mutations with a finite number of experiments remains challenging. Predictive modeling offers a clear opportunity to improve the efficiency of identifying desirable mutants. Machine learning models can learn relationships between sequences and properties from tested sequences in order to predict the properties of virtual sequences.
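To give a sense of the scale of this search space, the short sketch below counts how many distinct variants lie exactly k substitutions away from a parent sequence, using the standard combinatorial count C(L, k) × 19^k. The function name `count_variants` and the illustrative length of 300 residues are assumptions made for this example only.

```python
from math import comb


def count_variants(seq_len: int, n_mut: int, alphabet: int = 20) -> int:
    """Count sequences that differ from a parent at exactly n_mut positions.

    Choose which positions mutate (comb), then choose one of the
    (alphabet - 1) alternative residues at each chosen position.
    """
    return comb(seq_len, n_mut) * (alphabet - 1) ** n_mut


if __name__ == "__main__":
    # Illustrative parent length; not tied to any specific protein.
    for k in range(1, 6):
        print(f"{k} substitution(s): {count_variants(300, k):,} variants")
```

Even a handful of simultaneous substitutions produces far more candidates than can be synthesized, which is the motivation for virtual screening with a prediction model.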

Illustration of different approaches for creating numerical descriptors from input protein sequence, which is the first step in building machine learning prediction models for proteins (left flowchart). Please note that the examples described here are for a single protein sequence, and the feature construction strategy can easily be expanded to represent a full data set of aligned sequences: (top) protein feature vector or matrix created from single amino acid properties and mutation locations; (middle) protein feature vector based on ProtVec, which is a widely used embedding feature of 3-gram; and (bottom) protein feature vector created from the 3D structure.
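The first strategy in the figure (single amino acid descriptors plus mutation locations) might be sketched as follows. This is a minimal illustration, not the authors' implementation: one-hot encoding stands in for a real amino acid property table, and the helper names `one_hot_table`, `feature_matrix`, and `mutation_indicator` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot_table():
    """Per-residue descriptor table; one-hot is a stand-in for a property table."""
    return {
        aa: [1.0 if i == j else 0.0 for j in range(len(AMINO_ACIDS))]
        for i, aa in enumerate(AMINO_ACIDS)
    }


def feature_matrix(seq, table):
    """Return a d x L matrix (list of d rows) with one column per residue."""
    columns = [table[aa] for aa in seq]
    return [list(row) for row in zip(*columns)]


def mutation_indicator(parent, variant):
    """Binary vector marking positions where an aligned variant differs from the parent."""
    if len(parent) != len(variant):
        raise ValueError("sequences must be aligned to the same length")
    return [int(p != v) for p, v in zip(parent, variant)]


if __name__ == "__main__":
    parent, variant = "MKTAY", "MKSAY"  # toy sequences, not real proteins
    matrix = feature_matrix(variant, one_hot_table())
    print(len(matrix), "x", len(matrix[0]))
    print(mutation_indicator(parent, variant))
```

Swapping `one_hot_table()` for a table of d physicochemical values per residue would give a d × L matrix by the same mechanism.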

Computational approaches can therefore guide the design of future rounds of experiments to synthesize only the most promising sequences. Recently, the application of machine learning models in this field has attracted increasing attention. The application of machine learning to protein optimization requires descriptors that are appropriate for information-rich protein sequences. Some protein sequence descriptors have already been successfully applied to sequence-based protein classification and ligand docking problems. These include amino acid composition (dipeptide, tripeptide composition), predicted secondary structure and predicted solvent accessibility, histograms of torsion angle density and amino acid distance density, k-Spaced Amino Acid Pairs and Conjoint Triad, and 3D grid protein–ligand structures.

Example diagram of the two neural network models: (a) simple MLP architecture with two hidden layers and (b) simple CNN architecture with two convolution layers and two dense layers. Here it is assumed that the input protein sequence length is 300 and that the 11-dimensional PCscores descriptor is used as the single amino acid descriptor, which results in a protein feature matrix with dimension 11 × 300.
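To make the dimensions in panel (b) concrete, the sketch below traces how an 11 × 300 input would change shape through two 1D convolution layers and two dense layers. Only the input shape and the layer counts come from the caption. The filter counts, kernel sizes, dense widths, and the absence of padding or pooling are assumptions chosen for illustration, and the helpers `conv1d_out_len` and `trace_cnn_shapes` are hypothetical.

```python
def conv1d_out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def trace_cnn_shapes(
    n_channels=11,                    # descriptor dimension, from the caption
    seq_len=300,                      # sequence length, from the caption
    conv_layers=((32, 5), (64, 5)),   # (filters, kernel size): assumed values
    dense_units=(128, 1),             # hidden width and scalar output: assumed
):
    """Return a list of (layer name, output shape) pairs."""
    shapes = [("input", (n_channels, seq_len))]
    channels, length = n_channels, seq_len
    for idx, (filters, kernel) in enumerate(conv_layers, start=1):
        length = conv1d_out_len(length, kernel)
        channels = filters
        shapes.append((f"conv{idx}", (channels, length)))
    shapes.append(("flatten", (channels * length,)))
    for idx, units in enumerate(dense_units, start=1):
        shapes.append((f"dense{idx}", (units,)))
    return shapes


if __name__ == "__main__":
    for name, shape in trace_cnn_shapes():
        print(f"{name:8s} {shape}")
```

The convolutions slide along the sequence axis, treating the 11 descriptor dimensions as input channels, so each filter sees a short window of neighbouring residues.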

Despite this success for protein classification and ligand docking, the prediction of protein function, usually a continuous value measured by experiments, is quite a different task.

Machine learning guided protein engineering is still a new field. Few studies in the current literature report demonstrable success, and those that do have tended to include only a small number of proteins and models.

Both Kimothi and colleagues and Yang et al. have applied the word embedding model doc2vec to large protein sequence data sets. These methods take cues from language processing, treating protein sequences as documents and fragments of the sequences as words. Yang et al. used a single machine learning method, Gaussian process (GP) regression, to build predictive models with various input descriptors, including this type of embedding, alongside one-hot encoding, mismatch kernel, ProFET, and AAIndex properties, and compared the performance on four public data sets.
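The "fragments as words" idea can be illustrated with a small tokenizer that splits a sequence into overlapping k-mers. This is only a sketch of the tokenization step, assuming overlapping windows with k = 3. The exact splitting scheme used in the cited studies is not described here, and no embedding model is trained. The function name `kmers` is hypothetical.

```python
from collections import Counter


def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers ("words"), left to right."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    words = kmers("MKTAYIA")  # toy sequence, not a real protein
    print(words)              # ['MKT', 'KTA', 'TAY', 'AYI', 'YIA']
    print(Counter(words).most_common(2))
```

The resulting word list is the kind of input a document embedding model would consume, with each protein playing the role of one document.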

Comparison of training and test sets for each public data set: top row, density histogram of number of mutations, and bottom row, distribution of observed protein properties.

A more recent contribution by Wu et al. demonstrated that a high-throughput in silico model found improved mutants with reduced experimental effort for guided directed evolution. Several supervised learning methods were used; however, the input descriptors of the protein sequences were not discussed. Yang et al. authored a review article covering some basic concepts of using machine learning for protein engineering, and this article gives two application examples. However, the discussion and recommendations for choosing a machine-learning sequence-function model for proteins are based on a literature review, which can hamper quantitative interpretation of the results.

Boxplot comparison of rank by model (ML method + descriptor combination). For each data set and evaluation metric, the rankings are calculated for all 44 descriptor/method combinations.
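The ranking behind such a boxplot could be computed along the following lines. This is a sketch under stated assumptions: every metric is treated as higher-is-better, tied scores share the average rank, and the model names and scores in the usage example are placeholders rather than results from the study. The helpers `rank_scores` and `collect_ranks` are hypothetical.

```python
def rank_scores(scores, higher_is_better=True):
    """Map model name -> rank (1 = best); tied scores share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to cover the run of models tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for name in ordered[i:j + 1]:
            ranks[name] = average_rank
        i = j + 1
    return ranks


def collect_ranks(results):
    """results: {(data set, metric): {model: score}} -> {model: [ranks]}."""
    per_model = {}
    for scores in results.values():
        for model, rank in rank_scores(scores).items():
            per_model.setdefault(model, []).append(rank)
    return per_model


if __name__ == "__main__":
    # Placeholder scores for three hypothetical models on two data sets.
    results = {
        ("set1", "metric1"): {"model_A": 0.9, "model_B": 0.5, "model_C": 0.7},
        ("set2", "metric1"): {"model_A": 0.6, "model_B": 0.6, "model_C": 0.2},
    }
    for model, ranks in collect_ranks(results).items():
        print(model, ranks)
```

Each model's list of ranks, one per data set and metric pair, is what a boxplot of rank by model would summarize.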

Aside from the variety of descriptors to select from, predicting biological properties from protein sequence is a challenging task because of the low signal-to-noise ratio of the experimental data. The experimental measurement involves multiple steps of selection or amplification of in vitro protein samples, and the resulting values from these low-volume, high-throughput experiments can show high variability. Promising variants are therefore often reconfirmed or followed up at larger scale. Another challenge is extrapolating predictions to mutants that have not yet been seen in the existing data. Thus, it is critically important to develop accurate prediction models that perform well given the state of real-world data.

In this work, the predictive performance of 44 machine learning method and descriptor combinations is compared across a diverse collection of public and proprietary data sets using common evaluation metrics. Descriptor sets used herein include both widely used amino acid descriptors from the literature and novel, three-dimensional structure-based descriptors that have been developed specifically to address protein redesign problems. In addition, a Convolution Neural Network (CNN) model is proposed, which performed well in many examples herein.

  1. Yuting Xu, Deeptak Verma, Robert P. Sheridan, Andy Liaw, Junshui Ma, Nicholas M. Marshall, John McIntosh, Edward C. Sherer, Vladimir Svetnik, and Jennifer M. Johnston. Deep Dive into Machine Learning Models for Protein Engineering. Journal of Chemical Information and Modeling 2020, 60 (6), 2773-2790. DOI: 10.1021/acs.jcim.0c00073