PyPEF—An Integrated Framework for Data-Driven Protein Engineering

Data-driven strategies are gaining increased attention in protein engineering due to recent advances in access to large experimental databanks of proteins, next-generation sequencing (NGS), high-throughput screening (HTS) methods, and the development of artificial intelligence algorithms. However, the reliable prediction of beneficial amino acid substitutions, their combination, and their effect on functional properties remain the most significant challenges in protein engineering, which is applied to develop proteins and enzymes for biocatalysis, biomedicine, and life sciences. Here, we present a general-purpose framework (PyPEF: pythonic protein engineering framework) for performing data-driven protein engineering using machine learning methods combined with techniques from signal processing and statistical physics. PyPEF guides the identification and selection of beneficial proteins of a defined sequence space by systematically or randomly exploring the fitness of variants and by sampling random evolution pathways.

The performance of PyPEF was evaluated with respect to its predictive accuracy and throughput on four public protein and enzyme data sets using common regression models. The program efficiently predicted the fitness of protein sequences for different target properties (predictive models with coefficients of determination ranging from 0.58 to 0.92). By combining machine learning and protein evolution, PyPEF enabled the screening of proteins with various functions, reaching a screening capacity of more than 500,000 protein sequence variants within only a few minutes on a personal computer. PyPEF displayed significant accuracies on four public data sets (different proteins and properties) and underlined the potential of integrating data-driven technologies to cover two different philosophies: either predicting the fitness of the variants to the highest accuracy while accounting for epistatic effects, or capturing the general trend of introduced mutations on the fitness in directed protein evolution campaigns. In essence, PyPEF can provide a powerful solution to current sequence exploration and combinatorial problems faced in protein engineering through exhaustive in silico screening of the sequence space.

Schematic workflow of PyPEF: the workflow starts with splitting the input data into learning (green) and validation (yellow) sets, continues with featurization (encoding), hyperparameter tuning for regression models (parameter tuning), and validation (model validation), and ends with fitness prediction after training on either the learning set or all data (blue). The gray boxes indicate the main steps of the workflow, whereas light green boxes represent LOOCV (for PLS regression) or k-fold CV, in which the learning set is reduced by one sample or one fold, respectively. Dot connections (●) represent branching paths for different running routines (for feature transformation and regression).

Protein engineering is essential to improve the desired function of proteins outside their in vivo constraints, as natural proteins often do not perform as desired in applications in biocatalysis, biomedicine, and life sciences. To expand evolutionarily developed protein functions, for example, activity, solvent resistance, or thermostability, rational design or directed evolution methods are usually applied. Directed evolution involves iterative rounds of random mutagenesis and variant selection, speeding up natural evolution by a factor of millions and thus facilitating the adaptation of proteins toward the desired characteristics. Although huge diversity can be generated with the established mutagenesis methods, the screening throughput still limits the efficiency of the directed evolution of proteins.

Performance (R²) for validation set (VS) predictions after LOOCV or fivefold CV-based (CV5) parameter tuning for the tested regression options (PLS, RF, SVR, and MLP) using FFT and raw sequence encoding (AAindex). For tuning, the model parameter grids were iterated on the learning set using the Scikit-learn method GridSearchCV with five data splits.

While random mutagenesis-driven approaches such as directed evolution profit from HTS of protein variants, even ultra-HTS methods are so far limited to 10⁶ to 10⁷ screened variants per hour, and establishing efficient screens generally remains a demanding and time-consuming task. Because even a single round of directed evolution, when screening only a few thousand variants, can yield more than 10 beneficial amino acid positions, the efficient combination of multiple positions revealing positive epistatic effects represents a general challenge for protein engineers. In summary, protein engineers have to admit that they will not be able to sample the hypothetically accessible sequence space of enzyme variants, and more intelligent strategies are indispensable for efficient directed enzyme evolution.

Predicted vs measured fitness values using PLS regression and FFT-transformed encodings for model construction for the case studies A: Bacteriorhodopsin absorption wavelength data set (peak absorption wavelengths λmax in nm; R² = 0.92, RMSE = 12.31, NRMSE = 0.27, Pearson’s r = 0.97, and Spearman’s ρ = 0.93), B: ANEH enantioselectivity data set [enantioselectivity factors (E-values); R² = 0.72, RMSE = 14.48, NRMSE = 0.52, r = 0.86, and ρ = 0.89], C: NOD enantioselectivity data set {enantiomeric excess [shift in percent times hundred against variant 32K/46F/56L/97L (KFLL)]; R² = 0.58, RMSE = 0.27, NRMSE = 0.65, r = 0.77, and ρ = 0.73}, D: P450 data set (T50 values in °C; R² = 0.91, RMSE = 1.91, NRMSE = 0.29, r = 0.96, and ρ = 0.95) (no outliers were removed from the data sets).

In contrast, rational engineering methods require structural knowledge, that is, high-quality protein structures and further structural information collected from computational analyses for efficient sequence combination. While protein sequences are increasingly deciphered and deposited through modern sequencing technologies, protein structure determination lags behind because X-ray crystallography and nuclear magnetic resonance (NMR) experiments are laborious. When structural data are available, rational engineering has been shown to significantly reduce the library size (e.g., smart libraries) by identifying beneficial mutations, consequently lowering the time and effort needed for screening experiments. The combination of random and focused directed evolution approaches (such as KnowVolution, CAST/ISM, 3DM, or HotSpot Wizard) has been successfully used to redesign proteins. Such methods balance throughput and molecular understanding, allowing us to design smart libraries.

However, current rational protein design approaches—relying on structure-based in silico methods—are often computationally expensive and require high computing effort for investigating only a few variants.

Moreover, analyzing and interpreting structural data and simulation trajectories often demand deep molecular understanding of protein functions and dynamics. Nevertheless, state-of-the-art protein engineering approaches examine only a minuscule fraction of the potential diversity of proteins that could theoretically be explored, even when using methods that allow consistent high-throughput screening of sequences for position identification and combination, since the number of possible sequences grows as 20ⁿ for a protein of n amino acids. The efficient combination of beneficial protein substitutions is particularly challenging since epistatic effects of amino acids habitually reveal complex networks of interactions across the protein structure. Likewise, the coexistence of identified beneficial substitutions can lead to mutual cancellation of their effects, implying an exhaustive screening effort or a reliable molecular understanding for identifying combinations that maintain or even amplify each other’s function. Such dependencies were observed in the case of the local fitness landscape of the green fluorescent protein (GFP) from Aequorea victoria, revealing the trend of no or weak epistasis for accumulating GFP mutations. On the other hand, studies of diverse evolutionary trajectories have shown that half of mutational effects are difficult to predict, as estimating their effects on the starting background is a challenging task due to negative or neutral epistasis.
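To make the scale of this combinatorial growth concrete, the following minimal sketch computes the number of sequences of length n over the 20 canonical amino acids and the time an exhaustive experimental screen would take. The function names and the default throughput of 10**7 variants per hour are illustrative assumptions and are not part of PyPEF.

```python
def sequence_space_size(length: int, alphabet_size: int = 20) -> int:
    """Number of distinct sequences of a given length (20**n for proteins)."""
    return alphabet_size ** length


def years_to_screen(n_variants: int, variants_per_hour: float = 1e7) -> float:
    """Duration of an exhaustive screen at a fixed (assumed) hourly throughput."""
    hours = n_variants / variants_per_hour
    return hours / (24 * 365)


# Even short peptides quickly exceed what any screen can cover.
for n in (5, 10, 20):
    size = sequence_space_size(n)
    print(f"n={n:2d}: {size:.3e} sequences, {years_to_screen(size):.3e} years")
```

Already for n = 10 the space contains 20**10 = 10,240,000,000,000 sequences, which is why only a small, well-chosen subset can ever be tested experimentally.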

Recently, data-driven methods have been increasingly integrated in protein engineering campaigns as advances in experimental techniques such as NGS and HTS enabled the collection of vast protein data sets. Such approaches use generated sequence-fitness relationships to train machine learning models and perform in silico screening and variant selection of a defined or combinatorial sequence space. Furthermore, it has been noted that the computation, the sequencing of variants, and the increased burden of synthesizing predicted variants are the only additional investments to make when following such an approach.

Standardized AAindex values used for encoding in the top-ranking models identified for data sets A–D using FFT encoding and PLS regression. The given R² of each index represents the model performance (coefficient of determination of measured vs predicted fitness) obtained using the respective index for sequence encoding.

Assisting the established approaches in protein engineering, the generated predictive models proved to be valuable guidance for navigating through the enormous variant sequence space, especially as most mutations show deleterious effects and thus fitness landscape paths tend to end downhill, that is, with inactive proteins. Furthermore, trained machine learning models can be employed for the prediction and efficient recombination of identified key positions. For this purpose, machine learning can significantly improve the selection of recombinants or sequence variants. Machine learning benefits from its independence from structural information and has successfully been used in several studies aiming for the estimation of sequence-fitness landscapes of proteins for diverse properties such as enzymatic activity, productivity, solubility, selectivity, (thermo)stability, fluorescence, protein design, protein interaction and (membrane) localization, signal peptide identification, ligand binding, peptide polymer or membrane binding, peptide antimicrobial activity prediction, and peptide or substrate design. In addition, a multiple-task benchmark tool that helps to evaluate semi-supervised machine learning tasks was described.

Correlation network of the 32 clusters obtained after applying two-step clustering (t = 0.93 and t2 = 0.97) on the AAindex database. The correlation is represented by the edges and weighted accordingly; for visualization purposes, only correlations with an absolute Pearson correlation coefficient greater than 0.6 are shown. The nodes correspond to the clusters, their size corresponds to the number of contained AAindices, and their color corresponds to one of the coarse-grained physicochemical properties: polarity (green), size (blue), composition (yellow), and secondary structure (orange).

Universal, computationally nonintensive, and feasible in silico screening methods can help to explore the diversity of protein sequences, estimate effects of single and simultaneous substitutions on the protein function, and thereby output “smart libraries” by identifying beneficial substitution positions. Such predictive methods allow evolutionary approaches to be more efficient and might even lead to the discovery of highly optimized proteins which show a function close to a global optimum. To date, many machine learning procedures used in protein engineering are based on the conversion of amino acid sequences to vectors of numerical values. In supervised learning, that is, when learning on sequence-fitness data sets, these numerical vectors are used as input features to train models that minimize the computed loss when predicting the protein’s associated labels (fitness). Frequently, these vectors were built either by defining the absence or presence of substitutions in a protein sequence as 0 or 1 or by using one-hot encoding binary vectors. Additional approaches used linear combinations of position-weighted effects of substitutions in order to predict the protein function. Furthermore, nonlinear methods were proposed to reflect epistasis. In addition, approaches were reported that make use of evolutionary amino acid rate variabilities or amino acid descriptors to encode the protein sequences as numerical feature vectors. For example, the latter technique was applied for quantitative sequence-activity modeling (QSAM) or protein engineering while applying positional weight vectors. A structure-independent machine learning approach for mutant library screening, termed innov’SAR, appeared recently.
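The two binary encodings mentioned above can be sketched as follows. This is a minimal illustration rather than code from any of the cited approaches: the helper names, the alphabet ordering, and the substitution labels such as "A2V" are assumptions made for the example.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def one_hot(sequence: str) -> np.ndarray:
    """Encode a sequence as a flattened (length x 20) binary matrix."""
    matrix = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for position, residue in enumerate(sequence):
        matrix[position, AMINO_ACIDS.index(residue)] = 1.0
    return matrix.ravel()


def substitution_presence(variant: set, all_substitutions: list) -> np.ndarray:
    """Encode a variant as 0/1 flags, one per substitution seen in the data set."""
    return np.array([1.0 if s in variant else 0.0 for s in all_substitutions])


# One feature per (position, residue) pair: 3 positions x 20 residues.
features = one_hot("ACD")
print(features.shape, int(features.sum()))

# Hypothetical data set with three known substitutions; the variant carries two.
known = ["A2V", "K5R", "T9S"]
print(substitution_presence({"A2V", "T9S"}, known))
```

One-hot vectors grow with the sequence length, whereas presence/absence vectors grow with the number of distinct substitutions observed, which keeps them compact but prevents predictions for substitutions that never occur in the training data.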

The innov’SAR pipeline applies featurization by Fourier-transforming numerical indices, which represent physicochemical and biochemical properties for each amino acid, taken from the amino acid index (AAindex) database. After fast Fourier transform (FFT) processing, a spectral form of the protein is generated and used as the input for subsequent statistical modeling. Furthermore, partial least squares (PLS) regression is used to construct predictive models for a broad range of proteins and properties.
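The featurization step can be sketched as follows. This is a minimal illustration under stated assumptions and not the innov’SAR or PyPEF implementation: the Kyte-Doolittle hydropathy scale stands in for a single AAindex entry, and the mean subtraction and amplitude normalization are choices made for this example only.

```python
import numpy as np

# Kyte-Doolittle hydropathy values, used here as a stand-in for one AAindex entry.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def index_encode(sequence: str) -> np.ndarray:
    """Map each residue to its numerical index value (raw encoding)."""
    return np.array([HYDROPATHY[residue] for residue in sequence], dtype=float)


def fft_spectrum(sequence: str) -> np.ndarray:
    """Amplitude spectrum of the index-encoded sequence."""
    signal = index_encode(sequence)
    signal = signal - signal.mean()          # assumption: remove the constant offset
    return np.abs(np.fft.rfft(signal)) / len(signal)


wild_type = fft_spectrum("MKTAYIAKQRQI")
variant = fft_spectrum("MKTAYIAKQRQR")       # single substitution at the last position

# A single substitution changes one entry of the raw encoding
# but spreads over the frequency components of the spectrum.
print(np.round(wild_type, 3))
print(np.round(variant, 3))
```

The resulting spectra would then serve as feature vectors for a regression model such as PLS.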

Based on this promising processing pipeline, we present a general-purpose data-driven protein engineering framework termed PyPEF that enables us to easily construct predictive models and explore the sequence fitness landscape by either predicting user queries or performing random evolution walks. PyPEF sequence-based in silico screening proved to perform feature engineering, model training, and prediction for millions of recombinant variants with high accuracy and a system-dependent throughput in the time frame of minutes. We also augmented the model options for regression, including the promising machine learning methods support vector regression (SVR) and random forest (RF) regression, similar to those reported by Xu et al., as well as the quintessential deep learning technique, a feedforward deep network, that is, a multilayer perceptron (MLP) trained via backpropagation, and introduced parameter tuning via k-fold cross-validation (instead of leave-one-out cross-validation) using the available functions of the Scikit-learn package. Furthermore, to compare raw (in the following referred to as “AAindex encoding”) and signal-processed sequence encoding, we implemented the option to train on AAindex-encoded amino acid sequences instead of on FFT spectra and benchmarked all regression options and encodings for selected validation sets.

Example of in silico-directed evolution that was performed on data set B [ANEH enantioselectivity (predicted fitness: E-value)] for five evolution pathways and 15 total mutation trial steps (up to nine steps accepted) using FFT for transformation of encoded features while allowing predictions in regions beyond those on which such models have been trained, a high-risk prediction method. While the FFT allows predictions of the unknown design space due to far-reaching substitution changes in spectral amplitudes in the feature space, higher trust regions for design would be defined by positions that were already identified, that is, included in the learning set.
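A random evolution walk of this kind can be sketched as follows. This is a simplified illustration under loud assumptions: the acceptance rule is greedy (a trial substitution is kept only if the predicted fitness increases), which may differ from the criterion used in PyPEF, and `toy_fitness` is a hypothetical stand-in for a trained regression model.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_fitness(sequence: str) -> float:
    """Hypothetical predictor: counts hydrophobic residues (not a real model)."""
    return float(sum(residue in "AILMFVW" for residue in sequence))


def random_walk(start: str, predict, n_trials: int = 15, seed: int = 0) -> list:
    """Propose random single substitutions and keep those that improve fitness."""
    rng = random.Random(seed)
    sequence, fitness = start, predict(start)
    path = [(sequence, fitness)]
    for _ in range(n_trials):
        position = rng.randrange(len(sequence))
        new_residue = rng.choice([a for a in AMINO_ACIDS if a != sequence[position]])
        candidate = sequence[:position] + new_residue + sequence[position + 1:]
        candidate_fitness = predict(candidate)
        if candidate_fitness > fitness:  # assumption: greedy acceptance
            sequence, fitness = candidate, candidate_fitness
            path.append((sequence, fitness))
    return path


# Five independent pathways with 15 mutation trials each.
paths = [random_walk("MKTAYIAKQR", toy_fitness, seed=s) for s in range(5)]
for path in paths:
    print(" -> ".join(f"{seq} ({fit:.0f})" for seq, fit in path))
```

Because each accepted step depends on the predictor alone, pathways that drift far from the learning set inherit the extrapolation risk noted in the caption.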

AAindex encoding here refers to a sequence digitally encoded according to an AAindex without subsequent FFT processing, rather than a truly binary-encoded sequence. This work facilitates access to a machine learning framework for protein engineering even for nonprogrammers. The developed framework can be used to perform high-throughput in silico screening in protein evolution campaigns, facilitates the selection phase for identifying beneficial variants from the recombinant sequence space, and was evaluated in terms of efficiency and accuracy using four public (one protein and three enzymes) data sets as case studies.

  1. PyPEF—An Integrated Framework for Data-Driven Protein Engineering Niklas E. Siedhoff, Alexander-Maurice Illig, Ulrich Schwaneberg, and Mehdi D. Davari Journal of Chemical Information and Modeling 2021 61 (7), 3463-3476 DOI: 10.1021/acs.jcim.1c00099