Protein Interaction Network Reconstruction with a Structural Gated Attention Deep Model by Incorporating Network Structure Information
Protein–protein interactions (PPIs) provide a physical basis of molecular communications for a wide range of biological processes in living cells. Establishing the PPI network has become a fundamental task for a better understanding of biological events and disease pathogenesis. Although many machine learning algorithms have been employed to predict PPIs, models trained with only protein sequence information as features suffer from low robustness and prediction accuracy. In this study, a new deep-learning-based framework named the Structural Gated Attention Deep (SGAD) model was proposed to improve the performance of PPI network reconstruction (PINR). The improved predictive performance was achieved by combining multiple protein sequence descriptors with the topological features and information flow of the PPI network, and by applying a gating mechanism to improve robustness to noise. On 11 independent test data sets and one combined data set, SGAD yielded area under the curve values of approximately 0.83–0.93, outperforming other models. Furthermore, the SGAD ensemble can learn more characteristic information on protein pairs through a two-layer neural network, serving as a powerful tool in the exploration of PPI biological space.
Protein–protein interactions (PPIs) represent a delicate but universal mechanism in diverse biological pathways, mediating multiple biological functions such as signal transduction, post-translational modification, apoptosis, cell growth, and many other cellular events that regulate tissue regeneration and cell proliferation. PPIs, as an important category of drug targets, are also widely embraced by the pharmaceutical industry and scientific community. Many aberrant PPIs have been identified to be involved in disease and have become therapeutic targets in cancers, infectious diseases, neurodegenerative diseases, and tumorigenesis, in which aberrant PPIs, loss of PPIs, or unexpected stabilization of a protein complex occurs spatiotemporally.
To date, at least 31 PPI modulators have entered the stage of clinical trials. Benefits of targeting PPIs include the following:
(1) PPIs offer a greatly enlarged biological space, with an estimated 130 000–650 000 targets. Recently, it has been proposed that targeting protein–protein interactions involving post-translational modification (PTM) can serve as an efficient strategy to extend biological space.
(2) Targeting PPIs is an efficient strategy to overcome drug resistance issues, especially in the design and development of anti-infective drugs.
(3) Studying PPI networks provides an opportunity to uncover candidates for drug repurposing. Recently, 29 approved drugs were identified for the treatment of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) by examining the SARS-CoV-2 protein interaction map and drug screening. As a result, revealing unknown PPIs not only helps studies of basic biology but also can be used to discover novel therapeutic targets for drug design and discovery.
Therefore, tremendous efforts have been made to build PPI networks to understand how different proteins interact with each other and to identify drug targets by analyzing the network features. However, characterizing PPIs in living cells with experimental methods is time-consuming and expensive, making computational prediction of PPIs an attractive approach. A diverse set of computational models has been proposed to predict PPI networks, among which deep learning methods utilizing amino acid sequence information have gained a lot of interest. The computational models can be roughly divided into four categories: gene domain fusion-based models, phylogenetic similarity and conservation-based methods, literature knowledge-based models, and gene coexpression-based models. However, when prior knowledge of proteins (e.g., crystal structural information) is not available, the above methods are hardly effective.
Schematic diagram for the generation of conjoint triads and local descriptors. (A) Combined triad technique. The numerical code strings of consecutive amino acids are “677”, “725”, “572”, “255”, “772”, “257”, “725”, “553”, “53-”, “-67”. Thus, its triad types are F677, F725, F572, F255, F772, F257, and F553. F-67 represents a triplet with the previous amino acid residue; F53- follows a similar pattern. (B) Construction of 10 descriptor regions (1–10) for a hypothetical protein sequence. The regions 1–4 and 5–6 are, respectively, generated by dividing the whole sequence into four equal regions and two equal regions. The regions 7–10 stand for the central 50%, the first 75%, the final 75%, and the central 75% of the entire sequence, respectively.
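The two encodings in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the seven-class amino acid grouping is the one commonly used with conjoint triads and is an assumption here, since the text does not list it; the function names `conjoint_triad` and `local_regions` are hypothetical; triads are normalized by simple frequency; and the boundary triads shown in the caption ("53-", "-67") are ignored.

```python
from itertools import product

# Assumed 7-class grouping often used for conjoint triads; the source text
# does not list the classes, so this mapping is illustrative only.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: str(i + 1) for i, group in enumerate(CLASSES) for aa in group}
TRIAD_KEYS = ["".join(t) for t in product("1234567", repeat=3)]  # 7^3 = 343 types


def conjoint_triad(seq):
    """Return 343 triad frequencies for a sequence of standard residues.

    Nonstandard residue letters raise KeyError in this sketch.
    """
    codes = "".join(AA_TO_CLASS[aa] for aa in seq)
    counts = dict.fromkeys(TRIAD_KEYS, 0)
    for i in range(len(codes) - 2):
        counts[codes[i:i + 3]] += 1
    total = max(len(codes) - 2, 1)  # avoid division by zero for short inputs
    return [counts[k] / total for k in TRIAD_KEYS]


def local_regions(seq):
    """Split a sequence into the 10 descriptor regions described in panel B."""
    n = len(seq)
    q, h, e = n // 4, n // 2, n // 8
    return [
        seq[:q], seq[q:2 * q], seq[2 * q:3 * q], seq[3 * q:],  # regions 1-4: quarters
        seq[:h], seq[h:],                                      # regions 5-6: halves
        seq[q:n - q],                                          # region 7: central 50%
        seq[:n - q],                                           # region 8: first 75%
        seq[q:],                                               # region 9: final 75%
        seq[e:n - e],                                          # region 10: central 75%
    ]
```

Each region returned by `local_regions` would then be summarized by its own descriptor values, so that both local and global sequence patterns contribute to the final feature vector.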
Among the predictive models, those using protein amino acid sequence information have drawn much attention. In addition, multidescriptor fusion techniques were proposed to effectively represent the biological information on amino acid sequences. Hashemifar proposed a sequence-based deep learning prediction method, DPPI, which used a convolutional neural network and random projection module to predict PPIs. Pan proposed a new IDA-RF model, which used conjoint triads to represent local order features and employed random forests to predict PPIs. Tian used pseudo-amino-acid composition (PseAAC), auto-covariance descriptors, and group weight coding to obtain protein sequence information and employed a support vector machine as the classifier. Li used auto-covariance descriptors and a conjoint triad method to generate feature vectors to predict PPIs based on deep learning.
Architecture of the SGAD model for protein interaction network reconstruction. First, the four encoding methods in the box (AC, LD, CT, and PseAAC) encode information from protein pairs, which is reshaped into a two-dimensional array. Then, a gated CNN framework is used to train the model, which includes a first convolution, a rectified linear activation function (ReLU), a second convolution, and finally a fully connected layer. The output of this model is the predicted probability of each class.
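The caption does not specify how the gate enters the convolution. One common form, assumed here purely for illustration, is a gated-linear-unit style layer in which one convolution produces features and a second convolution, passed through a sigmoid, decides how much of each feature to let through. The sketch below uses one-dimensional inputs and hypothetical function names; it is not the SGAD layer itself.

```python
import math


def conv1d(x, kernel, bias=0.0):
    """Valid 1D cross-correlation of a list x with a list kernel."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_conv1d(x, w_feat, w_gate, b_feat=0.0, b_gate=0.0):
    """Assumed GLU-style gating: features scaled elementwise by a sigmoid gate."""
    feat = conv1d(x, w_feat, b_feat)
    gate = [sigmoid(g) for g in conv1d(x, w_gate, b_gate)]
    return [f * g for f, g in zip(feat, gate)]
```

With a zero gate kernel, the gate equals 0.5 everywhere and the layer simply halves the features; learned gate weights would instead suppress positions that look like noise.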
However, these sequence-based models are incapable of incorporating protein network structure information. As a result, the predictive models generalize poorly and do not transfer well to unseen data sets. A simple but powerful technique in systems biology is to represent a biological system as a network instead of disconnected components. Biomolecular studies have benefited greatly from network representations and related algorithms, especially graph or network embedding approaches. The main idea of graph embedding is to learn, through projection operations, a low-dimensional vector for each node (also known as a latent representation) from which the original network can be reconstructed; this assumes that latent vectors of nodes can efficiently represent the original network structure.
However, the underlying structure of the network is highly nonlinear, making it difficult to capture the nonlinearity while maintaining the local and global network information. In the past few decades, many methods based on shallow models have been proposed to capture highly nonlinear network structures, such as Isomap, Laplacian eigenmaps (LE), and large-scale information network embedding (LINE). However, due to the limited representation capability of shallow models, the results are not satisfactory. Here, shallow models denote a neural network with only one hidden layer, whereas several hidden layers are present in a deep neural network model.
Feature representation is a bottleneck not only for PPI prediction but also for medical images, genome and protein sequences, and so on. A deep neural network is well suited to extracting features and learning hidden patterns from complex data, which in turn generates a useful representation for various sorts of data sets. For example, Baldi and his colleagues have successfully applied various deep convolutional neural networks to predict protein secondary structure and protein contact maps, with accuracies of 84% and 30%, respectively. Xuan proposed a method based on information flow propagation and a convolutional neural network to predict disease-related lncRNA, achieving a better performance than several existing methods. Another deep learning model trained on a large data set of 45 000 images has achieved performance similar to that of a certified screening radiologist in the detection of lesions in breast mammograms.
Inspired by the recent success of deep learning in computer vision and natural language processing, this study presents a new deep model to learn the vertex representations of PPI networks, aiming to obtain accurate representations of each protein and of the corresponding network that capture their high nonlinearity. Moreover, the SDNE (Structural Deep Network Embedding) model in this study applies the first- and second-order proximities to preserve the local and global network topology, respectively. Because of its multilayer architecture, SDNE automatically extracts high-level abstractions from data sets.
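As a rough illustration of the two SDNE terms, the sketch below follows the loss functions of the original SDNE formulation: the first-order term pulls the embeddings of linked nodes together, and the second-order term penalizes errors in reconstructing each node's adjacency row, with a larger weight on observed links. The function names, the plain-list data layout, and the default `beta` value are assumptions made for this example; how SGAD combines or weights these terms is not stated in this section.

```python
def first_order_loss(adj, emb):
    """Sum over node pairs of a_ij * squared distance between embeddings."""
    n = len(adj)
    return sum(
        adj[i][j] * sum((a - b) ** 2 for a, b in zip(emb[i], emb[j]))
        for i in range(n) for j in range(n)
    )


def second_order_loss(adj, recon, beta=5.0):
    """Weighted squared error between adjacency rows and their reconstructions.

    Errors on observed links (a_ij != 0) are scaled by beta > 1 so that the
    sparse nonzero entries are not swamped by the many zeros.
    """
    total = 0.0
    for row, rec in zip(adj, recon):
        for a, r in zip(row, rec):
            weight = beta if a != 0 else 1.0
            total += ((r - a) * weight) ** 2
    return total
```

In a full model, `emb` would come from the encoder's middle layer and `recon` from the decoder's output, and the two terms would be summed with a regularizer and minimized jointly.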
Apart from feature representation and extraction, most computational methods were trained on noisy data sets, leading to fragile predictive models. This suggests that a robust model is possible only when measures are taken to deal with the non-negligible noise in data sets. In natural language processing, Krause ignored the long-distance information between two events when predicting whether they refer to the same thing or not; Fang used a decomposable attention mechanism-based neural network model to extract important information on two events, still ignoring the noise between events. Likewise, it is not enough to reconstruct a protein interaction network by considering only protein interaction information while ignoring the noisiness of the input data. To resolve this problem, this study proposes a Structural Gated Attention Deep (SGAD) mechanism (Algorithm 1) to reconstruct protein interaction networks:
(1) Multiple descriptors are utilized to represent the features of proteins, and a convolutional neural network is used to extract the features automatically.
(2) The first- and second-order proximities are used to capture the nonlinear network structure.
(3) The gating attention mechanism is used to control the information flow of network structures and protein representations by incorporating linear layers, which in turn makes the model more robust. Twelve external data sets are used to measure the performance and further validate the generalizability. The results show that SGAD reconstructs the original network satisfactorily, with better accuracy and better scores on the other evaluation criteria than the baselines.
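Point (3) can be pictured with a small gated fusion sketch. The exact form of the SGAD gate is not given in this section, so the version below is an assumption: two linear layers map the sequence-derived and network-derived vectors to a gate in (0, 1), and the gate then interpolates between the two sources elementwise. All names (`gated_fusion`, `w_seq`, `w_net`) are hypothetical.

```python
import math


def linear(weights, x, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gated_fusion(seq_feat, net_feat, w_seq, w_net, bias):
    """Assumed gate: g = sigmoid(W_seq s + W_net n + b); out = g*s + (1-g)*n."""
    zeros = [0.0] * len(bias)
    pre = [a + b for a, b in zip(linear(w_seq, seq_feat, bias),
                                 linear(w_net, net_feat, zeros))]
    gate = [1.0 / (1.0 + math.exp(-z)) for z in pre]
    return [g * s + (1.0 - g) * n
            for g, s, n in zip(gate, seq_feat, net_feat)]
```

Because the gate is learned per dimension, a noisy component of one source can be down-weighted in favor of the other, which is the intuition behind the robustness claim.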
Fei Zhu, Feifei Li, Lei Deng, Fanwang Meng, and Zhongjie Liang. Protein Interaction Network Reconstruction with a Structural Gated Attention Deep Model by Incorporating Network Structure Information. Journal of Chemical Information and Modeling 2022, 62 (2), 258–273. DOI: 10.1021/acs.jcim.1c00982