Deep learning and transfer learning for cancer prediction based on gene expression data

kübra:)
Jul 4, 2022

Transcriptomic technologies (microarrays, RNA sequencing) provide massive molecular-scale information about patients. Analyzing these gene expression profiles is one of the main challenges in developing new tools for personalized medicine, especially in oncology. Such analysis may support physicians in diagnosing cancer, classifying tumors, estimating prognosis, and making individualized treatment decisions. Complex pathologies such as cancers disrupt gene expression, leaving signatures that can contain valuable information. The problem is that these signatures are complex non-linear combinations of genes hidden among the many measured expression values. Machine learning is the main approach for identifying these signatures and building models that make predictions from gene expression profiles. Many classical machine learning methods have been adapted and tested in the transcriptomic context, including linear and quadratic models, support vector machines (SVM), random forests (RF), and boosting. Although these methods have produced promising results, building models that are accurate and robust enough for practical medical application is still an open problem. The main difficulties are the high dimensionality of gene expression data, the small number of training examples, which leads to overfitting during training, and the lack of robustness of the results.

Over the last ten years, deep learning has become one of the most important breakthroughs in machine learning. Its primary application domains are image recognition and speech recognition, where it has outperformed other machine learning techniques. It is also promising in many other domains, particularly the biomedical sciences. Deep learning techniques have recently drawn attention in bioinformatics because they automatically capture nonlinear relationships in their input and allow flexible model design. However, deep learning methods are still very new in gene expression analysis, and few works have been published compared to other machine learning methods.

Unlike image or text data, gene expression data has no structure that a neural network architecture can exploit. The main architecture used for prediction from gene expression data is therefore the multilayer perceptron (MLP). Fakoor et al. proposed one of the first works applying an MLP to gene expression data to predict the presence of cancer or the cancer subtype. Several works apply MLPs to other types of prediction or use variants to improve performance. Lai et al. use a multi-modal MLP to integrate clinical data with gene expression and predict the prognosis for non-small cell lung cancer. Chen et al. add a clustering loss on the last hidden layer to the classical cross-entropy loss used in the MLP. Its purpose is to maximize the margin between classes in the latent space defined by the hidden layers and thereby improve cancer prediction. In DeepSurv, Katzman et al. replace the classical output layer of an MLP with a Cox proportional hazards layer to model interactions between the expression profile and treatment effectiveness. Bourgeais et al. use an MLP whose architecture mimics the Gene Ontology to detect cancers and produce an explanation of the prediction.
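
To make the MLP setting concrete, here is a minimal NumPy sketch of a forward pass and cross-entropy loss on a batch of expression profiles. It is not the implementation of any of the cited works: the layer sizes, the ReLU activation, the initialization, and the random stand-in data are all assumptions chosen for illustration.

```python
import numpy as np

def init_mlp(layer_sizes, rng):
    """Create (weights, bias) pairs for a fully connected network."""
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # He-style scaling, a common (assumed) choice for ReLU layers.
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
        params.append((W, np.zeros(n_out)))
    return params

def forward(params, X):
    """Return class probabilities for a batch of expression profiles."""
    h = X
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)          # ReLU hidden layers
    W, b = params[-1]
    logits = h @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)          # softmax

def cross_entropy(probs, y):
    """Mean negative log-likelihood of the true class labels."""
    return float(-np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12)))

rng = np.random.default_rng(0)
n_samples, n_genes, n_classes = 8, 1000, 3      # toy sizes, purely illustrative
X = rng.normal(size=(n_samples, n_genes))       # stand-in for normalized expression values
y = rng.integers(0, n_classes, size=n_samples)  # stand-in for tumor class labels

params = init_mlp([n_genes, 64, 32, n_classes], rng)
probs = forward(params, X)
loss = cross_entropy(probs, y)
```

Note that the first weight matrix alone has `n_genes * 64` entries, far more than the number of samples, which is the dimensionality problem mentioned earlier.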

Other architectures have been tested, such as convolutional neural networks (CNN) and graph neural networks (GNN). However, they run into the lack of structure in gene expression data. Mostafa et al. use CNNs to predict tumor type from 1D or 2D expression profiles. The groups of genes analyzed in each convolution window depend on the arrangement of the genes in the 1D vector or the 2D matrix. Unlike in image data, this arrangement is arbitrary and carries no specific information. Some works have tried to integrate external information to give the gene expression a structure that a CNN or GNN can exploit. For example, co-expression networks or protein-protein interaction networks have been used to represent the gene expression profile as a graph, and the predictions are computed by a graph neural network. However, there is no consensus that integrating this structural information helps the network extract relevant expression patterns and improves prediction performance.
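
The dependence on gene arrangement can be illustrated with a small NumPy sketch. The profile, the filter, and the dense weights below are random or hypothetical values, not taken from any cited model. Shuffling the gene order changes what a convolution window sees, while a dense layer gives the same output once its weight rows are shuffled the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 12
profile = rng.normal(size=n_genes)      # toy expression profile
kernel = np.array([0.5, -1.0, 0.5])     # one hypothetical convolution filter

# Same measurements, different (equally arbitrary) gene ordering.
perm = rng.permutation(n_genes)
shuffled = profile[perm]

# Each output value mixes three *neighboring* genes, so it depends on the ordering.
conv_original = np.convolve(profile, kernel, mode="valid")
conv_shuffled = np.convolve(shuffled, kernel, mode="valid")

# A dense layer has one weight row per gene; permuting the rows with the genes
# leaves the output unchanged, so the ordering carries no meaning for an MLP.
W = rng.normal(size=(n_genes, 4))
dense_original = profile @ W
dense_shuffled = shuffled @ W[perm]

print(np.allclose(conv_original, conv_shuffled))    # convolution outputs differ
print(np.allclose(dense_original, dense_shuffled))  # dense outputs match
```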

Transfer learning is often proposed to tackle the problems of small training sets and the high dimensionality of gene expression data. The term refers to a set of techniques for transferring information from one model (the source) to another (the target). Transfer learning is widely used in image analysis and natural language processing, where some common visual or textual patterns are helpful for any classification task. These patterns can be extracted from a large source dataset and transferred to the target dataset for the classification task.

There are different transfer learning scenarios. One can distinguish supervised from unsupervised transfer learning. In supervised transfer learning, the sample labels are used to build the source model, under the assumption that the source classification task is close to the target classification task. Unsupervised transfer learning takes advantage of a large set of unlabeled source data to learn a new representation of the data. It generally consists of learning an encoder that projects the data into a low-dimensional space. Here the assumption is that the representation learned from the source data will be helpful for the target classification task. In both scenarios, after the source model has been trained, its parameters are copied to the target model. Note that it is possible to retrain the whole target model on the target data; in this case, transfer learning serves as an initialization of the target model.
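
The parameter-copy step can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary-based model layout, the `transfer` helper, the `trainable` flag, and the layer sizes are all hypothetical, and the source model is only pretended to be trained.

```python
import copy
import numpy as np

def init_mlp(layer_sizes, rng):
    """One dict per layer, holding a weight matrix and a bias vector."""
    return [
        {"W": rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)),
         "b": np.zeros(n_out)}
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]

def transfer(source, n_target_classes, rng, freeze_hidden=True):
    """Build a target model from a source model (hypothetical helper).

    Hidden layers are copied; the output layer is re-initialized because
    the target task may have a different number of classes.
    """
    target = copy.deepcopy(source[:-1])           # copy, do not share, parameters
    for layer in target:
        layer["trainable"] = not freeze_hidden    # frozen layers stay fixed during training
    n_hidden = source[-1]["W"].shape[0]
    head = {
        "W": rng.normal(0.0, np.sqrt(2.0 / n_hidden), size=(n_hidden, n_target_classes)),
        "b": np.zeros(n_target_classes),
        "trainable": True,
    }
    return target + [head]

rng = np.random.default_rng(0)
source = init_mlp([500, 64, 32, 10], rng)   # pretend this was trained on the source task

# Scenario 1: keep the transferred layers fixed and train only the new output layer.
frozen = transfer(source, n_target_classes=2, rng=rng, freeze_hidden=True)
# Scenario 2: use the transferred layers as an initialization and retrain everything.
fine_tuned = transfer(source, n_target_classes=2, rng=rng, freeze_hidden=False)
```

A training loop would then update only the layers whose `trainable` flag is set, which is how the two scenarios differ in practice.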

In the gene expression context, the standard approach is unsupervised pre-training of the hidden layers of the MLP. This approach generally involves an autoencoder (AE) that compresses the gene expression profile into a small vector. Since the AE is trained without labels, it can benefit from much larger datasets. Kim et al. train a variational autoencoder (VAE) on the pan-cancer TCGA dataset, then copy the hidden layers of the encoder to the hidden layers of an MLP that predicts the hazard ratio of patients with a specific cancer. In another work, each hidden layer of an MLP for cancer prognosis prediction is initialized with a denoising autoencoder (DAE) trained on a large pan-cancer dataset. Alzubaidi et al. pre-train their MLP with a sparse autoencoder to predict cancer subtypes and identify biomarkers.
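
As a rough sketch of unsupervised pre-training, the following NumPy code fits a plain one-hidden-layer autoencoder on unlabeled toy data and reuses its encoder as the first layer of a classifier. It is deliberately simpler than the VAE, DAE, and sparse variants cited above, and the synthetic data, the `train_autoencoder` helper, the learning rate, and the layer sizes are assumptions for illustration only.

```python
import numpy as np

def train_autoencoder(X, n_latent, rng, lr=0.05, epochs=300):
    """Fit a tanh encoder and a linear decoder by gradient descent on the MSE."""
    n_genes = X.shape[1]
    W_enc = rng.normal(0.0, 0.1, size=(n_genes, n_latent))
    b_enc = np.zeros(n_latent)
    W_dec = rng.normal(0.0, 0.1, size=(n_latent, n_genes))
    b_dec = np.zeros(n_genes)
    losses = []
    for _ in range(epochs):
        Z = np.tanh(X @ W_enc + b_enc)            # compressed representation
        err = Z @ W_dec + b_dec - X               # reconstruction error
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared reconstruction error.
        g_out = 2.0 * err / err.size
        g_pre = (g_out @ W_dec.T) * (1.0 - Z ** 2)   # through the tanh
        W_dec -= lr * (Z.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        W_enc -= lr * (X.T @ g_pre)
        b_enc -= lr * g_pre.sum(axis=0)
    return (W_enc, b_enc), losses

rng = np.random.default_rng(0)
# Synthetic "unlabeled source data": 200 profiles of 50 genes driven by 5 hidden factors.
factors = rng.normal(size=(200, 5))
loadings = rng.normal(size=(5, 50))
X_source = factors @ loadings + 0.1 * rng.normal(size=(200, 50))
X_source = (X_source - X_source.mean(axis=0)) / X_source.std(axis=0)

(W_enc, b_enc), losses = train_autoencoder(X_source, n_latent=5, rng=rng)

# Target classifier: the pre-trained encoder becomes the hidden layer (which should
# then also use tanh), and a fresh output layer is added for two classes.
classifier = [
    (W_enc.copy(), b_enc.copy()),
    (rng.normal(0.0, 0.1, size=(5, 2)), np.zeros(2)),
]
```

No labels are needed for the pre-training step, which is why much larger source datasets can be used there than for the supervised target task.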

Non-expert users may overlook how sensitive neural networks (NN) are to hyper-parameter values, because the increasing complexity of deep neural network (DNN) models makes parameter tuning very hard. As a result, these users often keep the default settings when training their models, which leads to sub-optimal results. Moreover, since many pre-trained NN models are available, transfer learning is often used without considering all the strategies that can significantly improve NN results. In their paper, Hanczar et al. address these points with an exhaustive evaluation of MLPs on several cancer prediction tasks based on gene expression data (microarray and RNA-Seq). They measure the impact of the main hyper-parameters on NN performance and compare NNs with state-of-the-art machine learning models. In addition, they investigate the usefulness of different transfer learning strategies, including unsupervised pre-training. To their knowledge, this is the first work that promotes the appropriate use of deep learning and transfer learning for biomedical prediction tasks, and the most extensive experimental study on this topic: they trained and evaluated around 93,000 NNs across 22 prediction tasks.
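
To give a feel for why such a study requires so many trained networks, here is a small sketch that enumerates a hyper-parameter grid. The hyper-parameter names, their values, and the repeat count are placeholders and not the search space of the paper; only the number of tasks (22) comes from the text above.

```python
from itertools import product

# Hypothetical search space; the values are placeholders, not the grid used in the study.
grid = {
    "n_hidden_layers": [1, 2, 4],
    "units_per_layer": [64, 256],
    "learning_rate": [1e-2, 1e-3],
    "dropout": [0.0, 0.5],
}

def configurations(grid):
    """Yield one dict per combination of hyper-parameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

configs = list(configurations(grid))

n_tasks = 22     # number of prediction tasks reported in the study
n_repeats = 5    # assumed number of repeated runs per configuration
n_models = len(configs) * n_tasks * n_repeats
print(len(configs), "configurations,", n_models, "networks to train")
```

Even this small grid multiplies quickly once tasks and repeated runs are taken into account, which is consistent with the scale reported by the authors.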

Hanczar, B., Bourgeais, V. & Zehraoui, F. Assessment of deep learning and transfer learning for cancer prediction based on gene expression data. BMC Bioinformatics 23, 262 (2022). https://doi.org/10.1186/s12859-022-04807-7