Jordi Martorell-Marugán1 • Siham Tabik2 • Yassir Benhammou2 • Coral del Val2 • Igor Zwir2 • Francisco Herrera2 • Pedro Carmona-Sáez1
1GENYO, Centre for Genomics and Oncological Research: Pfizer, University of Granada, Andalusian Regional Government, Granada, Spain; 2Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
Abstract: The rise of omics techniques has resulted in an explosion of molecular data in modern biomedical research. Together with information from medical images and clinical data, the field of omics has driven the implementation of personalized medicine. Biomedical and omics datasets are complex and heterogeneous, and extracting meaningful knowledge from this vast amount of information is by far the most important challenge for bioinformatics and machine learning researchers. In this context, there is an increasing interest in the potential of deep learning (DL) methods to create predictive models and to identify complex patterns from these large datasets. This chapter provides an overview of the main applications of DL methods in biomedical research, with focus on omics data analysis and precision medicine applications. DL algorithms and the most popular architectures are introduced first. This is followed by a review of some of the main applications and problems approached by DL in omics data and medical image analysis. Finally, implementations for improving the diagnosis, treatment, and classification of complex diseases are discussed.
Keywords: artificial neural networks; biomedical informatics; deep learning; omics data analysis; precision medicine
Authors for correspondence: Francisco Herrera, Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. Email: herrera@decsai.ugr.es; Pedro Carmona-Sáez, GENYO, Centre for Genomics and Oncological Research: Pfizer, University of Granada, Andalusian Regional Government, PTS Granada, Avenida de la Ilustración 114 – 18016, Granada, Spain. Email: pedro.carmona@genyo.es
Doi: http://dx.doi.org/10.15586/computationalbiology.2019.ch3
In: Computational Biology. Holger Husi (Editor), Codon Publications, Brisbane, Australia. ISBN: 978-0-9944381-9-5; Doi: http://dx.doi.org/10.15586/computationalbiology.2019
Copyright: The Authors.
Licence: This open access article is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0). https://creativecommons.org/licenses/by-nc/4.0/
The amount of available biological data has increased exponentially since the emergence of high-throughput technologies such as microarrays and next-generation sequencing (1), introducing biology to the big data era. These methods initiated the so-called omics revolution, where large amounts of omics data providing global information about different properties of genes, proteins or biomolecules can be generated within a short period of time in a cost-effective way. These methods have revolutionized biomedical research by providing a more comprehensive understanding of the biological system under study and the molecular mechanisms underlying disease development. The generation of such a large amount of data in biomedicine requires the application of advanced informatics techniques in order to extract new insights and expand our knowledge about diseases, improve diagnosis, and design personalized treatments. In this context, DL algorithms have become one of the most promising methods in the area (2).
DL is a subset of machine learning (ML) algorithms characterized by the use of artificial neural networks (ANNs). ANNs are inspired by biological neural networks in the sense that they are formed by interconnected artificial neurons, which receive an input, apply a transformation to the data, and return an output (which can be an input for another neuron). DL is gaining popularity as a powerful approach that can encode and learn from heterogeneous and complex data, in both supervised and unsupervised settings. DL methods have achieved considerable improvements in classical artificial intelligence challenges like language processing, speech recognition, and image recognition (3). In the context of biomedical research, DL methods have drawn the attention of many researchers, and there is an increasing number of applications in omics data analysis. Omics data analysis is frequently impeded by low signal-to-noise ratios, datasets with a large number of variables and a relatively small number of samples, or large analytical variance. In this context, DL techniques have already outperformed previous methods in terms of sensitivity, specificity and efficiency (4). In addition, DL algorithms face the challenge not only of analyzing each kind of data separately but also of integrating different omics layers or even other sources of information such as medical images or clinical health records. This big data analysis and integration is fueling the implementation of personalized medicine approaches, allowing early detection and classification of diseases or personalized therapies for each patient depending on their biochemical background. This chapter reviews the main applications of DL methods to omics data analysis with a focus on the types of analysis, challenges, and opportunities in precision medicine.
DL networks are a class of ML algorithms whose aim is to determine a mathematical function $f$ that maps a number of inputs, $x$, to their corresponding outputs, $y$, such that $y = f(x)$. A simple feedforward network $y = f(x; w) = L_N(L_{N-1}(\cdots L_1(x)))$ is defined as a composition of $N$ nonlinear transformations $L_i$ ($1 \le i \le N$), where each function $L_i$ corresponds to a hidden layer activation and $w$ denotes the learnable weights contained in all filter bank layers, which are updated during training.
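The composition of layers described above can be illustrated with a minimal sketch in plain Python. The layer sizes, the random weights and the choice of tanh as the nonlinearity are assumptions made only for this example; they are not taken from any of the reviewed methods.

```python
import math
import random


def dense_layer(weights, biases, activation):
    """Return a function L_i(x) = activation(W x + b) for one layer."""
    def layer(x):
        return [
            activation(sum(w_jk * x_k for w_jk, x_k in zip(row, x)) + b_j)
            for row, b_j in zip(weights, biases)
        ]
    return layer


def compose(layers):
    """Build f(x) = L_N(L_{N-1}(... L_1(x))) from the list [L_1, ..., L_N]."""
    def f(x):
        for layer in layers:
            x = layer(x)
        return x
    return f


def random_matrix(rows, cols, rng):
    """Random weights in [-1, 1]; stands in for the learnable weights w."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
sizes = [4, 3, 2]  # hypothetical: 4 inputs, one hidden layer of 3 units, 2 outputs
layers = [
    dense_layer(random_matrix(n_out, n_in, rng), [0.0] * n_out, math.tanh)
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
f = compose(layers)
y = f([0.5, -1.0, 0.25, 2.0])  # one forward pass through both layers
```

Here `f` is only a forward pass with untrained weights; learning the weights is a separate step.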
Under the supervised learning approach, these networks are usually trained iteratively: at each iteration, a subset of the training data, called a batch, is provided to the network together with its ground truth labels. After the batch is fed forward through the network’s layers, the loss function is computed at the output layer as the difference between the predictions and the correct responses. All layer weights are then updated so that the loss in the next iteration is reduced. This weight-tuning operation is performed with the back-propagation algorithm (5), in which the gradient of the error function is propagated backward through the network after each batch to adjust the filter banks, thereby learning the value of the parameter w that results in the best function approximation.
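The training procedure described above can be sketched for a very small network as follows. This is an illustrative example under stated assumptions: one tanh hidden layer, a linear output, mean squared error as the loss, plain mini-batch gradient descent, and a synthetic toy dataset. All names, sizes and hyperparameters are hypothetical.

```python
import math
import random


def init_params(n_in, n_hid, rng):
    """Small random weights for one hidden layer and a linear output unit."""
    return {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)],
        "b1": [0.0] * n_hid,
        "W2": [rng.uniform(-0.5, 0.5) for _ in range(n_hid)],
        "b2": 0.0,
    }


def forward(p, x):
    """Feedforward pass: returns hidden activations and the prediction."""
    h = [math.tanh(sum(w * xk for w, xk in zip(row, x)) + b)
         for row, b in zip(p["W1"], p["b1"])]
    y_hat = sum(w * hj for w, hj in zip(p["W2"], h)) + p["b2"]
    return h, y_hat


def mean_squared_error(p, data):
    return sum((forward(p, x)[1] - y) ** 2 for x, y in data) / len(data)


def train_on_batch(p, batch, lr):
    """One iteration: accumulate gradients by back-propagation, then update."""
    n_hid, n_in = len(p["W1"]), len(p["W1"][0])
    g_W1 = [[0.0] * n_in for _ in range(n_hid)]
    g_b1 = [0.0] * n_hid
    g_W2 = [0.0] * n_hid
    g_b2 = 0.0
    for x, y in batch:
        h, y_hat = forward(p, x)
        d_out = 2.0 * (y_hat - y) / len(batch)  # d(loss)/d(y_hat)
        g_b2 += d_out
        for j in range(n_hid):
            g_W2[j] += d_out * h[j]
            # propagate the error back through the tanh hidden unit
            d_pre = d_out * p["W2"][j] * (1.0 - h[j] ** 2)
            g_b1[j] += d_pre
            for k in range(n_in):
                g_W1[j][k] += d_pre * x[k]
    for j in range(n_hid):
        p["W2"][j] -= lr * g_W2[j]
        p["b1"][j] -= lr * g_b1[j]
        for k in range(n_in):
            p["W1"][j][k] -= lr * g_W1[j][k]
    p["b2"] -= lr * g_b2


rng = random.Random(0)
params = init_params(n_in=2, n_hid=3, rng=rng)
# synthetic toy data: the label is a simple function of the two inputs
data = [([x1, x2], 0.5 * x1 - 0.3 * x2)
        for x1 in (-1.0, -0.5, 0.5, 1.0) for x2 in (-1.0, 1.0)]

loss_before = mean_squared_error(params, data)
for epoch in range(200):
    rng.shuffle(data)
    for start in range(0, len(data), 4):  # batches of 4 samples
        train_on_batch(params, data[start:start + 4], lr=0.05)
loss_after = mean_squared_error(params, data)
```

In practice, DL frameworks compute these gradients automatically; the explicit loops are shown only to make the backward flow of the error visible.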
Deep feedforward networks (DFFs), also called multilayer perceptrons, constitute the simplest DL architecture. In these models, the input information x flows to its corresponding output y through an intermediate function f that is evaluated and learned inside the neural network layers. These models are called feedforward because there are no feedback connections through which outputs of the model are fed back into it.
Convolutional neural networks (CNNs) are the deep neural networks (DNNs) best suited to high-dimensional data such as medical images. In medical imaging applications, CNNs act as a long dimensionality reduction process that maps input images to their classification scores (e.g., diseased or healthy patient). The building blocks of a CNN are the convolutional layer, the pooling layer, and the fully connected layer. CNNs are generally applied with a transfer learning strategy to enhance their performance on relatively small datasets. Transfer learning consists of transferring previously learned knowledge from a source domain to a target domain. This approach is carried out by taking one of the well-known CNNs pre-trained on a large dataset such as ImageNet (6) and either training it further on the new data or reusing it as a feature extractor (7). Rawat and Wang (8) wrote a more comprehensive review of the history of CNNs and their architectures. Some of the most influential CNNs are summarized in Table 1.
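The three building blocks named above can be sketched in plain Python on a tiny synthetic image. The 5 x 5 image, the hand-set edge filter and the fully connected weights are assumptions chosen for illustration; in a real CNN the filters are learned during training.

```python
def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation form) of one channel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(fmap):
    """Elementwise nonlinearity applied to a feature map."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling, which reduces the spatial dimensions."""
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]


def fully_connected(features, weights, bias):
    """Single output unit: weighted sum of the flattened features."""
    return sum(w * f for w, f in zip(weights, features)) + bias


# synthetic image with a bright vertical stripe
image = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
]
vertical_edge = [[-1, 1], [-1, 1]]  # hypothetical hand-set filter

fmap = relu(conv2d(image, vertical_edge))  # 4 x 4 feature map
pooled = max_pool(fmap)                    # 2 x 2 after pooling
flat = [v for row in pooled for v in row]
score = fully_connected(flat, [0.25] * len(flat), 0.0)
```

The feature map shrinks from 5 x 5 to 4 x 4 to 2 x 2 before being reduced to a single score, which is the dimensionality reduction behavior described in the text.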
TABLE 1 Summary of some of the most influential CNNs
CNN | Layers | Parameters | Comments |
---|---|---|---|
LeNet | 5 | 60 000 | First CNN to be trained on a large dataset (5, 87). |
AlexNet | 7 | 60 million | Variation of LeNet. First CNN model to win the prestigious ILSVRC^a, in 2012 (88). |
GoogLeNet | 22 | 4 million | Winner of ILSVRC^a in 2014 (89). Its main contribution is the inception module, which is composed of several small parallel convolutions. |
VGGNet | 16 | - | Initially the runner-up in ILSVRC^a 2014 behind GoogLeNet (90). |
ResNet | 18, 34, 50, 101 or 152 | 11.7 million – 60.2 million | To overcome the vanishing gradient issue, the ResNet authors (91) proposed learning a residual function F(x) = H(x) - x, where H(x) is the standard mapping that we want to learn from an input x through a few stacked non-linear layers. The mapping is thus reformulated as H(x) = F(x) + x, where F(x) and x represent the stacked non-linear layers and the identity function, respectively. Their hypothesis is that the residual mapping F(x) is easier to optimize than the original mapping H(x). |
DenseNet | 121, 161, 169 or 201 | 8 million – 20 million | Presented in (92) to build on previous findings regarding increased CNN depth and identity shortcut connections. The distinguishing feature of this architecture is that each layer is connected to all its previous and next layers. |
^a ILSVRC: ImageNet Large Scale Visual Recognition Challenge.
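The residual reformulation H(x) = F(x) + x mentioned for ResNet in Table 1 can be sketched as follows. The two example functions standing in for F are hypothetical; in ResNet, F is a stack of learned convolutional layers.

```python
def residual_block(x, residual_fn):
    """Return H(x) = F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    # F(x) = 0: the block then reduces to the identity mapping
    return [0.0 for _ in x]


def small_residual(x):
    # hypothetical F: a small elementwise correction on top of the identity
    return [0.1 * max(0.0, xi) for xi in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_residual)
corrected_out = residual_block(x, small_residual)
```

When F outputs zeros the block passes its input through unchanged, so a layer only has to learn a correction to the identity rather than the full mapping.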
Recurrent neural networks (RNNs) are neural networks designed for sequential data, in which the output reached at time step t – 1 affects the decision reached one moment later at time step t. These networks have two input sources, the present and the recent past, which are combined to determine how they respond to new data.
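The way the present input and the recent past are combined can be sketched with a simple recurrent update of the form h_t = tanh(W_x x_t + W_h h_{t-1} + b). This specific form, the weights and the two toy sequences are assumptions used for illustration only.

```python
import math


def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return [
        math.tanh(
            sum(w * v for w, v in zip(W_x[j], x_t))
            + sum(w * v for w, v in zip(W_h[j], h_prev))
            + b[j]
        )
        for j in range(len(b))
    ]


def run_rnn(sequence, W_x, W_h, b):
    """Process a sequence step by step, carrying the hidden state forward."""
    h = [0.0] * len(b)
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)
    return states


# hypothetical weights: 1 input feature, 2 hidden units
W_x = [[0.5], [-0.3]]
W_h = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.0]

# both sequences end with the same input but have a different past
seq_a = [[1.0], [0.0]]
seq_b = [[-1.0], [0.0]]
last_a = run_rnn(seq_a, W_x, W_h, b)[-1]
last_b = run_rnn(seq_b, W_x, W_h, b)[-1]
```

Although the last input is identical in both sequences, the final hidden states differ because each one depends on the state reached at the previous time step.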
The main drawback of RNNs is the vanishing gradient problem. To address this issue, a variant of RNN called long short-term memory (LSTM) was proposed. LSTMs aim to preserve the error that can be back-propagated through time and layers. In fact, they allow recurrent nets to continue to learn over many steps by maintaining a more constant error. LSTMs hold information outside the normal flow of the recurrent network in a gated cell. Information can be stored in, written to or read from a cell, much like data in a computer’s memory.
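The gated cell can be sketched with the commonly used LSTM update equations, reduced here to a single unit with scalar weights. The equations are the standard textbook formulation rather than something specified in this chapter, and the parameter values are hypothetical, chosen so that the forget gate stays almost fully open and the input gate almost closed.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a single-unit LSTM cell; p maps gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x_t + w_h * h_prev + bias

    f = sigmoid(pre("forget"))       # how much of the stored cell state to keep
    i = sigmoid(pre("input"))        # how much new information to write
    o = sigmoid(pre("output"))       # how much of the cell state to read out
    g = math.tanh(pre("candidate"))  # candidate value to be written
    c_t = f * c_prev + i * g         # updated cell state (the memory)
    h_t = o * math.tanh(c_t)         # output of the cell
    return h_t, c_t


# hypothetical parameters: forget gate close to 1, input gate close to 0
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0  # start with a value of 1.0 stored in the cell
for x_t in [0.5, -0.7, 0.2] * 10:  # 30 time steps of unrelated input
    h, c = lstm_step(x_t, h, c, params)
```

With these gate settings the stored value is still close to 1.0 after 30 steps, which illustrates how the cell can carry information across many time steps.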
To learn deep feature representations, a deep belief network (DBN) (9) is built by stacking several restricted Boltzmann machines (RBMs) on top of each other. The RBM is the core component of DBN models (10); it is a generative stochastic model that can be used for either unsupervised or supervised learning. It is composed of two layers, a visible input layer and an adjacent hidden layer, trained to learn a probability distribution over the input set. Unlike the original Boltzmann machine (11), an RBM has no hidden–hidden or visible–visible connections, so its units form a bipartite graph.
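For reference, the bipartite structure can be made explicit with the standard energy function of a binary RBM. The notation below (visible units $v_i$ with biases $a_i$, hidden units $h_j$ with biases $b_j$, weights $W_{ij}$, sigmoid $\sigma$) is the usual textbook one and is introduced here as an assumption; it is not taken from the chapter.

```latex
E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j,
\qquad
P(v, h) = \frac{e^{-E(v, h)}}{Z}

P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i W_{ij}\Big),
\qquad
P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j W_{ij} h_j\Big)
```

Because the energy contains no visible–visible or hidden–hidden terms, the units of one layer are conditionally independent given the other layer, which is what makes layer-wise sampling and training tractable.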
Generally, autoencoders (AEs) act in an unsupervised manner, trying to learn the distribution of a given dataset (12), and are often used as dimensionality reduction networks (13). AEs try to learn a mapping function $M_{w,b}(x) = x' \approx x$ through stacked hidden layers that map input data $x$ to a close reconstruction $x'$. An AE is generally composed of an encoder and a decoder. The encoder learns a set of low-dimensional representation features z, while the decoder reconstructs a close copy of x using only the learned features z. A special case of AEs is the sparse autoencoder (SAE) (14), where sparsity is introduced into the hidden units by making the number of nodes in the hidden layer z bigger than in the input layer x. When several SAEs, with only their encoding parts, are stacked on each other, we obtain a stacked sparse autoencoder (SSAE), which is often trained in a bottom–up greedy fashion to learn deep feature representations from the data (14).
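The encoder and decoder roles can be illustrated with a minimal linear sketch. Here the weights are set by hand rather than learned, and the three-dimensional inputs are assumed to lie close to a single direction; both are simplifications made only for this example.

```python
import math


def encode(x, W_enc):
    """Encoder: project the input onto the low-dimensional features z."""
    return [sum(w * v for w, v in zip(row, x)) for row in W_enc]


def decode(z, W_dec):
    """Decoder: reconstruct an approximation x' of the input from z alone."""
    return [sum(w * v for w, v in zip(row, z)) for row in W_dec]


def reconstruction_error(x, x_rec):
    """Squared error between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec))


# hand-set weights: a single latent feature along the direction (1, 2, 2)
norm = math.sqrt(1 + 4 + 4)
u = [1 / norm, 2 / norm, 2 / norm]
W_enc = [u]                 # 1 x 3 encoder matrix
W_dec = [[ui] for ui in u]  # 3 x 1 decoder matrix

on_line = [2.0, 4.0, 4.0]   # lies along the encoded direction
off_line = [2.0, 0.0, 0.0]  # does not

z = encode(on_line, W_enc)
err_on = reconstruction_error(on_line, decode(z, W_dec))
err_off = reconstruction_error(off_line, decode(encode(off_line, W_enc), W_dec))
```

Inputs that follow the structure captured by the latent feature are reconstructed almost exactly, while other inputs are not. A trained AE adjusts its weights so that this error is small on the training data.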
DL algorithms are especially suitable for analyzing complex, heterogeneous, and high-dimensional data such as omics datasets (15). This section reviews some cases of omics data analyses in which DL methods have provided significant insights, and the next section provides an overview of some of the main applications in the context of precision medicine, such as biomarker discovery for disease classification. A summary of the main applications is provided in Figure 1.
Figure 1 DNNs have been applied to several biological data types. The top row shows the different types of data, the middle row shows some examples of DNN structures, and the bottom row shows some of the main applications achieved with these methodologies. Source of medical images: TCIA (93) for MRI and CT; Chest X-Ray database (94) for X-Ray; MedPix® (https://medpix.nlm.nih.gov) for US; TCGA (58) for the histopathological image and ISIC (https://www.isic-archive.com) for the skin lesion. Some graphical elements were downloaded from Stockio (https://www.stockio.com/) and Freepik (https://www.freepik.com/).
Genomics uses a set of techniques to analyze DNA sequences in order to study the structure and function of genomes, gene regulation, and genetic alterations that can be associated with several diseases. In recent years, DL methods have been applied to genomics data to address several questions. For instance, Poplin et al. developed a method that applies CNNs to detect single-nucleotide polymorphisms (SNPs) and indels, and which outperformed previous tools (16). In this context, other approaches have applied ResNets (17), DFFs (18) or CNNs (19) to predict the pathogenic consequences of genetic variants. In addition, Xie et al. applied DFFs and SAEs to predict the effect of genetic variants on gene expression (20). In the field of functional genomics, DL algorithms have been applied to predict enhancer sequences and regulatory motifs in the genome (21–25) from heterogeneous sources of data (histone modifications, chromatin accessibility and so on). Wang et al. applied CNNs to quantify transcription factor (TF)-DNA binding affinities (26). Oubounyt et al. combined a CNN and an LSTM to predict promoter sequences in genes (27). DL algorithms have also helped to identify splice junctions using CNNs (28).
Another important field of application of genomics techniques is the screening of genetic regions (loci) that are associated with diseases or phenotypes, termed genome-wide association studies (GWAS). GWAS analyses identify SNPs in genomic locations, which are then incorporated into risk prediction models that have traditionally been based on polygenic risk scores (29). However, this method has certain limitations, such as its inability to reduce the missing heritability, its difficulty in dealing with epistasis, its assumption of a global linear association model, and problems in replicating results in different samples (30).
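The linear assumption behind polygenic risk scores can be made concrete with a minimal sketch in which the score is a weighted sum of risk-allele counts. The effect sizes and genotypes below are invented for illustration and do not come from any study.

```python
def polygenic_risk_score(allele_counts, effect_sizes):
    """Weighted sum of risk-allele counts (0, 1 or 2 per SNP)."""
    if len(allele_counts) != len(effect_sizes):
        raise ValueError("one effect size is needed per SNP")
    return sum(beta * count for beta, count in zip(effect_sizes, allele_counts))


# hypothetical per-SNP effect sizes (e.g., log odds ratios from a GWAS)
effect_sizes = [0.12, -0.05, 0.30]

# hypothetical genotypes of two individuals at the same three SNPs
patient_a = [0, 1, 2]
patient_b = [2, 2, 0]

score_a = polygenic_risk_score(patient_a, effect_sizes)
score_b = polygenic_risk_score(patient_b, effect_sizes)
```

Each SNP contributes independently and additively to the score, so interactions between loci (epistasis) cannot be represented. This is one of the limitations that motivates the non-linear models discussed next.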
As an alternative, supervised learning algorithms, especially DL models, are gaining relevance in this field. Promising results have been shown by Montañez et al. (31), who developed a DL framework for the classification of obesity as a binary phenotype. However, the predictive capacity of these genetic markers is weak because it is based on single loci. More recently, Fergus et al. (32) modeled the epistatic effects of SNPs using SSAEs to classify term and preterm birth observations in African-American women. Although this approach performs well in classification and captures interactions between loci, it suffers from the common black-box problem. The selected SNPs lose their GWAS context, which makes it very difficult to evaluate their contribution to the phenotype. A different approach is the one proposed by PGMRA (33), a deep unsupervised and data-driven ML method designed for fusing genotypic–phenotypic analysis in a semi-supervised fashion. It includes an unsupervised non-negative matrix factorization (NMF) method acting as an AE (13), multiobjective optimization and pooling, interpretable association of types of knowledge, and labeling of the associations. Each layer has its own learning process and constitutes the input of the next layer. The results from PGMRA are interpretable and have been able to decrease the missing heritability and identify the epistatic sets of markers that compose the genotypic–phenotypic architecture of a disease or trait (34).
Transcriptomics quantifies the expression level of all RNA transcripts that are produced in a cell. Raw transcriptomics data are usually processed to generate expression matrices containing an estimate of the expression level of each gene or transcript across several samples and conditions, and these matrices are typically the input of DL methods. There is a broad range of transcriptomics applications in which DL has been successfully applied. For example, one of the main uses of gene expression data is the analysis of alternative splicing (i.e., the synthesis of different transcript isoforms from the same gene). In this context, Zhang et al. analyzed differential splicing between samples using RNA-seq data, combining a DNN with a Bayesian statistical model (35). On the other hand, CNNs have been applied to distinguish actual splice junctions from false positives generated during the alignment of RNA-seq reads (36). In addition, Jha et al. proposed a model to integrate RNA-seq and CLIP-seq data in order to improve the study of alternative splicing (37).
Another major research focus in transcriptomics is the prediction of other types of RNAs, such as non-coding RNAs (ncRNAs), and the characterization of their expression. In this context, Hill et al. proposed an RNN to differentiate between coding and non-coding RNAs (38), demonstrating the capability of their algorithm to identify ncRNAs without providing their model with previous knowledge. Tripathi et al. developed a method to detect long ncRNAs (lncRNAs) (39), reaching a 99% accuracy rate by applying a DFF to reference databases. Long intergenic ncRNAs (lincRNAs), a type of lncRNA transcribed in intergenic regions, have also been successfully predicted by feeding an AE with previous knowledge about lincRNAs (40).
Epigenomic studies identify modifications that can potentially alter gene expression without changing the DNA sequence itself. Epigenetic markers include DNA methylation, histone modification, and specifically positioned nucleosomes. DNA methylation is perhaps the most studied epigenetic modification. DNA methylation studies generate methylation matrices that, like gene expression matrices, can be used for biomarker discovery or disease classification problems. In this context, DL methods based on CNNs have been used to accurately predict the sequences recognized by DNA- and RNA-binding proteins (41). A key advantage of this method is its capability to integrate data from different technologies used in epigenomics studies, such as chromatin immunoprecipitation (ChIP)-seq or cross-linking immunoprecipitation (CLIP)-seq. DNase I sequencing data have also been used to predict the three-dimensional chromatin state in a cell using CNNs (42). On the other hand, Wang et al. accurately predicted DNA methylation state by feeding SAEs with sequence and Hi-C data (43). Histone modifications, like DNA methylation, do not affect the DNA sequence but can modify its availability to the transcriptional machinery. Using CNNs, Yin et al. designed an algorithm to predict these histone modifications by integrating sequence and DNase data (44). In addition, Singh et al. used a CNN to infer gene expression from histone modification data (45), while Sekhon et al. used an LSTM to predict differential gene expression, also from histone modification data (46).
Proteomics comprises a set of techniques that can be used to quantify expression levels, post-translational modifications or localization of proteins in a cell or a biological sample. Metabolomics is the study of the complete metabolome, that is, the set of small molecules that participate in general metabolic reactions. The technologies used in these omics fields include mass spectrometry (MS) and nuclear magnetic resonance (NMR), and the first challenge for researchers is to assign raw instrumental signals to proteins or metabolites.
In proteomics, the most common experimental strategy is to split proteins into short amino acid chains (peptides) and to analyze these peptides by MS. The MS output signals are compared with peptide profiles stored in public or proprietary databases to identify them. However, these databases are still incomplete and inaccurate. In this context, Zhou et al. developed software that uses an LSTM network to predict peptide MS/MS spectra (47). Knowing peptide spectra a priori facilitates the task of assigning MS/MS spectra to peptides by comparing them with the theoretical spectra. Another proteomics application is de novo peptide sequencing, which is essential for protein characterization. In this field, Tran et al. surpassed previous software by combining CNN and LSTM networks (48). Once the collection of peptides in a proteomics sample has been sequenced, the next challenge is to identify the proteins from which these peptides originate. Kim et al. addressed this problem with a CNN (49), obtaining better results than other dedicated libraries for this task. DL has also been applied to predict protein secondary structures from their amino acid sequences (50).
NMR technology is essential for both proteomics and metabolomics data generation. However, it has the technical limitation of returning many noise signals, which should be filtered out in order to improve accuracy. Kobayashi et al. automated this necessary step by applying CNNs to remove noise peaks from NMR spectra (51), thereby improving the performance.
Applying DL methods to metabolomics data is especially challenging because these methods are unable to identify the specific factors that contribute to individual samples, which is essential in these types of experiments (52). Despite this, some DL applications developed in this field have provided interesting results. For instance, Date and Kikuchi combined a DNN with the mean decrease accuracy metric to analyze NMR-based metabolomics data (52). Asakura et al. also applied DNNs to metabolomics data, outperforming other ML applications (53).
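The idea behind a mean decrease accuracy metric can be sketched with a permutation-based variant: each variable is shuffled in turn, and the resulting drop in accuracy is taken as its importance. This is an illustrative sketch and not the procedure of Date and Kikuchi; the threshold rule standing in for a trained DNN and the toy data are hypothetical.

```python
import random


def accuracy(model, X, y):
    """Fraction of samples for which the model predicts the correct label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def mean_decrease_accuracy(model, X, y, n_repeats=20, seed=0):
    """Average accuracy drop when each feature column is randomly permuted."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    # stand-in for a trained classifier: it only looks at the first variable
    return int(row[0] > 1.0)


# toy data: two measured variables per sample, binary class labels
X = [[0.2, 5.0], [0.4, 1.0], [1.6, 4.0], [1.9, 2.0], [0.1, 3.0], [2.2, 0.5]]
y = [0, 0, 1, 1, 0, 1]

importances = mean_decrease_accuracy(toy_model, X, y)
```

A variable that the model relies on shows a positive drop in accuracy when permuted, whereas a variable the model ignores shows none, which gives a ranking of the variables that contribute to the predictions.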
Precision medicine aims to move away from general therapies for a broad population toward individualized targeted therapies and treatment protocols that depend on each patient’s molecular background (54), or to establish preventive medicine strategies based on estimates of disease susceptibility (55). Omics data have a key role in this transition because they enable diseases to be studied at several levels simultaneously (e.g., DNA sequence, gene expression, and medical images) and make it possible to identify which parts of the complex biological functions are altered. In this new scenario, several ML-based approaches have been applied to medicine (56). However, although ML has been demonstrated to be useful in several precision medicine applications, it has some disadvantages that can be overcome by DL architectures. For instance, the performance of ML depends strongly on the data preprocessing used to extract features, while DL models include this feature extraction (57).
One of the most common applications of omics technologies in biomedical research is the identification of new biomarkers for early disease diagnosis, treatment response, and classification. The availability of large amounts of public omics data, especially in cancer, such as The Cancer Genome Atlas (TCGA) (58), has permitted the identification of new biomarkers with both DL and non-DL strategies. A promising study applied a stacked denoising autoencoder (SDAE) to classify breast cancer samples from the TCGA database as healthy or diseased using gene expression data (59). In addition, this method identified a set of highly interactive genes that could be good cancer biomarkers. Gene expression data from TCGA have also been exploited to accurately differentiate samples into different cancer types (60). On the other hand, Si et al. used an AE to classify healthy and breast cancer patients using methylation data (61), while Chatterjee et al. used a CNN to classify different cancer types by their methylation patterns, achieving very promising results (62). Multiple omics (RNA-seq, miRNA-seq, and methylation data) have been combined by Chaudhary et al. to classify liver cancer patients into different survival groups (63). The authors used TCGA data to train their AE model, but they expect to improve their method using more clinical data in the future. In a similar work, Olivier et al. integrated the same kinds of omics data from TCGA to stratify bladder cancer patients by their survival chances (64). They used an AE approach to split patients into two survival groups, and they also used these clusters to identify biomarkers linked to survival rates. Biomarkers for Alzheimer’s disease have also been proposed using DFFs (65). Another precision oncology application is a tool developed by Yuan et al. to classify cancer types based on somatic mutations (66). The authors combined a DFF with other statistical techniques, and they trained and tested their method with TCGA data for 12 cancer types.
Medical imaging is one of the main tools for the transition from traditional medicine to precision medicine. This section reviews some DL-based imaging applications in the context of disease classification and diagnosis.
In skin cancer, the first step in diagnosis is visual inspection by a dermatologist. Consequently, skin cancer diagnosis is a classical image recognition problem to which researchers have applied ML methods and image recognition approaches. In a recent work, Esteva et al. trained a CNN with thousands of clinical images to automatically identify whether a skin lesion is a skin cancer symptom (67). With their method, they obtained results as good as those of a panel of expert dermatologists. Other studies have addressed this problem with CNNs (68), all of them with promising results, and it is expected that this research will be translated within a few years into mobile applications able to accurately diagnose skin cancer lesions.
In the context of brain cancer, tumor segmentation is essential to define the shape and size of the tumor and to decide on diagnoses and therapies accordingly. Tumor segmentation is usually performed manually by doctors using magnetic resonance imaging (MRI). However, this crucial task is very time-consuming and subjective. Therefore, there has been a lot of interest in automating tumor segmentation from MRI data. This task is very challenging because MRI data consist of 3D images in which tumors differ greatly between patients and, in addition, the images themselves are very heterogeneous depending on the device and experimental procedures employed (69). Several researchers have addressed this challenge using CNNs (70–72) or SAEs (73).
Analysis of histopathological images is one of the most common tests for cancer diagnosis. As with brain tumor segmentation, the analysis of these images is performed manually by pathologists, which is a time-consuming task. Several attempts have therefore been made to automate this process. Litjens et al. reported a CNN-based strategy for prostate and breast cancer diagnosis (74), although their results are very preliminary and much more research is necessary in this field. In addition, Xie et al. recently combined different DL algorithms to classify breast cancer subtypes from histopathological images (75). Colorectal polyps have also been classified by applying a ResNet (76).
Computed tomography (CT) is used for the diagnosis of several diseases due to its capacity to generate three-dimensional anatomic images. Some DL approaches will likely enable the use of CT images in precision medicine. Roth et al., for instance, proposed the application of CNNs to automatically classify CT images into the different human anatomical parts (77). Such classification is the first step in many CT-based diagnostic strategies. There are also some specific applications in this field, for instance, for pancreas segmentation (78) or coronary artery calcium scoring (79).
Ultrasound (US) imaging is another imaging technique with many medical applications, for instance, in the diagnosis of heart dysfunctions. Carneiro and Nascimento innovated in this field by using DBNs for left ventricle endocardium tracking, allowing the automatic detection of different cardiopathies (80). On the other hand, Lekadir et al. applied a CNN to characterize carotid plaque composition (81). In addition, Biswas et al. developed a DL method to characterize liver US images, allowing the diagnosis and stratification of liver pathologies (82).
Some DL methods have also been applied to X-ray images. For instance, Nasr-Esfahani et al. used a CNN to detect vessel regions, a necessary step for coronary artery disease diagnosis (83). Bone age assessment is a common technique to detect growth abnormalities, and currently, it is done manually by comparing the patient’s X-ray images with reference images from databases. However, some authors have applied DL algorithms to automate this process (84, 85).
Finally, facial images are being used with very promising results for automatic disease diagnosis. In a very recent work, Gurovich et al. have presented a facial analysis framework for genetic syndrome classification (86). They used patient facial images and CNNs to quantify similarities of facial features to hundreds of syndromes outperforming clinicians in diagnosis tasks.
Omics technologies are not only changing the way we study biomedicine but also introducing novel analytical challenges for bioinformatics analysts. DL is a promising approach to analyze these complex and heterogeneous datasets to drive precision medicine. This chapter reviewed some of the most common DL applications in omics data analysis and precision medicine. Although these methods have been used with very promising results, there are important considerations to take into account. The most successful application of DL in biomedical research to date has been in supervised learning; therefore, a crucial step is to avoid biases in training sets, as the quality of learning depends on the quality of the input data. No single method is universally applicable, and the choice of whether and how to use DL approaches will be problem-specific. Conventional analytical approaches will remain valid and have advantages when data are scarce or if the aim is to assess statistical significance, which is currently difficult using DL methods. Another limitation of DL is the increased complexity, which applies both to model design and to the required computing environment. The application of DL methods to omics and precision medicine is a very new field. Although there are still some limitations, growing interest and research effort are resolving the major shortcomings and yielding very promising applications. The increasing availability of omics datasets, medical images and clinical health records is fuelling the promising applications of DL technology, which in the near future will play an increasingly important role in this field.
Acknowledgement: JMM was partially funded by Ministerio de Economía, Industria y Competitividad. This work was partially supported by Junta de Andalucía (PI-0173-2017).
Conflict of Interest: The authors declare no potential conflicts of interest with respect to research, authorship and/or publication of this chapter.
Copyright and permission statement: To the best of our knowledge, the materials included in this chapter do not violate copyright laws. All original sources have been appropriately acknowledged and/or referenced. Where relevant, appropriate permissions have been obtained from the original copyright holder(s).