Barbara Kramarz • Ruth C. Lovering
Functional Gene Annotation, Preclinical and Fundamental Science, UCL Institute of Cardiovascular Science, University College London, London, UK
Abstract: Gene Ontology (GO) is a universal resource for analyses and interpretation of high-throughput biological datasets. GO is developed and curated by several different groups, based at scientific institutions around the world, working together under the auspices of the GO Consortium. GO annotations capture biological functional knowledge by associating gene products with GO terms. GO term and gene product records all have computer-readable accession numbers; therefore, these annotations can be easily used for analyses of large datasets while retaining human-readable labels. The UCL Functional Gene Annotation group focuses on GO annotation of human gene products. Our group has led initiatives to systematically annotate proteins and microRNAs across specific biomedical fields, and our current biocuration effort, funded by the Alzheimer’s Research UK foundation, is focused on dementia and Alzheimer’s disease. Our group has also contributed to the development and revision of the ontology describing neurological domains of biology. Here we present an overview of GO and explain how our work, as well as the work of other members of the GO Consortium, is improving the neurological domains of the GO resource. These biocuration efforts will benefit the dementia and Alzheimer’s research community by rendering GO more suitable for analyses of neurological datasets.
Keywords: annotation; biocuration; Gene Ontology; high-throughput analysis; neurobiology
Author for correspondence: Ruth Lovering, Functional Gene Annotation, Preclinical and Fundamental Science, UCL Institute of Cardiovascular Science, University College London, London, UK. Email: r.lovering@ucl.ac.uk.
Doi: http://dx.doi.org/10.15586/alzheimersdisease.2019.ch2
In: Alzheimer’s Disease. Thomas Wisniewski (Editor), Codon Publications, Brisbane, Australia. ISBN: 978-0-646-80968-7; Doi: http://dx.doi.org/10.15586/alzheimersdisease.2019
Copyright: The Authors.
Licence: This open access article is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0). https://creativecommons.org/licenses/by-nc/4.0/
Several genes associated with monogenic Alzheimer’s disease (AD) have been identified (1); however, the disease can also be caused by polygenic and environmental risk factors (1, 2). To understand the cellular processes and risk factors associated with AD, numerous transcriptomic, proteomic, and genome-wide association (GWA) studies have been conducted (3–5). Researchers are now turning to pathway-based GWA analysis and Next Generation Sequencing (NGS) to identify the genes contributing to the “missing heritability” (6, 7).
The process of finding gene variants that are causative, or modifiers, of disease is often time-consuming. Bioinformatics-based analyses can aid the identification of AD risk variants, based on the variant’s association with a gene product implicated in neurobiological processes and pathways impaired in dementia. Such approaches are reliant on bioinformatics resources, including Gene Ontology (GO) (8, 9), KEGG (10), Reactome (11), and molecular interaction databases (12, 13). These resources provide connections between gene products and biological pathways or networks, which are relevant to AD. The end result of these analyses is the identification of both the risk variant and the candidate gene associated with the risk (14, 15). In addition, considerable research is now focused on the selection of biomarkers for AD (16), and the creation of biomarker panels is likely to be more successful if it is known what biological pathways the candidate biomarkers have in common.
The majority of analyses of high-throughput approaches rely on high-quality annotation data (4, 5) because these bridge the gap between data collation and data analysis (4, 17). Gene annotation datasets provide functional knowledge about gene products, such as proteins or microRNAs, in a computationally accessible format; thus, these data can be exploited by systems biology investigators. The main resources used to identify significantly enriched pathways in “omics” studies are those provided by GO (8, 9), KEGG (10), Reactome (11), and protein interaction databases (12, 13). GO annotation data are frequently used because they can describe a gene product’s role in a process or its location in a cell, even when the basic molecular activity of this gene product is still under investigation (Figure 1) (18). In contrast, Reactome and KEGG provide very specific information about the molecular function of a gene product within a pathway, with the “reaction” catalyzed or facilitated by each gene product clearly identified within a pathway diagram. Consequently, gene products whose role has not been fully elucidated cannot be included in these resources. Furthermore, although the human and mammalian phenotype ontologies (HPO, MP) (19) are being used to interpret NGS data, understanding how multiple genes contribute to a single disease or phenotype will require resources, such as GO, that describe the cellular roles of these genes.
Figure 1 A selection of Gene Ontology annotations. This list of Gene Ontology annotations was downloaded from the QuickGO browser (37). All of these annotations, based on the experimental data presented by Zhao et al. (20), were created by the UCL Functional Gene Annotation group. The annotations were filtered by ‘PMID:26005850’. The columns, in order from left to right, are as follows: Symbol, HGNC-approved gene symbol; GO term, GO term identifier and name; Evidence, one of the many Evidence and Conclusion Ontology (ECO) codes (38) associated with each GO annotation to indicate the type of experiments that support the annotation (IDA, Inferred from Direct Assay; IMP, Inferred from Mutant Phenotype; IPI, Inferred from Physical Interaction); Annotation Extension, additional information about the annotations, for example, the location of the function (occurs_in CL:0002144, capillary endothelial cell), or the entity that activates the function (activated_by CHEBI:64646, amyloid-beta polypeptide 40).
The GO resource (8, 9) is maintained, curated, and made available through the concerted efforts of the GO Consortium, whose aim is to provide both an ontology of terms and gene product annotations. Consequently, the GO Consortium includes skilled biocuration scientists, ontology editors, and software engineers. The ontology enables the description of attributes of gene products, including proteins, macromolecular complexes, and noncoding RNAs, in three key domains: molecular function, biological process, and cellular component. Fully defined computer-readable GO terms are used by the GO Consortium annotation groups, including our Functional Gene Annotation group at UCL, to create links (annotations) between GO terms and gene products across many species, based on published scientific findings, providing a computable and traceable summary of individual experiments. GO terms are used to describe gene products by their molecular functions (e.g., scavenger receptor activity), the biological processes they contribute toward (e.g., microtubule cytoskeleton organization), and their subcellular locations (e.g., extracellular region). For instance, GO curators have contributed 46 GO annotations based on experimental evidence presented by Zhao et al. (20), of which a selection is presented in Figure 1.
The gene product annotations contributed by GO biocurators are regularly submitted to the GO knowledgebase, where the most current and complete collection of GO terms and annotations is publicly available to all users (9). Providers of bioinformatics tools, such as g:Profiler (21), Cytoscape (22), or DAVID (23), import GO data into their tools for use in enrichment analyses of large datasets. Therefore, the association of GO terms with gene product records (to create annotations) and the use of GO annotation data in analysis tools together enable groups of similarly annotated gene products, within an “omics” dataset, to be identified as significantly enriched (18, 24, 25). Thus, dysregulated pathways, functions, and macromolecular complexes can be identified within high-throughput datasets. However, GO annotation is an ongoing initiative, with certain biological aspects annotated more thoroughly than others. Insufficient annotation of key biological processes and pathways relevant to dementia can hinder the interpretation of outcomes from GWA studies, microarray, and proteomic approaches to dissect AD and other AD-relevant diseases (26). Consequently, these analyses may identify partial protein networks or only general GO terms as enriched in the dataset, for example mitochondrion (27) and calcium-mediated signaling (4). Having recognized this deficit, the Functional Gene Annotation group at UCL have, for the last 5 years, focused on the annotation of gene products relevant to Parkinson’s and Alzheimer’s diseases (26, 28, 29). This has led to substantial improvements in the representation of processes such as mitophagy, amyloid precursor protein processing, oxidative stress, and tau-associated processes.
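To make the notion of a "significantly enriched" GO term concrete, the following minimal sketch computes an upper-tail hypergeometric probability, a test commonly used for over-representation analysis. It is not taken from any of the tools cited above, which may use different statistics and multiple-testing corrections; the function name, its parameters, and the toy numbers are assumptions for illustration only.

```python
from math import comb


def enrichment_p_value(study_hits: int, study_size: int,
                       population_hits: int, population_size: int) -> float:
    """Probability of seeing at least `study_hits` annotated gene products
    in a study set drawn at random from the background population
    (upper tail of the hypergeometric distribution)."""
    max_hits = min(study_size, population_hits)
    total = comb(population_size, study_size)
    tail = sum(
        comb(population_hits, k)
        * comb(population_size - population_hits, study_size - k)
        for k in range(study_hits, max_hits + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 1000 background gene products, 20 of them
# annotated to a GO term; a study set of 10 contains 4 annotated ones.
p = enrichment_p_value(study_hits=4, study_size=10,
                       population_hits=20, population_size=1000)
print(f"p = {p:.3g}")
```

The sketch also shows why sparse annotation limits such analyses: if few gene products are associated with a specific term, `population_hits` and `study_hits` stay small and the term is unlikely to reach significance.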
The GO is structured as directed graphs, with each GO term having a unique term name, for example, phosphatidylcholine-sterol O-acyltransferase activity, proteasomal protein catabolic process, or high-density lipoprotein particle, and a definition (Figure 2) as well as a computer-readable numerical identifier. In addition, the ontology is a dynamic resource, with the ontology itself continually being expanded and refined to capture current knowledge. Although GO terms exist which describe most gene products’ processes, functions, and locations, many of these terms are very general and are not specific enough to fully describe the role of AD-associated gene products. The UCL Functional Gene Annotation group has begun to address this issue through the development of the ontology to provide more specific and descriptive GO terms, by improving the existing term definitions and by revising the existing ontology structure (26, 28, 29). The association of these more specific GO terms prevents the loss of valuable descriptions of gene products, based on experimental information, that would have been unavailable if the more general GO term had been applied. For example, we have improved the ontology domains describing the unfolded protein response (UPR) (28), autophagy (29), and neuron projection development (26). These improvements have led to an expansion of the number of GO terms describing these processes, as well as revision of relationships between terms within the ontology. All of these biological processes have relevance to AD as well as Parkinson’s disease and other neurological conditions.
Figure 2 A selection of the dendrite Gene Ontology graph. This figure was generated by the QuickGO browser (37) and shows the is_a (black arrows) and part_of (blue arrows) hierarchy of just a small number of terms within the dendrite branch of the ontology; currently, the dendritic GO domain has 36 terms. The general term dendrite is used to group different types of dendrites, for example, primary dendrite and distal dendrite are both more descriptive child terms of dendrite. The definition (as displayed in QuickGO) of one of the GO terms, dendritic branch, is also included. The yellow highlighted terms were contributed by the UCL Functional Gene Annotation group.
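The graph structure described above can be sketched as a mapping from each term to its parents, with annotations to a specific term also counting toward its more general ancestors. This is an illustrative toy under stated assumptions: only the two is_a links from primary dendrite and distal dendrite to dendrite come from the text; the other edges, the gene product names, and the function names are hypothetical placeholders and were not checked against the ontology.

```python
from collections import deque

# Toy edges: child term -> list of (relationship, parent term).
# Only the first two entries are stated in the text; the rest are
# illustrative assumptions.
PARENTS = {
    "primary dendrite": [("is_a", "dendrite")],
    "distal dendrite": [("is_a", "dendrite")],
    "dendritic branch": [("part_of", "dendrite")],   # assumed edge
    "dendrite": [("is_a", "neuron projection")],     # assumed edge
    "neuron projection": [],
}


def ancestors(term: str, parents=PARENTS) -> set[str]:
    """Collect every term reachable by following parent links upward."""
    seen: set[str] = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def propagate(direct: dict[str, set[str]], parents=PARENTS) -> dict[str, set[str]]:
    """Map each term to the gene products annotated to it or to any
    of its descendants."""
    propagated: dict[str, set[str]] = {}
    for term, gene_products in direct.items():
        for target in {term} | ancestors(term, parents):
            propagated.setdefault(target, set()).update(gene_products)
    return propagated


# Hypothetical direct annotations.
direct = {"primary dendrite": {"GENE_A"}, "distal dendrite": {"GENE_B"}}
print(propagate(direct)["dendrite"])
```

Annotating to the most descriptive term therefore loses nothing: the gene product is still retrievable through the general parent term, whereas an annotation made only to the parent cannot be recovered at the more specific level.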
Although GO terms are categorized into three key domains, as introduced above, revisions in one domain are often done in conjunction with another domain describing the same biological niche. For instance, our work on neuron projection development (26), a biological process GO term, resulted, first, in the contribution of new, more descriptive GO terms, such as neuron projection arborization, dendrite morphogenesis, or dendrite arborization. Simultaneously, we also improved the dendrite branch of the cellular component GO aspect, as shown in Figure 2. Similarly, curation of the autophagy (29) processes led not only to the generation of highly specific biological process GO terms but also to revisions of related cellular component terms, such as autophagosome, amphisome, or late endosome. Thus, enhancing one ontology branch within a specific domain of GO is often done in conjunction with improvements in other branches and domains, consequently enriching the ontology resource more broadly.
The neuroscience research community will also have benefited from curation work of the SYSCILIA research Consortium, which involved revisions and improvements to cilia-related biology in GO, resulting in contribution of 50 new GO terms (30). Among others, ciliary dysfunction has been shown to affect Sonic hedgehog signaling in the brain, a pathway with demonstrated implications in Alzheimer’s (31), Parkinson’s (32), and Huntington’s (33) diseases. Consequently, revisions and new contributions to the ciliary niche will have improved the representation of cilia biology in GO and, therefore, resulted in more informative analyses of neurological datasets with changes in ciliary proteins.
Another ongoing biocuration initiative with direct relevance to elucidation of Alzheimer’s data is the SynGO project and the associated synaptic GO portal (34). SynGO is a collaboration between the Stanley Center for Psychiatric Research at the Broad Institute (Cambridge, MA, USA), the Center for Neurogenomics and Cognitive Research at the Vrije Universiteit (Amsterdam, The Netherlands), and the GO Consortium, thus combining the efforts of experts in synapse biology and GO biocurators to generate the best possible representation of synapse biology in GO.
The UniProt Knowledgebase (EMBL-EBI, Cambridge, UK) has also been improving the representation of Alzheimer’s data in their resource, an initiative which includes GO annotation as well as IntAct (13) curation of protein–protein interaction and/or curation of disease variants, as a part of a project funded by the National Institutes of Health (USA). The ultimate goal of this project is to create an online AD portal with thoroughly annotated and easily searchable information on the disease and biological pathways impaired in dementia (35). Importantly, all biocuration scientists, aiming to improve the representation of dementia-relevant biology in GO, work together under the auspices of the GO Consortium, thus ensuring GO annotation consistency and quality.
There are two major approaches that rely on concerted efforts of skilled biologists and software engineers (36), which result in high-quality GO annotations: manual techniques that depend on the knowledge and expertise of biocuration scientists and computational methods that generate annotations, for instance, based on sequence similarity algorithms. Every annotation is attributed to an identified reference, often a publication identifier, such as PMID, and each annotation must indicate what kind of evidence supports the association between the gene product and the GO term (Figure 1).
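The requirement that every annotation carries a reference and an evidence code can be pictured as a simple record. The sketch below is a hypothetical illustration, not the GO Consortium's file format: the class, field names, gene symbols, and reference identifiers are invented placeholders, and IEA (Inferred from Electronic Annotation) is used here to stand for a computationally assigned annotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    symbol: str      # gene product symbol (placeholder values below)
    go_term: str     # GO term name; real files also carry the identifier
    evidence: str    # evidence code, e.g. IDA, IMP, IPI, IEA
    reference: str   # source of the assertion, e.g. a PMID


# Experimental evidence codes named in the Figure 1 legend.
EXPERIMENTAL_CODES = frozenset({"IDA", "IMP", "IPI"})

# Invented example records; the references are placeholders.
ANNOTATIONS = [
    Annotation("GENE_A", "scavenger receptor activity", "IDA", "PMID:0000001"),
    Annotation("GENE_A", "extracellular region", "IEA", "REF:placeholder"),
    Annotation("GENE_B", "microtubule cytoskeleton organization", "IMP", "PMID:0000002"),
]


def supported_by(annotations, reference):
    """Annotations attributed to one reference (cf. filtering by PMID)."""
    return [a for a in annotations if a.reference == reference]


def experimental(annotations):
    """Keep only annotations backed by an experimental evidence code."""
    return [a for a in annotations if a.evidence in EXPERIMENTAL_CODES]


for a in experimental(ANNOTATIONS):
    print(a.symbol, a.go_term, a.evidence, a.reference)
```

Keeping the evidence code and reference on every record is what makes each annotation traceable back to an individual experiment and lets users restrict an analysis to manually curated, experimentally supported data.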
The computational annotation approach is a high-throughput and efficient method of associating high-level terms to a large number of gene products across all genomes. These annotations are often assigned, based on specific protein domains with known functions or cellular locations, or based on orthology to a manually curated gene product. However, to provide more specific annotations, GO biocurators read the published scientific literature and use the published data to manually associate highly descriptive GO terms to gene products. Consequently, complete, highly detailed annotation of the processes and networks that a single gene product is involved in may take a considerable time. Depending on the number of published papers describing the gene product, a curator will annotate an average of 1–3 experimental papers per day.
Furthermore, as there is no limit to the number of GO annotations that can be assigned to a gene product record, it is possible to describe the many different roles that the gene product may have, depending on the cell type it is expressed in, the developmental stage of the organism, and the environmental stimuli the cell is responding to. The UCL Functional Gene Annotation group takes an unusual approach to annotation, in that we usually focus on annotation of a specific process involving a number of gene products, such as amyloid precursor protein processing, rather than working through an unrelated set of gene products. This enables us to develop a better understanding of the biology and apply a consistent annotation approach to all gene products involved in the process, thus providing depth to the annotations. In addition, at UCL we annotate full papers, whereas some groups will curate only the information in a paper that is relevant to a specific prioritized gene. This approach enables us to provide annotations to a large number of relevant gene products, involved in a specific process, which may not be included in the list of annotation priorities. For example, after completing the annotation of 84 proteins and protein complexes, prioritized for annotation as part of the amyloid-beta or tau projects, we had, in total, annotated 526 proteins and complexes (26).
Furthermore, in response to the research community’s needs (39), at UCL our annotation procedure involves inclusion of annotation extensions (40) to capture information about the cell and tissue types in which a particular gene product is active, as well as the specific target of a protein or a microRNA. These detailed annotations provide critical knowledge for biomarkers, diagnostics, and drug discovery and will be of considerable value to the research community. They also allow users of GO to query a variety of data. For example, a GO user could investigate all targets of a particular protein ubiquitin ligase, or, more specifically, search for all proteins involved in catabolic pathways in microglial cells. Unfortunately, although biocurators have been contributing the annotation extension data for over 6 years, there are no tools that are using these data, and only a few browsers display them (9, 41). In the near future, the annotation extension information will be ported to Gene Ontology Causal Activity Modeling (GO-CAM) (42).
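A query of the kind described above could be sketched as follows. The example assumes that extensions are serialized as relation(identifier) strings, with commas joining the parts of one extension and pipes separating independent extensions; this layout, the function names, and the records are assumptions for illustration. Only the relation names and the CL and CHEBI identifiers are taken from the Figure 1 legend.

```python
import re

# Assumed serialization: relation(identifier); ',' joins the parts of one
# extension, '|' separates independent extensions.
EXTENSION_PART = re.compile(r"(\w+)\(([^)]+)\)")


def parse_extension(text: str) -> list[list[tuple[str, str]]]:
    """Return one list of (relation, target) pairs per independent extension."""
    groups = []
    for alternative in text.split("|"):
        pairs = EXTENSION_PART.findall(alternative)
        if pairs:
            groups.append(pairs)
    return groups


def matches(text: str, relation: str, target: str) -> bool:
    """True if any extension contains the given relation and target."""
    return any((relation, target) in group for group in parse_extension(text))


# Hypothetical records: (gene product, GO term name, extension string).
RECORDS = [
    ("GENE_A", "scavenger receptor activity",
     "occurs_in(CL:0002144),activated_by(CHEBI:64646)"),
    ("GENE_B", "scavenger receptor activity", ""),
    ("GENE_C", "microtubule cytoskeleton organization", "occurs_in(CL:0002144)"),
]


def gene_products_with(records, relation, target):
    """Gene products whose annotation extension matches relation and target."""
    return [symbol for symbol, _term, ext in records if matches(ext, relation, target)]


# All gene products annotated as acting in capillary endothelial cells.
print(gene_products_with(RECORDS, "occurs_in", "CL:0002144"))
```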
Historically, GO was used specifically for annotation of proteins. Recently, the GO Consortium has extended the range of gene products that are annotated; rather than annotating only proteins, some members of the GO Consortium are now annotating protein complexes (43) and microRNAs (26, 44, 45). To curate these entities, it has been necessary to create new identifiers (43, 46) and develop strict guidelines to ensure that a consistent annotation approach is applied. For example, there are many papers describing the coregulation of a microRNA or a set of microRNAs with the transcription of a panel of mRNAs and implying that these microRNAs therefore regulate the coregulated mRNAs. Such data do not comply with quality standards implemented by the GO Consortium and are not being captured as GO annotations (44). Instead, microRNA GO annotations are contributed based on more precise low-throughput functional experiments, involving microRNA mimics or knockdown, followed by an assessment of the expression of a panel of specific mRNAs. In addition, reporter assay data, confirming a direct interaction between a microRNA and an mRNA, are being captured using specific GO terms (e.g., mRNA binding involved in posttranscriptional gene silencing). Furthermore, in these cases the annotation extension will be used to capture the identifier of the targeted mRNA. The resulting interaction data are available not only in the GO annotation files but also within the EBI-GOA-miRNA dataset from the PSICQUIC web server (45).
By creating an open access dataset of high-quality annotations, which describe the cellular role of those proteins and microRNAs that contribute to pathways dysregulated in AD, the GO provides an invaluable resource for researchers. GO annotations are incorporated into over 50 functional analysis tools, the majority of which are freely available, such as g:Profiler (21), PANTHER (47), Cytoscape (22), and DAVID (23), but others are subscription based, such as Ingenuity Pathway Analysis (QIAGEN Bioinformatics) (48) and MetaCore (Clarivate Analytics) (49). These tools, and many other functional analysis tools, are used by researchers to analyze a variety of high-throughput data, including transcriptomic (4, 50–53), proteomic (5, 54, 55), and GWA (6, 14, 56) data. In addition, existing pipelines ensure that the GO annotations are included in widely used public resources such as UniProt (57), NCBI Gene (58), Ensembl (59), RNAcentral (46), and even Wikipedia. GO annotations associated with individual protein, RNA, or macromolecular complex records are used by researchers to extract a synopsis of the cellular role of a gene product. These gene summaries have many uses in research, for example, they can help guide researchers to the most likely candidate gene associated with a risk locus (14, 18). However, it is the use of GO for the interpretation of data from high-throughput analyses where this resource can be exploited to the full.
The quality of the GO annotations used in the analysis of large biological datasets will determine how informative the outcomes of this analysis will be. Without highly descriptive annotations the analysis can only identify GO functions, processes, or locations that are not very specific, such as site of polarized growth, wound healing, and cell migration (54). The identification of more informative enriched terms is dependent not only on the presence of highly descriptive GO terms describing biological knowledge, but also on the association of these terms with a sufficient number of gene products to enable the term to be detected as significantly enriched. A recent meta-analysis of late-onset AD that included over 94,000 individuals identified five new risk loci, associated with amyloid-beta and tau processes, as well as immune response pathways and lipid processing (14). This meta-analysis took a wide range of approaches to identify new risk loci, one of which was the use of the pathway analysis software, MAGMA (60), and GO annotation files (36). The GO terms plasma lipoprotein particle assembly, reverse cholesterol transport, regulation of amyloid precursor protein catabolic process, and activation of immune response were identified as processes with relevance to AD. The first three of these GO terms provide a good description of the processes involved, whereas the last term, activation of immune response, is too general to give a clear indication of the mechanism involved. This is likely to reflect the considerable investment in annotation of cardiovascular (18, 61, 62) and nervous system genes (26, 28), and the lack of focused annotation of the immune system. In addition, papers describing the immune system are often highly detailed and more challenging for biocurators without a background in immunology to fully annotate (63). Thus, the annotation of immune-associated pathways does not reflect the volume of literature and knowledge in this domain.
Another study aimed to elucidate protein expression in different brain regions in Alzheimer’s cases relative to controls to provide a broader understanding of molecular pathways impaired in dementia (64). In this study, GO analysis was used to identify the biological processes that had the largest numbers of differentially expressed proteins associated with them. A wide variety of processes were identified in this way, including regulation of apoptosis associated with the hippocampus and protein transport associated with the cerebellum and cingulate gyrus, thereby allowing researchers to identify new routes for potential therapeutic interventions.
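Ranking processes by the number of differentially expressed proteins associated with them amounts to a simple tally. The sketch below illustrates the idea only and is not the pipeline used in the cited study (64); the protein identifiers, the annotation mapping, and the function name are hypothetical, and the two term names are borrowed from the text above.

```python
from collections import Counter


def rank_terms(differential: set[str],
               annotations: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Count, for each GO term, how many differentially expressed proteins
    are annotated to it; most frequent terms first, ties sorted by name."""
    counts: Counter[str] = Counter()
    for protein in differential:
        for term in annotations.get(protein, ()):
            counts[term] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical proteins and annotations.
annotations = {
    "P1": {"protein transport"},
    "P2": {"protein transport", "regulation of apoptosis"},
    "P4": {"regulation of apoptosis"},
}
differential = {"P1", "P2", "P3"}  # P3 has no annotation; P4 is not in the set

for term, count in rank_terms(differential, annotations):
    print(count, term)
```

A raw tally of this kind favors broad terms with many annotated gene products, which is one reason enrichment statistics are usually applied alongside it.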
GO term enrichment analysis has also been implemented in pilot studies aiming to identify biomarkers associated with dementia, which can be detected using noninvasive methods in easily accessible bodily fluids, such as blood (65) and urine (66). For instance, Chouliaras et al. (65) used GO enrichment together with KEGG pathway analysis to demonstrate the relationships between the identified blood biomarkers and neurological processes and neuronal components. Significantly enriched GO terms included regulation of amyloid-beta formation and amyloid-beta binding, main axon, and ion channel complex, whereas KEGG pathways included glutamatergic synapse and Alzheimer’s disease, thus confirming their relevance to cognitive impairments.
Similarly, Watanabe et al. (66) used GO term enrichment and KEGG pathway analyses to delineate the roles of proteins differentially expressed in urine of Alzheimer’s patients relative to healthy controls to identify a urine biomarker signature, which could be used for noninvasive diagnostic purposes. Lipoprotein metabolism, heat shock protein 90 signaling pathway, and matrix metalloproteinase signaling pathway as well as redox regulation by thioredoxin were among the molecular pathways with the highest enrichment scores, providing evidence for impairment of vascular processes key to the development of dementia (66). In addition, Watanabe et al. (66) supplemented their functional GO and KEGG analyses with an interrelation network analysis to determine which of the proteins differentially expressed in the Alzheimer’s urine samples interact with each other (either directly or via an intermediate). This network analysis of molecular relationships enabled these researchers to further elucidate which GO biological processes and KEGG pathways should be prioritized in future studies and whether they correlate with and confirm other findings. An alternative approach to using GO annotations is to visualize them on an interaction network. This provides the researcher with an overview of the contribution that a network, or part of the network, makes to a particular process or the cellular location of the interacting entities, as shown in the example in Figure 3. Thus, the use of multiple, interoperable annotation resources provides the opportunity to fully exploit and interrogate individual datasets.
Figure 3 Network of proteins identified in Alzheimer’s disease meta-analysis. Nine proteins identified in an Alzheimer’s disease meta-analysis (14) due to their association with the GO term “regulation of amyloid precursor protein catabolic process” were used to seed an interaction network using Cytoscape (22) and five files available on the PSICQUIC web server (67) (IntAct, BHF-UCL, UniProt, MINT, and EBI-GOA-non-IntAct). The seed proteins are outlined in yellow. The network was analyzed using Golorize (68), BiNGO (69), and GO ontology and annotation files (36), as described in Denny et al. (29) (downloaded March 29, 2019). The proteins associated with a selection of the enriched GO terms (or one of their child terms, including regulation child terms) are shown in the network.
The above examples demonstrate how continuous, systematic, and consistent improvements to the GO resource, including contribution of new descriptive GO terms and their association with gene products in the form of GO annotations, lead to more informative outcomes of analyses of high-throughput datasets. The functional analyses relying on GO allow researchers, first, to plan and design further studies leading to a better understanding of the molecular mechanisms underlying dementia and, second, to develop noninvasive diagnostic methods, which collectively will help to improve the management and treatment of AD.
The GO resource (8, 9, 70) adds value to published experimental data by creating computer-readable annotations that describe specific functions of a gene product, such as protein, complex, or noncoding RNA, and the biological processes and pathways it contributes to. This benefits all biological areas, including the AD field, on which the UCL Functional Gene Annotation team has recently been focusing their biocuration efforts. Gene annotations using GO terms enable groups of gene products, with similar cellular roles or locations in the cell, to be easily identified within a dataset, such as a list of differentially expressed genes from AD cases. Thus, dysregulated pathways, functions, and macromolecular complexes can be identified within high-throughput datasets using GO annotation data and functional enrichment tools. GO annotation data are therefore needed for pathway construction, enrichment analyses and interpretation of large-scale datasets (3–5) and to inform biomarker selection decisions (27), and can also be used to identify novel drug targets or novel repurposing of drugs. Furthermore, the AD-focused and comprehensive efforts of the UCL Functional Gene Annotation team have improved and continue to improve the GO resource, enhancing its applicability to this neurobiological research domain and facilitating analyses and interpretation of AD big data.
Acknowledgments: The UCL Functional Gene Annotation group has been supported by Alzheimer’s Research UK grants ARUK-NSG2016-13, ARUK-NAS2017A-1, and ARUK-NSG2018-003 and the National Institute for Health Research University College London Hospitals Biomedical Research Centre. We thank the many biocurators, editors, and other members of the GO Consortium who have contributed to the annotation of dementia-relevant gene products and the development of the Gene Ontology, especially Dr Rachael P. Huntley. The GO Consortium is supported by a grant from the National Human Genome Research Institute (grant no. U41 HG002273 to P.D. Thomas, P.W. Sternberg, C.J. Mungall, J.M. Cherry, and J.A. Blake).
Conflict of interest: The authors declare no potential conflicts of interest with respect to research, authorship, and/or publication of this chapter.
Copyright and Permission Statement: To the best of our knowledge, the materials included in this chapter do not violate copyright laws. All original sources have been appropriately acknowledged and/or referenced.