Annotation and Annotation Resources

With content from Lori Shepherd Kern, Martin Morgan and James W. MacDonald. CSAMA 2026

Johannes Rainer

Institute for Biomedicine, Eurac Research

Introduction

Data analysis

  • Annotation: external data fed into an analysis to make sense of quantified entities.

Annotation

  • put a (new or different) label on an entity
  • entity: something measured with an assay (RNA, molecule, cell)
  • label: commonly used (standard) identifier (e.g. gene name)
  • sounds trivial, but is a very important task in computational biology
  • examples:
    • map positions on the genome to transcripts
    • map genes to biological pathways
    • annotate cell types based on presence of marker genes
  • 🎯 enable or assist in interpretation; make biological sense of assay data.

In this presentation

  • 1️⃣ provide some general information on annotation
  • 2️⃣ give you an overview of available annotation resources
  • 3️⃣ short use cases (with code)

1️⃣ General information on annotation

What do we need for annotation?

  • 1️⃣ an annotation resource: reference information on entities or a mapping of one type of identifiers to another.
  • 2️⃣ a rule that defines how to assign labels to entities
    • annotation through similarity: sequence similarity, spectral similarity
    • direct mapping

  • examples
    • map between gene identifiers (Ensembl, NCBI, …)
    • annotate through sequence similarity: map short reads to genome
    • cell types through presence of marker genes
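The "direct mapping" rule can be sketched with a plain lookup table. This is a minimal base R illustration under stated assumptions, not a replacement for a versioned annotation resource: the `lookup` table is a hand-made toy (its three ID/symbol pairs are the ones returned by `mapIds()` in the use case later in this presentation), and `ENSG99999999999` is a made-up identifier that shows what happens to an entity without a label.

```r
# Toy annotation resource: Ensembl gene ID -> gene symbol
lookup <- c(
  ENSG00000000003 = "TSPAN6",
  ENSG00000000005 = "TNMD",
  ENSG00000000419 = "DPM1"
)

# Entities measured in a (hypothetical) assay; the last ID is made up
measured <- c("ENSG00000000419", "ENSG00000000003", "ENSG99999999999")

# Rule: direct mapping by exact identifier match; unknown IDs become NA
labels <- unname(lookup[measured])
labels
```

Entities that are not in the resource get `NA` instead of a guessed label, which keeps missing annotations visible downstream.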

Reproducibility

In order to guarantee reproducibility of annotation:

annotation resource has to be versioned and findable

🙅 dynamic annotation resources (daily updates) without the possibility to access a specific version

the same version should be used throughout the whole analysis

🙅 using a different genome release for RNA-seq alignment and gene counting.
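One way to enforce "same version throughout" is to record the resource version used in each step and compare the records. The sketch below is a base R illustration only: the list layout and the helper `same_reference()` are hypothetical, and the values (Ensembl release 75 on GRCh37, release 113 on GRCh38) are taken from the resources shown elsewhere in this presentation.

```r
# Hypothetical provenance records for two steps of one analysis
alignment_ref <- list(resource = "Ensembl", release = 75L,
                      genome_build = "GRCh37")
counting_ref  <- list(resource = "Ensembl", release = 75L,
                      genome_build = "GRCh37")

# Hypothetical helper: TRUE if both steps used the same resource version
same_reference <- function(a, b) {
  fields <- c("resource", "release", "genome_build")
  identical(a[fields], b[fields])
}

same_reference(alignment_ref, counting_ref)

# A mismatch (e.g. counting against a GRCh38-based release) is detected
mismatched_ref <- list(resource = "Ensembl", release = 113L,
                       genome_build = "GRCh38")
same_reference(alignment_ref, mismatched_ref)
```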

Standardization and metadata

  • Common nomenclature is required
  • It’s great if you can assign a name to an entity, but other researchers should also understand what you mean.
  • example: synonyms
  • Synonyms: aliases or previous names of entities.
  • Can be tricky if different names are used across publications.
  • 👉 use the official HGNC gene name!

Standardization and metadata

  • 🙀 … it can always be worse …

Why? The annotation resource was compiled from community-provided information.

  • Standardization and common nomenclature are important!
  • Standardization is important for human consumers, but even more so for computational use (including AI).

2️⃣ Annotation resources

Annotation resources

  • Where can you get annotations from?

  • For genes
  • For proteins
  • For small molecules
  • Pathways

  • 😒 some annotation resources use their own data/file format, making it cumbersome to include them in data analysis workflows.

Bioconductor annotation resources

  • Format standardized to simplify integration into the analysis (big thanks to the maintainers! 🙌)
  • AnnotationDbi package defines a common interface to extract information from annotation resources: mapIds() and select() functions.

Bioconductor annotation resources

  • 💍 one to get them all: AnnotationHub

  • Central registry for annotation resources.
library(AnnotationHub)

#' Synchronize and cache AnnotationHub information locally
ah <- AnnotationHub()
#' List all available resources
ah
AnnotationHub with 69129 records
# snapshotDate(): 2026-04-23
# $dataprovider: Ensembl, BroadInstitute, UCSC, Haemcode, FANTOM5,DLRP,IUPHA...
# $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Rattus norv...
# $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, ChainFile, SQLit...
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH5012"]]' 

             title                          
  AH5012   | Chromosome Band                
  AH5013   | STS Markers                    
  ...        ...                            
  AH122185 | Data.table for MeSH (Qualifier)
  AH122186 | Data.table for MeSH (SCR)      
  • Resources come in a variety of different formats and data types.
#' List data class
table(ah$rdataclass)

                      AAStringSet                        BigWigFile 
                                1                             10247 
                           biopax                         ChainFile 
                                9                              1115 
                        character                            CompDb 
                               10                                 8 
                       data.frame data.frame, DNAStringSet, GRanges 
                               57                                 3 
                       data.table                             EnsDb 
                               25                              5272 
                           FaFile                           GRanges 
                                3                             30545 
                           igraph                     Inparanoid8Db 
                                2                               268 
                           JASPAR                              list 
                                3                                71 
                           MSnSet                          mzRident 
                                1                                 1 
                            OrgDb                               Rda 
                               18                                45 
                              Rle                            sqlite 
                             2365                                 1 
                           SQLite                  SQLiteConnection 
                                5                                 3 
                       SQLiteFile                            String 
                              631                                16 
                           Tibble                        TwoBitFile 
                               69                             17825 
                             TxDb                           VcfFile 
                              502                                 8 
  • A resource can be a reference to an external file.
query(ah, c("ensembl", "TwoBitFile"))
AnnotationHub with 17681 records
# snapshotDate(): 2026-04-23
# $dataprovider: Ensembl
# $species: Mus musculus, Homo sapiens, Danio rerio, Xiphophorus maculatus, ...
# $rdataclass: TwoBitFile
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH49592"]]' 

             title                                                        
  AH49592  | Ailuropoda_melanoleuca.ailMel1.cdna.all.2bit                 
  AH49593  | Ailuropoda_melanoleuca.ailMel1.dna_rm.toplevel.2bit          
  ...        ...                                                          
  AH107046 | Zosterops_lateralis_melanops.ASM128173v1.dna_sm.toplevel.2bit
  AH107047 | Zosterops_lateralis_melanops.ASM128173v1.ncrna.2bit          
ah["AH49592"]
AnnotationHub with 1 record
# snapshotDate(): 2026-04-23
# names(): AH49592
# $dataprovider: Ensembl
# $species: Ailuropoda melanoleuca
# $rdataclass: TwoBitFile
# $rdatadateadded: 2015-12-28
# $title: Ailuropoda_melanoleuca.ailMel1.cdna.all.2bit
# $description: TwoBit cDNA sequence for Ailuropoda melanoleuca
# $taxonomyid: 9646
# $genome: ailMel1
# $sourcetype: FASTA
# $sourceurl: ftp://ftp.ensembl.org/pub/release-82/fasta/ailuropoda_melanole...
# $sourcesize: 10988959
# $tags: c("TwoBit", "ensembl", "sequence", "2bit", "FASTA") 
# retrieve record with 'object[["AH49592"]]' 
  • Or a dedicated R object.
  • We retrieve one such object from AnnotationHub.
query(ah, c("Hsapiens", "EnsDb"))
AnnotationHub with 28 records
# snapshotDate(): 2026-04-23
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

             title                             
  AH53211  | Ensembl 87 EnsDb for Homo Sapiens 
  AH53715  | Ensembl 88 EnsDb for Homo Sapiens 
  ...        ...                               
  AH116860 | Ensembl 112 EnsDb for Homo sapiens
  AH119325 | Ensembl 113 EnsDb for Homo sapiens
edb <- ah[["AH119325"]]
edb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Sat Oct 26 21:34:14 2024
|ensembl_version: 113
|ensembl_host: 127.0.0.1
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 87726.
| No. of transcripts: 413674.
|Protein data available.
  • ℹ️ data is downloaded and cached. Subsequent calls will load the data from the cache.

Bioconductor annotation resources

  • … and there are also dedicated R packages with specific annotations.
  • example: identifiers and metadata for human genes.
library(org.Hs.eg.db)
org.Hs.eg.db
OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: HUMAN_DB
| ORGANISM: Homo sapiens
| SPECIES: Human
| EGSOURCEDATE: 2026-Mar18
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 9606
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: https://current.geneontology.org/ontology/go-basic.obo
| GOSOURCEDATE: 2026-01-23
| GOEGSOURCEDATE: 2026-Mar18
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
| GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/database
| GPSOURCEDATE: UTC-Mar19
| ENSOURCEDATE: 2025-Sep03
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Fri Mar 20 10:22:53 2026
  • Most annotation resources provide mappings between gene identifiers or mappings to, e.g., pathways.
  • Also available: position-relative annotations.

Position-relative annotations

  • Mapping between entities (genes, exons, protein domains) and positions (on the genome, within a transcript, within a protein).

  • GenomicFeatures::TxDb and ensembldb::EnsDb resources.

  • Organism and release-specific data (SQLite databases).

  • Positional information: genomic position of exons

#' extract exon coordinates from the resource downloaded from AnnotationHub
exons(edb)
GRanges object with 1131826 ranges and 1 metadata column:
                  seqnames            ranges strand |         exon_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENSE00004248723        1       11121-11211      + | ENSE00004248723
  ENSE00004248721        1       11125-11211      + | ENSE00004248721
              ...      ...               ...    ... .             ...
  ENSE00004286596        Y 57215558-57215594      - | ENSE00004286596
  ENSE00004286597        Y 57215558-57215634      - | ENSE00004286597
  -------
  seqinfo: 540 sequences (1 circular) from GRCh38 genome

ℹ️ such resources can also be used to map e.g. between positions within a protein and the location of the encoding sequence in the genome.
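The idea behind such coordinate mappings can be sketched in base R for the simplest case: a position within a spliced transcript on the + strand is converted to a genomic coordinate by walking along the exons. The exon coordinates and the helper `transcript_to_genome()` below are hypothetical and only illustrate the arithmetic; in practice, ensembldb provides dedicated functions for this (e.g. `proteinToGenome()`) that also handle strand and coding regions.

```r
# Exons of a hypothetical + strand transcript (1-based, inclusive coordinates)
exon_start <- c(100, 200, 300)
exon_end   <- c(109, 219, 304)

# Hypothetical helper: genomic coordinate of a position within the
# spliced transcript (position 1 = first base of the first exon)
transcript_to_genome <- function(pos, exon_start, exon_end) {
  exon_width <- exon_end - exon_start + 1
  cum_width <- cumsum(exon_width)
  if (pos < 1 || pos > cum_width[length(cum_width)])
    return(NA_real_)                              # outside the transcript
  i <- which(pos <= cum_width)[1]                 # exon containing the position
  offset <- pos - (cum_width[i] - exon_width[i])  # 1-based position in exon
  exon_start[i] + offset - 1
}

# Position 11 is the first base of the second exon
transcript_to_genome(11, exon_start, exon_end)
```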

3️⃣ Use cases

Annotating RNA-seq, raw reads

  • (Short) RNA reads
  • 👉 map to genome
  • 👉 count number of reads within exons of genes

Annotating RNA-seq, raw reads

library(airway)
dir(system.file("extdata", package = "airway"))
 [1] "GSE52778_series_matrix.txt"        "Homo_sapiens.GRCh37.75_subset.gtf"
 [3] "quants"                            "sample_table.csv"                 
 [5] "SraRunInfo_SRP033351.csv"          "SRR1039508_subset.bam"            
 [7] "SRR1039509_subset.bam"             "SRR1039512_subset.bam"            
 [9] "SRR1039513_subset.bam"             "SRR1039516_subset.bam"            
[11] "SRR1039517_subset.bam"             "SRR1039520_subset.bam"            
[13] "SRR1039521_subset.bam"            
  • task: quantifying reads per gene.
  • required: positional (genome) information for genes (exons)
  • 👉 get annotation resource for genome version / Ensembl release.
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75
  • Get positional information for exons
exns <- exonsBy(edb, by = "gene")
exns
GRangesList object of length 64102:
$ENSG00000000003
GRanges object with 17 ranges and 1 metadata column:
       seqnames            ranges strand |         exon_id
          <Rle>         <IRanges>  <Rle> |     <character>
   [1]        X 99894942-99894988      - | ENSE00001828996
   [2]        X 99891790-99892101      - | ENSE00001863395
   ...      ...               ...    ... .             ...
  [16]        X 99885756-99885863      - | ENSE00000868868
  [17]        X 99883667-99884983      - | ENSE00001459322
  -------
  seqinfo: 273 sequences (1 circular) from GRCh37 genome

...
<64101 more elements>
  • Get the BAM files with the alignment results for the present experiment.
#' Get the BAM file names in the *airway* package
f <- dir(system.file("extdata", package = "airway"),
         pattern = "bam$", full.names = TRUE)

library("Rsamtools")
bamfiles <- BamFileList(f, yieldSize = 2000000)
  • Use GenomicAlignments::summarizeOverlaps() to count reads falling within exon boundaries.
library(GenomicAlignments)
se <- summarizeOverlaps(features = exns, reads=bamfiles,
                        mode = "Union", singleEnd = FALSE,
                        ignore.strand = TRUE,
                        fragments = TRUE)
se
class: RangedSummarizedExperiment 
dim: 64102 8 
metadata(0):
assays(1): counts
rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
rowData names(0):
colnames(8): SRR1039508_subset.bam SRR1039509_subset.bam ...
  SRR1039520_subset.bam SRR1039521_subset.bam
colData names(0):
  • Result: gene count data.
  • ℹ️ afternoon labs will use an updated workflow using Salmon for alignment against the transcriptome and the tximeta Bioconductor package for managing data and annotation resources.
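To make the counting rule more tangible, the following base R sketch counts reads per gene on made-up coordinates. It only mimics the general idea of the "Union" mode (a read is counted if it overlaps exons of exactly one gene, and discarded if it overlaps none or more than one); it ignores strand, paired-end fragments and chromosomes, and `count_reads()` is a hypothetical helper, not part of GenomicAlignments.

```r
# Toy exon annotation (made-up coordinates); geneA has two exons
exons <- data.frame(
  gene  = c("geneA", "geneA", "geneB"),
  start = c(100, 300, 250),
  end   = c(200, 400, 350),
  stringsAsFactors = FALSE
)
# Toy aligned reads (made-up coordinates)
reads <- data.frame(
  start = c(120, 190, 320, 260, 500),
  end   = c(170, 240, 370, 290, 550)
)

count_reads <- function(reads, exons) {
  genes <- unique(exons$gene)
  # For each read: which genes have at least one overlapping exon?
  hits <- lapply(seq_len(nrow(reads)), function(i) {
    overlaps <- reads$start[i] <= exons$end & reads$end[i] >= exons$start
    unique(exons$gene[overlaps])
  })
  # Keep reads assigned to exactly one gene; drop ambiguous and unassigned
  assigned <- unlist(hits[lengths(hits) == 1])
  vapply(genes, function(g) sum(assigned == g), integer(1))
}

counts <- count_reads(reads, exons)
counts
```

In this toy data the third read overlaps exons of both genes and the fifth read overlaps none, so neither contributes to the counts.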

Annotating gene information

  • Common task in annotation: add additional metadata to existing identifiers.
  • example: bulk RNA-seq data airway:
se
class: RangedSummarizedExperiment 
dim: 64102 8 
metadata(0):
assays(1): counts
rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
rowData names(0):
colnames(8): SRR1039508_subset.bam SRR1039509_subset.bam ...
  SRR1039520_subset.bam SRR1039521_subset.bam
colData names(0):

ℹ️ gene quantification data, 8 samples.

  • What gene identifiers do we have?
#' extract gene identifiers
ids <- rownames(se)
head(ids)
[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
[5] "ENSG00000000460" "ENSG00000000938"
  • We’ve got Ensembl gene IDs.
  • 🎯 get official HGNC gene symbols (names) for the genes.
  • 👉 use AnnotationDbi::mapIds() with org.Hs.eg.db.
library(org.Hs.eg.db)
#' available gene annotations:
columns(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[26] "UNIPROT"     
#' map Ensembl gene IDs to official gene symbols
symbols <- mapIds(org.Hs.eg.db, keys = ids, keytype = "ENSEMBL",
                  column = "SYMBOL", multiVals = "first")
head(symbols)
ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460 
       "TSPAN6"          "TNMD"          "DPM1"         "SCYL3"         "FIRRM" 
ENSG00000000938 
          "FGR" 

ℹ️ multiVals defines how one-to-many mappings are handled; the default is "first", but there are other options too ("list", "filter", "asNA", "CharacterList" or a custom function).
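What the different ways of handling ambiguity mean can be illustrated without any annotation package. The base R sketch below uses a made-up mapping table in which one key has two labels; it imitates the behaviour of some `multiVals` options and is not the AnnotationDbi implementation.

```r
# Toy one-to-many mapping (made-up identifiers): KEY_B maps to two labels
mapping <- data.frame(
  key   = c("KEY_A", "KEY_B", "KEY_B", "KEY_C"),
  label = c("LAB_1", "LAB_2", "LAB_3", "LAB_4"),
  stringsAsFactors = FALSE
)

# "list"-like: keep all labels per key
as_list <- split(mapping$label, mapping$key)

# "first"-like: keep only the first label per key
first <- vapply(as_list, function(z) z[1], character(1))

# "asNA"-like: keys with more than one label become NA
as_na <- vapply(as_list,
                function(z) if (length(z) == 1) z else NA_character_,
                character(1))

# "filter"-like: drop keys with more than one label
filtered <- as_na[!is.na(as_na)]

first
as_na
filtered
```

Picking the first match silently hides the ambiguity, so the "asNA"- or "filter"-like behaviour can be the safer choice when a wrong label would be worse than a missing one.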

Annotation for enrichment analysis

  • going one step further … 🎯 annotating genes to biological pathways.

  • 👉 define gene sets.
  • 👉 map identifiers to gene identifiers used in pathway annotation resource.

Annotation for enrichment analysis

  • Enrichment analysis for biological pathways from Reactome.
  • example: use the reactome.db package to get the list of genes per pathway.
library(reactome.db)

#' Available data
columns(reactome.db)
[1] "ENTREZID"   "GO"         "PATHID"     "PATHNAME"   "REACTOMEID"

ℹ️ gene identifiers: resource uses NCBI EntrezGene IDs.

  • Define the gene sets.
mapping <- select(reactome.db, keys = keys(reactome.db),
                  columns = c("ENTREZID", "PATHID"))
head(mapping)
  ENTREZID        PATHID
1        1  R-HSA-109582
2        1  R-HSA-114608
3        1  R-HSA-168249
4        1  R-HSA-168256
5        1 R-HSA-6798695
6        1   R-HSA-76002
  • Convert into a list of genes per pathway.
gs <- split(mapping$ENTREZID, mapping$PATHID)
gs[1:2]
$`R-BTA-1059683`
[1] "280826" "282081" "507359" "512484" "527418" "533590"

$`R-BTA-109581`
 [1] "100101492" "100140945" "100296226" "280730"    "280955"    "281020"   
 [7] "281048"    "281169"    "282125"    "282126"    "282152"    "282321"   
[13] "282691"    "286862"    "286863"    "287022"    "327672"    "404144"   
[19] "404151"    "407101"    "407111"    "408016"    "493720"    "493999"   
[25] "504707"    "504727"    "506223"    "507481"    "507804"    "507981"   
[31] "508345"    "508646"    "510373"    "510767"    "512405"    "514090"   
[37] "516952"    "517850"    "528453"    "529902"    "531514"    "533233"   
[43] "533527"    "533949"    "538801"    "539003"    "539350"    "539679"   
[49] "539941"    "540108"    "540369"    "540444"    "540643"    "540892"   
[55] "541141"    "614840"    "616398"    "768311"    "785911"   
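The first gene sets in the output above have identifiers starting with R-BTA rather than R-HSA, which suggests that the mapping is not restricted to human pathways. A possible way to keep only human gene sets is to filter on the identifier prefix. The sketch below uses a small made-up subset of the mapping (values copied from the outputs above) and assumes that human pathway identifiers start with "R-HSA-", as in the `head(mapping)` output.

```r
# Small toy subset of the mapping, with values copied from the outputs above
mapping <- data.frame(
  ENTREZID = c("1", "1", "280826", "282081"),
  PATHID   = c("R-HSA-109582", "R-HSA-114608",
               "R-BTA-1059683", "R-BTA-1059683"),
  stringsAsFactors = FALSE
)

# List of genes per pathway, as in the step above
gs <- split(mapping$ENTREZID, mapping$PATHID)

# Keep only gene sets whose identifier has the (assumed) human prefix
gs_human <- gs[startsWith(names(gs), "R-HSA-")]
names(gs_human)
```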
  • Map input genes from Ensembl to EntrezGene IDs.
entrez <- mapIds(edb, keys = rownames(se), keytype = "GENEID",
                 column = "ENTREZID", multiVals = "first")
  • 👉 use this definition of gene sets in enrichment functions, such as EnrichmentBrowser::sbea() or others.
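library(EnrichmentBrowser)

#' keep only genes with a unique, non-missing EntrezGene ID
keep <- !is.na(entrez) & !duplicated(entrez)
se <- se[keep, ]
rownames(se) <- entrez[keep]
#' ... assuming differential expression analysis was done too ...
res <- sbea(method = "ora", se, gs = gs)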

ℹ️ EnrichmentBrowser makes the mapping easier with its idMap() and getGenesets() functions.
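Over-representation analysis (ORA) is commonly based on a hypergeometric test: is the overlap between the differentially expressed genes and a gene set larger than expected by chance? The base R sketch below shows that idea on made-up gene identifiers. The helper `ora_pvalue()` is hypothetical and is not meant to reproduce what `sbea()` does internally.

```r
# Toy data (made-up gene IDs): 100 measured genes, 10 of them in the pathway
universe <- paste0("g", 1:100)
gene_set <- paste0("g", 1:10)
# 10 differentially expressed genes, 5 of them in the pathway
de_genes <- paste0("g", c(1:5, 50:54))

# Hypothetical helper: P(overlap >= observed) under the hypergeometric model
ora_pvalue <- function(de, set, universe) {
  de <- intersect(de, universe)
  set <- intersect(set, universe)
  overlap <- length(intersect(de, set))
  phyper(overlap - 1, m = length(set), n = length(universe) - length(set),
         k = length(de), lower.tail = FALSE)
}

p <- ora_pvalue(de_genes, gene_set, universe)
p
```

The result depends on the universe: only genes that are both measured and annotated should be counted, which is why identifiers are mapped to the identifiers of the pathway resource first.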

Annotating mass spectrometry data

  • 👀 Thursday’s Metabolomics lecture …

Annotating single-cell RNA-seq

🎯 cell type recognition from single-cell RNA-seq data.

SingleR package:

👉 reference gene expression data for known (pure) cell types.

👉 unbiased annotation by comparing single-cell gene expression against reference.

  • Load reference dataset using celldex: Human Primary Cell Atlas
library(celldex)
hpca_se <- HumanPrimaryCellAtlasData()
hpca_se
class: SummarizedExperiment 
dim: 19363 713 
metadata(0):
assays(1): logcounts
rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(0):
colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
colData names(3): label.main label.fine label.ont
  • Load the test dataset using scRNAseq: human embryonic stem cells from La Manno et al. (2016), restricted to the first 100 cells.
library(scRNAseq)
hESCs <- LaMannoBrainData("human-es")
hESCs <- hESCs[, 1:100]
  • Annotate each cell based on (non-parametric) correlation of gene expression data between test and reference.
library(SingleR)
pred_ct <- SingleR(test = hESCs, ref = hpca_se, assay.type.test = 1,
                   labels = hpca_se$label.main)
pred_ct
DataFrame with 100 rows and 4 columns
                                        scores               labels delta.next
                                      <matrix>          <character>  <numeric>
1772122_301_C02 0.347652:0.139036:0.109547:... Neuroepithelial_cell  0.0833286
1772122_180_E05 0.361187:0.155395:0.134934:...              Neurons  0.0728350
...                                        ...                  ...        ...
1772122_298_F09 0.332361:0.173357:0.141439:... Neuroepithelial_cell  0.1200606
1772122_302_A11 0.324928:0.127518:0.101609:...            Astrocyte  0.0509478
                       pruned.labels
                         <character>
1772122_301_C02 Neuroepithelial_cell
1772122_180_E05              Neurons
...                              ...
1772122_298_F09 Neuroepithelial_cell
1772122_302_A11            Astrocyte
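The core idea, assigning each cell the label of the most similar reference profile, can be sketched in base R with Spearman correlation. The expression values below are made up and the sketch leaves out steps that SingleR performs (such as restricting the comparison to marker genes and fine-tuning of labels), so it illustrates the principle rather than the SingleR algorithm.

```r
# Toy reference: expression of 5 genes in two known cell types (made-up values)
ref <- cbind(Neurons = c(9, 8, 1, 2, 1), Astrocyte = c(1, 2, 9, 8, 2))
rownames(ref) <- paste0("gene", 1:5)

# Toy test data: the same 5 genes measured in two cells (made-up values)
test <- cbind(cell1 = c(8, 9, 2, 1, 1), cell2 = c(2, 1, 8, 9, 3))
rownames(test) <- paste0("gene", 1:5)

# Spearman (rank) correlation of each cell against each reference profile
scores <- cor(test, ref, method = "spearman")

# Assign each cell the label of the best-correlated reference profile
labels <- colnames(ref)[max.col(scores, ties.method = "first")]
names(labels) <- rownames(scores)
labels
```

Rank-based correlation makes the comparison less sensitive to differences in scale between test and reference data, e.g. when they come from different technologies.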

Thank you for your attention 🙌

References

La Manno, Gioele, Daniel Gyllborg, Simone Codeluppi, et al. 2016. “Molecular Diversity of Midbrain Development in Mouse, Human, and Stem Cells.” Cell 167 (2): 566–580.e19. https://doi.org/10.1016/j.cell.2016.09.027.
Love, Michael I., Simon Anders, Vladislav Kim, and Wolfgang Huber. 2015. “RNA-Seq Workflow: Gene-Level Exploratory Analysis and Differential Expression.” F1000Research 4: 1070. https://doi.org/10.12688/f1000research.7035.1.