Annotation and Annotation Resources

With content from Lori Shepherd Kern, Martin Morgan and James W. MacDonald. CSAMA 2026.

Johannes Rainer

Institute for Biomedicine, Eurac Research

Introduction

Data analysis

Annotation: external data fed into an analysis to make sense of quantified entities.

Annotation

put a (new or different) label on an entity
entity: something measured with an assay (RNA, molecule, cell)
label: commonly used (standard) identifier (e.g. gene name)
sounds trivial, but is a very important task in computational biology
examples:
- map positions on the genome to transcripts
- map genes to biological pathways
- annotate cell types based on presence of marker genes
🎯 enable or assist in interpretation; make biological sense of assay data.

In this presentation

1️⃣ provide some general information on annotation
2️⃣ give you an overview of available annotation resources
3️⃣ short use cases (with code)

1️⃣ General information on annotation

What do we need for annotation?

1️⃣ an annotation resource: reference information on entities or a mapping of one type of identifiers to another.
2️⃣ a rule that defines how to assign labels to entities
- annotation through similarity: sequence similarity, spectral similarity
- direct mapping

examples
- map between gene identifiers (Ensembl, NCBI, …)
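A minimal base R sketch of the "direct mapping" rule, assuming a small hand-made lookup table (a real analysis would take it from a versioned annotation resource). The three Ensembl-to-symbol pairs are the ones shown in the use cases of this presentation; the last query identifier is made up to show an unmapped entity.

```r
# Toy annotation resource: a two-column mapping table.
resource <- data.frame(
    ensembl = c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419"),
    symbol = c("TSPAN6", "TNMD", "DPM1"),
    stringsAsFactors = FALSE
)

# Entities to annotate; the last identifier is hypothetical (not in the table).
ids <- c("ENSG00000000419", "ENSG00000000003", "ENSG99999999999")

# Rule: exact match on the identifier; unmapped entities get NA.
labels <- resource$symbol[match(ids, resource$ensembl)]
names(labels) <- ids
labels
```

Unmapped identifiers are reported as NA instead of being dropped silently, so the result always has one label per input entity.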

- annotate through sequence similarity: map short reads to genome
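A toy sketch of annotation through sequence similarity, assuming the simplest possible rule: exact matching of a read against a reference sequence. The sequences are made up; real aligners tolerate mismatches and gaps and use indexed genomes, which this sketch does not attempt.

```r
# Hypothetical reference sequence and short reads.
reference <- "ACGTTGCAAGGCTTACGGATC"
reads <- c(r1 = "TTGCAAG", r2 = "ACGGATC", r3 = "GGGGGGG")

# Rule: position of the first exact occurrence of each read in the reference.
pos <- vapply(reads, function(r) {
    as.integer(regexpr(r, reference, fixed = TRUE))
}, integer(1))

# regexpr() reports -1 for "not found"; represent that as NA.
pos[pos == -1L] <- NA_integer_
pos
```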

- cell types through presence of marker genes
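A toy base R sketch of marker-based annotation, assuming a very simple rule: score each cell by the mean expression of each marker set and assign the best-scoring label. Gene names, cell names, cell types and expression values are all hypothetical.

```r
# Toy expression matrix: rows are genes, columns are cells.
expr <- matrix(
    c(9, 8, 0, 1,    # cell1
      0, 1, 7, 9,    # cell2
      8, 0, 1, 0),   # cell3
    nrow = 4,
    dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                    c("cell1", "cell2", "cell3"))
)

# Annotation resource: marker genes per (hypothetical) cell type.
markers <- list(typeX = c("geneA", "geneB"), typeY = c("geneC", "geneD"))

# Rule: mean marker expression per cell and cell type (cells x types matrix).
scores <- sapply(markers, function(g) colMeans(expr[g, , drop = FALSE]))

# Assign to each cell the cell type with the highest score.
labels <- colnames(scores)[max.col(scores, ties.method = "first")]
names(labels) <- rownames(scores)
labels
```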

Reproducibility

In order to guarantee reproducibility of annotation:

✅ annotation resource has to be versioned and findable

🙅 dynamic annotation resources (daily updates) without the possibility to access a specific version

✅ the same version should be used throughout the whole analysis

🙅 using a different genome release for RNA-seq alignment and gene counting.

Standardization and metadata

Common nomenclature is required
It’s great if you can assign a name to an entity, but other researchers must also understand what you mean.

example: synonyms

Synonyms: aliases or previous names of entities.

Can be tricky if different names are used across publications.
👉 use the official HGNC gene name!


🙀 … it can always be worse …

Why? The annotation resource was compiled from community-provided information.

Standardization and common nomenclature are important!
Standardization is important for human consumers, but even more so for computational use (including AI).

2️⃣ Annotation resources

Annotation resources

Where can you get annotations from?

Annotation resources

For genes

NCBI, Ensembl, …

Annotation resources

For proteins

UniProt, Ensembl, …

Annotation resources

For small molecules

HMDB, MassBank, …

Annotation resources

Pathways

Reactome, WikiPathways, KEGG


😒 some annotation resources use their own data/file format, making it cumbersome to include them in data analysis workflows.

Bioconductor annotation resources

Format standardized to simplify integration into the analysis (big thanks to the maintainers! 🙌)
AnnotationDbi package defines a common interface to extract information from annotation resources: mapIds() and select() functions.
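A short sketch of the difference between the two functions, assuming the org.Hs.eg.db package is installed (any other AnnotationDbi-compatible resource should work the same way). The three Ensembl gene IDs are taken from the use case later in this presentation.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

ids <- c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419")

#' select(): returns a data.frame, one row per key/value combination;
#' several columns can be requested at once
tab <- AnnotationDbi::select(org.Hs.eg.db, keys = ids, keytype = "ENSEMBL",
                             columns = c("SYMBOL", "ENTREZID"))
tab

#' mapIds(): returns one value per key for a single column,
#' as a vector named by the input keys
sym <- mapIds(org.Hs.eg.db, keys = ids, keytype = "ENSEMBL",
              column = "SYMBOL")
sym
```

select() can return more rows than input keys when a key maps to several values; mapIds() always returns one element per key.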

Bioconductor annotation resources

- 💍 one to get them all: AnnotationHub

Central registry for annotation resources.

library(AnnotationHub)

#' Synchronize and cache AnnotationHub information locally
ah <- AnnotationHub()
#' List all available resources
ah

AnnotationHub with 69129 records
# snapshotDate(): 2026-04-23
# $dataprovider: Ensembl, BroadInstitute, UCSC, Haemcode, FANTOM5,DLRP,IUPHA...
# $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Rattus norv...
# $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, ChainFile, SQLit...
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH5012"]]' 

             title                          
  AH5012   | Chromosome Band                
  AH5013   | STS Markers                    
  ...        ...                            
  AH122185 | Data.table for MeSH (Qualifier)
  AH122186 | Data.table for MeSH (SCR)

Resources come in a variety of different formats and data types.

#' List data class
table(ah$rdataclass)


                      AAStringSet                        BigWigFile 
                                1                             10247 
                           biopax                         ChainFile 
                                9                              1115 
                        character                            CompDb 
                               10                                 8 
                       data.frame data.frame, DNAStringSet, GRanges 
                               57                                 3 
                       data.table                             EnsDb 
                               25                              5272 
                           FaFile                           GRanges 
                                3                             30545 
                           igraph                     Inparanoid8Db 
                                2                               268 
                           JASPAR                              list 
                                3                                71 
                           MSnSet                          mzRident 
                                1                                 1 
                            OrgDb                               Rda 
                               18                                45 
                              Rle                            sqlite 
                             2365                                 1 
                           SQLite                  SQLiteConnection 
                                5                                 3 
                       SQLiteFile                            String 
                              631                                16 
                           Tibble                        TwoBitFile 
                               69                             17825 
                             TxDb                           VcfFile 
                              502                                 8

A resource can be a reference to an external file.

query(ah, c("ensembl", "TwoBitFile"))

AnnotationHub with 17681 records
# snapshotDate(): 2026-04-23
# $dataprovider: Ensembl
# $species: Mus musculus, Homo sapiens, Danio rerio, Xiphophorus maculatus, ...
# $rdataclass: TwoBitFile
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH49592"]]' 

             title                                                        
  AH49592  | Ailuropoda_melanoleuca.ailMel1.cdna.all.2bit                 
  AH49593  | Ailuropoda_melanoleuca.ailMel1.dna_rm.toplevel.2bit          
  ...        ...                                                          
  AH107046 | Zosterops_lateralis_melanops.ASM128173v1.dna_sm.toplevel.2bit
  AH107047 | Zosterops_lateralis_melanops.ASM128173v1.ncrna.2bit

ah["AH49592"]

AnnotationHub with 1 record
# snapshotDate(): 2026-04-23
# names(): AH49592
# $dataprovider: Ensembl
# $species: Ailuropoda melanoleuca
# $rdataclass: TwoBitFile
# $rdatadateadded: 2015-12-28
# $title: Ailuropoda_melanoleuca.ailMel1.cdna.all.2bit
# $description: TwoBit cDNA sequence for Ailuropoda melanoleuca
# $taxonomyid: 9646
# $genome: ailMel1
# $sourcetype: FASTA
# $sourceurl: ftp://ftp.ensembl.org/pub/release-82/fasta/ailuropoda_melanole...
# $sourcesize: 10988959
# $tags: c("TwoBit", "ensembl", "sequence", "2bit", "FASTA") 
# retrieve record with 'object[["AH49592"]]'

Or a dedicated R object.
We retrieve one such object from AnnotationHub.

query(ah, c("Hsapiens", "EnsDb"))

AnnotationHub with 28 records
# snapshotDate(): 2026-04-23
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

             title                             
  AH53211  | Ensembl 87 EnsDb for Homo Sapiens 
  AH53715  | Ensembl 88 EnsDb for Homo Sapiens 
  ...        ...                               
  AH116860 | Ensembl 112 EnsDb for Homo sapiens
  AH119325 | Ensembl 113 EnsDb for Homo sapiens

edb <- ah[["AH119325"]]
edb

EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Sat Oct 26 21:34:14 2024
|ensembl_version: 113
|ensembl_host: 127.0.0.1
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 87726.
| No. of transcripts: 413674.
|Protein data available.

ℹ️ data is downloaded and cached. Subsequent calls will load the data from the cache.

Bioconductor annotation resources

… and there are also dedicated R packages with specific annotations.

example: identifiers and metadata for human genes.

library(org.Hs.eg.db)
org.Hs.eg.db

OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: HUMAN_DB
| ORGANISM: Homo sapiens
| SPECIES: Human
| EGSOURCEDATE: 2026-Mar18
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 9606
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: https://current.geneontology.org/ontology/go-basic.obo
| GOSOURCEDATE: 2026-01-23
| GOEGSOURCEDATE: 2026-Mar18
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
| GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/database
| GPSOURCEDATE: UTC-Mar19
| ENSOURCEDATE: 2025-Sep03
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Fri Mar 20 10:22:53 2026

Most annotation resources provide mappings between gene identifiers, or mappings from genes to e.g. pathways.
Also available: position-relative annotations.

Position-relative annotations

Mapping between entities (genes, exons, protein domains) and positions (on the genome, within a transcript, within a protein).
GenomicFeatures::TxDb and ensembldb::EnsDb resources.
Organism and release-specific data (SQLite databases).
Positional information: genomic position of exons

#' extract exon coordinates from resource downloaded from AnnotationHub
exons(edb)

GRanges object with 1131826 ranges and 1 metadata column:
                  seqnames            ranges strand |         exon_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENSE00004248723        1       11121-11211      + | ENSE00004248723
  ENSE00004248721        1       11125-11211      + | ENSE00004248721
              ...      ...               ...    ... .             ...
  ENSE00004286596        Y 57215558-57215594      - | ENSE00004286596
  ENSE00004286597        Y 57215558-57215634      - | ENSE00004286597
  -------
  seqinfo: 540 sequences (1 circular) from GRCh38 genome

ℹ️ such resources can also be used to map e.g. between positions within a protein and the location of the encoding sequence in the genome.
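Positional queries can also be restricted to entities of interest. A sketch, assuming the EnsDb.Hsapiens.v75 package (used in the use cases below) is installed, so that nothing has to be downloaded; the same calls should work on an EnsDb retrieved from AnnotationHub. The object name edb75 is chosen here only to avoid overwriting edb.

```r
library(ensembldb)
library(EnsDb.Hsapiens.v75)

edb75 <- EnsDb.Hsapiens.v75

#' exons of a single gene, selected with a filter on the Ensembl gene ID
ex <- exons(edb75, filter = GeneIdFilter("ENSG00000000003"))
ex

#' number of exonic bases after merging overlapping exons
sum(width(reduce(ex)))
```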

3️⃣ Use cases

Annotating RNA-seq, raw reads

(Short) RNA reads
👉 map to genome
👉 count number of reads within exons of genes

Annotating RNA-seq, raw reads

example (Love et al. 2015):
- bulk RNA-seq data
- alignment against genome using STAR

library(airway)
dir(system.file("extdata", package = "airway"))

 [1] "GSE52778_series_matrix.txt"        "Homo_sapiens.GRCh37.75_subset.gtf"
 [3] "quants"                            "sample_table.csv"                 
 [5] "SraRunInfo_SRP033351.csv"          "SRR1039508_subset.bam"            
 [7] "SRR1039509_subset.bam"             "SRR1039512_subset.bam"            
 [9] "SRR1039513_subset.bam"             "SRR1039516_subset.bam"            
[11] "SRR1039517_subset.bam"             "SRR1039520_subset.bam"            
[13] "SRR1039521_subset.bam"

task: quantifying reads per gene.
required: positional (genome) information for genes (exons)
👉 get annotation resource for genome version / Ensembl release.

library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75
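A small consistency check in the spirit of the reproducibility slide. It assumes that the name of the GTF file shipped with airway (listed above) reflects the genome build and Ensembl release used for the alignment; the parsing of the file name is ad hoc and only meant for this example.

```r
library(ensembldb)
library(EnsDb.Hsapiens.v75)

edb <- EnsDb.Hsapiens.v75

#' GTF file name as listed in the airway package
gtf <- "Homo_sapiens.GRCh37.75_subset.gtf"
parts <- strsplit(gtf, ".", fixed = TRUE)[[1]]
gtf_build <- parts[2]                          # "GRCh37"
gtf_release <- sub("_subset$", "", parts[3])   # "75"

#' Ensembl release and genome build of the annotation resource
anno_release <- as.character(ensemblVersion(edb))
anno_build <- unique(genome(seqinfo(edb)))

#' stop early if alignment and annotation do not match
stopifnot(gtf_release == anno_release, gtf_build %in% anno_build)
```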

Get positional information for exons

exns <- exonsBy(edb, by = "gene")
exns

GRangesList object of length 64102:
$ENSG00000000003
GRanges object with 17 ranges and 1 metadata column:
       seqnames            ranges strand |         exon_id
          <Rle>         <IRanges>  <Rle> |     <character>
   [1]        X 99894942-99894988      - | ENSE00001828996
   [2]        X 99891790-99892101      - | ENSE00001863395
   ...      ...               ...    ... .             ...
  [16]        X 99885756-99885863      - | ENSE00000868868
  [17]        X 99883667-99884983      - | ENSE00001459322
  -------
  seqinfo: 273 sequences (1 circular) from GRCh37 genome

...
<64101 more elements>

Get the BAM files with the alignment results for the present experiment.

#' Get the BAM file names in the *airway* package
f <- dir(system.file("extdata", package = "airway"),
         pattern = "bam$", full.names = TRUE)

library("Rsamtools")
bamfiles <- BamFileList(f, yieldSize = 2000000)

Use GenomicAlignments::summarizeOverlaps() to count reads falling within exon boundaries.

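library(GenomicAlignments)
#' Count paired-end fragments (singleEnd = FALSE) overlapping the exons
#' of each gene, ignoring the strand of the reads
se <- summarizeOverlaps(features = exns, reads = bamfiles,
                        mode = "Union", singleEnd = FALSE,
                        ignore.strand = TRUE,
                        fragments = TRUE)
se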

class: RangedSummarizedExperiment 
dim: 64102 8 
metadata(0):
assays(1): counts
rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
rowData names(0):
colnames(8): SRR1039508_subset.bam SRR1039509_subset.bam ...
  SRR1039520_subset.bam SRR1039521_subset.bam
colData names(0):

Result: gene count data.
ℹ️ afternoon labs will use an updated workflow, with Salmon for alignment against the transcriptome and the tximeta Bioconductor package for managing data and annotation resources.

Annotating gene information

Common task in annotation: add additional metadata to existing identifiers.

example: bulk RNA-seq data airway:

se

class: RangedSummarizedExperiment 
dim: 64102 8 
metadata(0):
assays(1): counts
rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
rowData names(0):
colnames(8): SRR1039508_subset.bam SRR1039509_subset.bam ...
  SRR1039520_subset.bam SRR1039521_subset.bam
colData names(0):

ℹ️ gene quantification data, 8 samples.

What gene identifiers do we have?

#' extract gene identifiers
ids <- rownames(se)
head(ids)

[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
[5] "ENSG00000000460" "ENSG00000000938"

We’ve got Ensembl gene IDs.
🎯 get official HGNC gene symbols (names) for the genes.

👉 use AnnotationDbi::mapIds() with org.Hs.eg.db.

library(org.Hs.eg.db)
#' available gene annotations:
columns(org.Hs.eg.db)

 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[26] "UNIPROT"

#' map Ensembl gene IDs to official gene symbols
symbols <- mapIds(org.Hs.eg.db, keys = ids, keytype = "ENSEMBL",
                  column = "SYMBOL", multiVals = "first")
head(symbols)

ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460 
       "TSPAN6"          "TNMD"          "DPM1"         "SCYL3"         "FIRRM" 
ENSG00000000938 
          "FGR"

ℹ️ multiVals defines how one-to-many mappings are handled; the default is "first", other options are "list", "filter", "asNA" and "CharacterList".
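A sketch of how the choice of multiVals changes the result, assuming org.Hs.eg.db is installed. The "ALIAS" column is used because a gene typically has several aliases (synonyms); the three gene IDs are the first ones from the example above.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

ids <- c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419")

#' "first": keep only the first alias per gene (named character vector)
first_alias <- mapIds(org.Hs.eg.db, keys = ids, keytype = "ENSEMBL",
                      column = "ALIAS", multiVals = "first")

#' "list": keep all aliases per gene (named list)
all_alias <- mapIds(org.Hs.eg.db, keys = ids, keytype = "ENSEMBL",
                    column = "ALIAS", multiVals = "list")

#' number of aliases per gene
lengths(all_alias)

#' number of IDs without any mapping
sum(is.na(first_alias))
```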

Annotation for enrichment analysis

going one step further … 🎯 annotating genes to biological pathways.

👉 define gene sets.
👉 map identifiers to gene identifiers used in pathway annotation resource.

Annotation for enrichment analysis

Enrichment analysis for biological pathways from reactome.
example: use the reactome.db package to get the list of genes per pathway.

library(reactome.db)

#' Available data
columns(reactome.db)

[1] "ENTREZID"   "GO"         "PATHID"     "PATHNAME"   "REACTOMEID"

ℹ️ gene identifiers: resource uses NCBI EntrezGene IDs.

Define the gene sets.

mapping <- select(reactome.db, keys = keys(reactome.db),
                  columns = c("ENTREZID", "PATHID"))
head(mapping)

  ENTREZID        PATHID
1        1  R-HSA-109582
2        1  R-HSA-114608
3        1  R-HSA-168249
4        1  R-HSA-168256
5        1 R-HSA-6798695
6        1   R-HSA-76002

Convert into a list of genes per pathway.

gs <- split(mapping$ENTREZID, mapping$PATHID)
gs[1:2]

$`R-BTA-1059683`
[1] "280826" "282081" "507359" "512484" "527418" "533590"

$`R-BTA-109581`
 [1] "100101492" "100140945" "100296226" "280730"    "280955"    "281020"   
 [7] "281048"    "281169"    "282125"    "282126"    "282152"    "282321"   
[13] "282691"    "286862"    "286863"    "287022"    "327672"    "404144"   
[19] "404151"    "407101"    "407111"    "408016"    "493720"    "493999"   
[25] "504707"    "504727"    "506223"    "507481"    "507804"    "507981"   
[31] "508345"    "508646"    "510373"    "510767"    "512405"    "514090"   
[37] "516952"    "517850"    "528453"    "529902"    "531514"    "533233"   
[43] "533527"    "533949"    "538801"    "539003"    "539350"    "539679"   
[49] "539941"    "540108"    "540369"    "540444"    "540643"    "540892"   
[55] "541141"    "614840"    "616398"    "768311"    "785911"
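The first gene sets listed above are not human pathways: Reactome stable identifiers carry a species code, with "R-HSA-" for Homo sapiens. A base R sketch of restricting the gene sets to human pathways, shown on a small toy mapping table built from a few of the rows printed above so that it runs without reactome.db:

```r
# Toy version of the mapping table (subset of the values shown above).
mapping <- data.frame(
    ENTREZID = c("1", "1", "1", "280826", "282081"),
    PATHID = c("R-HSA-109582", "R-HSA-114608", "R-HSA-168249",
               "R-BTA-1059683", "R-BTA-1059683"),
    stringsAsFactors = FALSE
)

# List of genes per pathway, as above.
gs <- split(mapping$ENTREZID, mapping$PATHID)

# Keep only pathways with the human species prefix.
gs_human <- gs[startsWith(names(gs), "R-HSA-")]
names(gs_human)
```

The same filter can be applied to the full gs object created from reactome.db.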

Map input genes from Ensembl to EntrezGene IDs.

entrez <- mapIds(edb, keys = rownames(se), keytype = "GENEID",
                 column = "ENTREZID", multiVals = "first")
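Not every Ensembl gene has an EntrezGene ID, and several Ensembl genes can map to the same one. A base R sketch of one possible clean-up before the identifiers are used as row names, shown on a toy count matrix; all names and values are hypothetical, and keeping the first of several duplicates is only one of several reasonable strategies.

```r
# Toy count matrix with made-up gene and sample names.
counts <- matrix(1:8, nrow = 4,
                 dimnames = list(c("ens1", "ens2", "ens3", "ens4"),
                                 c("s1", "s2")))

# Hypothetical mapping result: ens2 is unmapped, ens3 and ens4 share an ID.
entrez <- c(ens1 = "101", ens2 = NA, ens3 = "102", ens4 = "102")

# Keep mapped genes only, and only the first gene per EntrezGene ID.
keep <- !is.na(entrez) & !duplicated(entrez)
counts_entrez <- counts[keep, , drop = FALSE]
rownames(counts_entrez) <- entrez[keep]
counts_entrez
```

The logical vector keep can be used in the same way to subset a SummarizedExperiment, e.g. se[keep, ].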

👉 use this definition of gene sets in enrichment functions, such as EnrichmentBrowser::sbea() or others.

library(EnrichmentBrowser)

rownames(se) <- entrez
#' ... assuming differential expression analysis was done too ...
res <- sbea(method = "ora", se, gs = gs)
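Over-representation analysis is typically based on the hypergeometric distribution. A base R sketch of that test for a single gene set, with made-up numbers; whether sbea() computes its "ora" p-values in exactly this way is not verified here.

```r
# Hypothetical numbers for one pathway.
universe <- 1000   # genes tested in the experiment
de <- 50           # differentially expressed (DE) genes
set_size <- 40     # genes of the pathway among the tested genes
overlap <- 8       # DE genes that are in the pathway

# P(X >= overlap) when drawing `de` genes at random from the universe.
p_hyper <- phyper(overlap - 1, m = set_size, n = universe - set_size,
                  k = de, lower.tail = FALSE)

# Same test as a one-sided Fisher's exact test on the 2x2 table
# (rows: DE / not DE, columns: in pathway / not in pathway).
tab <- matrix(c(overlap, set_size - overlap,
                de - overlap, universe - set_size - de + overlap),
              nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

c(hypergeometric = p_hyper, fisher = p_fisher)
```

With 50 DE genes and a pathway covering 4% of the universe, about 2 DE genes are expected in the pathway by chance, so an overlap of 8 yields a small p-value.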

ℹ️ EnrichmentBrowser makes the mapping easier with its idMap() and getGenesets() functions.

Annotating mass spectrometry data

… 👀 Thursday’s Metabolomics lecture …

Annotating single-cell RNA-seq

🎯 cell type recognition from single-cell RNA-seq data.

SingleR package:

👉 reference gene expression data for known (pure) cell types.

👉 unbiased annotation by comparing single-cell gene expression against reference.

Load reference dataset using celldex: Human Primary Cell Atlas

library(celldex)
hpca_se <- HumanPrimaryCellAtlasData()
hpca_se

class: SummarizedExperiment 
dim: 19363 713 
metadata(0):
assays(1): logcounts
rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(0):
colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
colData names(3): label.main label.fine label.ont

Test data set: (La Manno et al. 2016), subset to first 100 cells

library(scRNAseq)
hESCs <- LaMannoBrainData("human-es")
hESCs <- hESCs[, 1:100]

Annotate each cell based on (non-parametric) correlation of gene expression data between test and reference.

library(SingleR)
pred_ct <- SingleR(test = hESCs, ref = hpca_se, assay.type.test = 1,
                   labels = hpca_se$label.main)
pred_ct

DataFrame with 100 rows and 4 columns
                                        scores               labels delta.next
                                      <matrix>          <character>  <numeric>
1772122_301_C02 0.347652:0.139036:0.109547:... Neuroepithelial_cell  0.0833286
1772122_180_E05 0.361187:0.155395:0.134934:...              Neurons  0.0728350
...                                        ...                  ...        ...
1772122_298_F09 0.332361:0.173357:0.141439:... Neuroepithelial_cell  0.1200606
1772122_302_A11 0.324928:0.127518:0.101609:...            Astrocyte  0.0509478
                       pruned.labels
                         <character>
1772122_301_C02 Neuroepithelial_cell
1772122_180_E05              Neurons
...                              ...
1772122_298_F09 Neuroepithelial_cell
1772122_302_A11            Astrocyte
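The core idea, assigning the label of the best-correlated reference profile, can be sketched in base R on toy data. This is a strong simplification: SingleR additionally selects marker genes and fine-tunes its calls, which is omitted here, and all names and values below are hypothetical.

```r
# Reference: genes x reference profiles, with one known label per profile.
ref <- matrix(c(1, 2, 3, 4, 5,     # refA
                5, 4, 3, 2, 1),    # refB
              nrow = 5,
              dimnames = list(paste0("gene", 1:5), c("refA", "refB")))
ref_labels <- c(refA = "typeA", refB = "typeB")

# Test data: genes x cells.
test <- matrix(c(10, 20, 25, 40, 80,   # cell1
                 9, 7, 6, 2, 1),       # cell2
               nrow = 5,
               dimnames = list(paste0("gene", 1:5), c("cell1", "cell2")))

# Spearman (rank-based) correlation of each cell with each reference profile.
rho <- cor(test, ref, method = "spearman")

# Label each cell with the label of its best-correlated reference profile.
best <- colnames(rho)[max.col(rho, ties.method = "first")]
labels <- ref_labels[best]
names(labels) <- rownames(rho)
labels
```

Because the correlation is rank-based, the different scales of test and reference data (e.g. counts versus log-expression) do not affect the result.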

Thank you for your attention 🙌

References

La Manno, Gioele, Daniel Gyllborg, Simone Codeluppi, et al. 2016. “Molecular Diversity of Midbrain Development in Mouse, Human, and Stem Cells.” Cell 167 (2): 566–580.e19. https://doi.org/10.1016/j.cell.2016.09.027.

Love, Michael I., Simon Anders, Vladislav Kim, and Wolfgang Huber. 2015. “RNA-Seq Workflow: Gene-Level Exploratory Analysis and Differential Expression.” … 4: 1070. https://doi.org/10.12688/f1000research.7035.1.