With content from Lori Shepherd Kern, Martin Morgan and James W. MacDonald. CSAMA 2026
1️⃣ General information on annotation
In order to guarantee reproducibility of annotation:
✅ annotation resource has to be versioned and findable
🙅 dynamic annotation resources (daily updates) without the possibility to access a specific version
✅ the same version should be used throughout the whole analysis
🙅 using a different genome release for RNA-seq alignment and gene counting.
Why? Annotation resources are compiled from community-provided information and change as that information is updated.
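One simple way to keep the annotation versions findable is to store them with the analysis results. A minimal sketch, assuming the listed packages are the ones used in the analysis (the package names are only examples and are skipped if not installed):

```r
## Packages whose versions should be recorded (example names, adapt as needed).
pkgs <- c("AnnotationHub", "ensembldb", "org.Hs.eg.db")

## Keep only the packages that are actually installed.
installed <- pkgs[vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

## Named character vector: package name -> version string.
versions <- vapply(installed,
                   function(p) as.character(packageVersion(p)),
                   character(1))
versions

## Full record of R and package versions for the analysis log.
sessionInfo()
```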
2️⃣ Annotation resources
For genes
For proteins
For small molecules
Pathways
mapIds() and select() functions.
💍 One to get them all: AnnotationHub
AnnotationHub with 69129 records
# snapshotDate(): 2026-04-23
# $dataprovider: Ensembl, BroadInstitute, UCSC, Haemcode, FANTOM5,DLRP,IUPHA...
# $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Rattus norv...
# $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, ChainFile, SQLit...
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH5012"]]'
title
AH5012 | Chromosome Band
AH5013 | STS Markers
... ...
AH122185 | Data.table for MeSH (Qualifier)
AH122186 | Data.table for MeSH (SCR)
AAStringSet                        BigWigFile
          1                             10247
     biopax                         ChainFile
          9                              1115
  character                            CompDb
         10                                 8
 data.frame data.frame, DNAStringSet, GRanges
         57                                 3
 data.table                             EnsDb
         25                              5272
     FaFile                           GRanges
          3                             30545
     igraph                     Inparanoid8Db
          2                               268
     JASPAR                              list
          3                                71
     MSnSet                          mzRident
          1                                 1
      OrgDb                               Rda
         18                                45
        Rle                            sqlite
       2365                                 1
     SQLite                  SQLiteConnection
          5                                 3
 SQLiteFile                            String
        631                                16
     Tibble                        TwoBitFile
         69                             17825
       TxDb                           VcfFile
        502                                 8
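The hub summary and the per-class record counts shown above are the kind of output obtained by creating the hub object and tabulating its rdataclass column. A minimal sketch, assuming the AnnotationHub package is installed and the hub is reachable (the first call downloads and caches the metadata):

```r
library(AnnotationHub)

## Connect to the hub; metadata are cached locally after the first call.
ah <- AnnotationHub()

## Printing the object gives the summary (record count, snapshot date, ...).
ah

## Snapshot date identifying the version of the hub metadata in use.
snapshotDate(ah)

## Number of records per R class of the stored resource.
cls <- table(ah$rdataclass)
cls
```

Reporting the snapshot date together with the record IDs used (AH...) is what makes the retrieved annotation reproducible.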
AnnotationHub with 17681 records
# snapshotDate(): 2026-04-23
# $dataprovider: Ensembl
# $species: Mus musculus, Homo sapiens, Danio rerio, Xiphophorus maculatus, ...
# $rdataclass: TwoBitFile
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH49592"]]'
title
AH49592 | Ailuropoda_melanoleuca.ailMel1.cdna.all.2bit
AH49593 | Ailuropoda_melanoleuca.ailMel1.dna_rm.toplevel.2bit
... ...
AH107046 | Zosterops_lateralis_melanops.ASM128173v1.dna_sm.toplevel.2bit
AH107047 | Zosterops_lateralis_melanops.ASM128173v1.ncrna.2bit
AnnotationHub with 1 record
# snapshotDate(): 2026-04-23
# names(): AH49592
# $dataprovider: Ensembl
# $species: Ailuropoda melanoleuca
# $rdataclass: TwoBitFile
# $rdatadateadded: 2015-12-28
# $title: Ailuropoda_melanoleuca.ailMel1.cdna.all.2bit
# $description: TwoBit cDNA sequence for Ailuropoda melanoleuca
# $taxonomyid: 9646
# $genome: ailMel1
# $sourcetype: FASTA
# $sourceurl: ftp://ftp.ensembl.org/pub/release-82/fasta/ailuropoda_melanole...
# $sourcesize: 10988959
# $tags: c("TwoBit", "ensembl", "sequence", "2bit", "FASTA")
# retrieve record with 'object[["AH49592"]]'
AnnotationHub with 28 records
# snapshotDate(): 2026-04-23
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53211"]]'
title
AH53211 | Ensembl 87 EnsDb for Homo Sapiens
AH53715 | Ensembl 88 EnsDb for Homo Sapiens
... ...
AH116860 | Ensembl 112 EnsDb for Homo sapiens
AH119325 | Ensembl 113 EnsDb for Homo sapiens
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Sat Oct 26 21:34:14 2024
|ensembl_version: 113
|ensembl_host: 127.0.0.1
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 87726.
| No. of transcripts: 413674.
|Protein data available.
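The listings above (a filtered hub, a single record, the human EnsDb records and a loaded EnsDb) can be obtained along the following lines. This is a sketch, not necessarily the calls used for the slides: the search terms are assumptions, while the record IDs are taken from the output above. Loading the EnsDb requires the ensembldb package.

```r
library(AnnotationHub)
ah <- AnnotationHub()

## Restrict to records matching all search terms (terms are an assumption).
query(ah, c("TwoBitFile", "Ensembl"))

## Metadata of one record; single brackets do not download anything.
ah["AH49592"]

## Human EnsDb databases available in this snapshot.
human_edbs <- query(ah, c("EnsDb", "Homo sapiens"))
human_edbs

## Double brackets download (and cache) the resource and load it.
edb <- ah[["AH119325"]]
edb
```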
OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: HUMAN_DB
| ORGANISM: Homo sapiens
| SPECIES: Human
| EGSOURCEDATE: 2026-Mar18
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 9606
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: https://current.geneontology.org/ontology/go-basic.obo
| GOSOURCEDATE: 2026-01-23
| GOEGSOURCEDATE: 2026-Mar18
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
| GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/database
| GPSOURCEDATE: UTC-Mar19
| ENSOURCEDATE: 2025-Sep03
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Fri Mar 20 10:22:53 2026
GenomicFeatures::TxDb and ensembldb::EnsDb resources.
Mapping between entities (genes, exons, protein domains) and positions (on the genome, within a transcript, within a protein).
Organism and release-specific data (SQLite databases).
Positional information: genomic position of exons
GRanges object with 1131826 ranges and 1 metadata column:
seqnames ranges strand | exon_id
<Rle> <IRanges> <Rle> | <character>
ENSE00004248723 1 11121-11211 + | ENSE00004248723
ENSE00004248721 1 11125-11211 + | ENSE00004248721
... ... ... ... . ...
ENSE00004286596 Y 57215558-57215594 - | ENSE00004286596
ENSE00004286597 Y 57215558-57215634 - | ENSE00004286597
-------
seqinfo: 540 sequences (1 circular) from GRCh38 genome
ℹ️ such resources can also be used to map e.g. between positions within a protein and the location of the encoding sequence in the genome.
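Exon positions like the ones above can be extracted from an EnsDb with exons(). A minimal sketch, assuming the Ensembl 113 EnsDb (AH119325) from AnnotationHub is used; the gene name in the filter is only an example:

```r
library(AnnotationHub)
library(ensembldb)

## Ensembl 113 EnsDb for human, retrieved (and cached) from AnnotationHub.
edb <- AnnotationHub()[["AH119325"]]

## Genomic coordinates of all annotated exons as a GRanges object.
exns <- exons(edb)
exns

## Restrict the result to one gene with a filter expression.
exns_gene <- exons(edb, filter = ~ gene_name == "TSPAN6")
exns_gene
```

For the mapping between protein and genome coordinates mentioned above, ensembldb provides functions such as proteinToGenome(); they are not shown here.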
3️⃣ Use cases
[1] "GSE52778_series_matrix.txt" "Homo_sapiens.GRCh37.75_subset.gtf"
[3] "quants" "sample_table.csv"
[5] "SraRunInfo_SRP033351.csv" "SRR1039508_subset.bam"
[7] "SRR1039509_subset.bam" "SRR1039512_subset.bam"
[9] "SRR1039513_subset.bam" "SRR1039516_subset.bam"
[11] "SRR1039517_subset.bam" "SRR1039520_subset.bam"
[13] "SRR1039521_subset.bam"
GRangesList object of length 64102:
$ENSG00000000003
GRanges object with 17 ranges and 1 metadata column:
seqnames ranges strand | exon_id
<Rle> <IRanges> <Rle> | <character>
[1] X 99894942-99894988 - | ENSE00001828996
[2] X 99891790-99892101 - | ENSE00001863395
... ... ... ... . ...
[16] X 99885756-99885863 - | ENSE00000868868
[17] X 99883667-99884983 - | ENSE00001459322
-------
seqinfo: 273 sequences (1 circular) from GRCh37 genome
...
<64101 more elements>
GenomicAlignments::summarizeOverlaps() to count reads falling within exon boundaries.
class: RangedSummarizedExperiment
dim: 64102 8
metadata(0):
assays(1): counts
rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
rowData names(0):
colnames(8): SRR1039508_subset.bam SRR1039509_subset.bam ...
SRR1039520_subset.bam SRR1039521_subset.bam
colData names(0):
ℹ️ gene quantification data, 8 samples.
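The counting step can be sketched as follows. The assumptions here are that the BAM files are the subsetted files shipped with the airway package (the file names above match its extdata folder) and that the EnsDb.Hsapiens.v75 package provides the matching GRCh37 annotation; the slides may have used a different annotation source.

```r
library(Rsamtools)
library(GenomicAlignments)
library(EnsDb.Hsapiens.v75)  # assumption: GRCh37 annotation matching the BAM files

## Example BAM files from the airway package (must be installed).
dir <- system.file("extdata", package = "airway")
bam_files <- BamFileList(list.files(dir, pattern = "\\.bam$", full.names = TRUE))

## Exons grouped by gene: one GRanges per gene in a GRangesList.
exns_by_gene <- exonsBy(EnsDb.Hsapiens.v75, by = "gene")

## Count read pairs overlapping the exons of each gene.
se <- summarizeOverlaps(features = exns_by_gene, reads = bam_files,
                        mode = "Union", singleEnd = FALSE,
                        ignore.strand = TRUE)
se
head(assay(se))
```

Using the same genome release for the alignment and for the exon coordinates is exactly the consistency requirement from the first section.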
[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
[5] "ENSG00000000460" "ENSG00000000938"
AnnotationDbi::mapIds() with org.Hs.eg.db.
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
[16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
[21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
[26] "UNIPROT"
ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460
"TSPAN6" "TNMD" "DPM1" "SCYL3" "FIRRM"
ENSG00000000938
"FGR"
ℹ️ multiVals… to handle ambiguity; the default is "first", other options are "list", "filter", "asNA" and "CharacterList".
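A minimal sketch of the ID mapping, assuming org.Hs.eg.db is installed; the three Ensembl IDs are taken from the row names shown above:

```r
library(org.Hs.eg.db)   # attaches AnnotationDbi as well

ens_ids <- c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419")

## Annotation fields that can be requested or used as key type.
columns(org.Hs.eg.db)

## One symbol per Ensembl gene ID; multiVals decides what happens when an
## ID maps to several symbols ("first" keeps the first match).
sym <- mapIds(org.Hs.eg.db, keys = ens_ids, column = "SYMBOL",
              keytype = "ENSEMBL", multiVals = "first")
sym

## Keep all mappings instead; the result is a list with one element per key.
sym_all <- mapIds(org.Hs.eg.db, keys = ens_ids, column = "SYMBOL",
                  keytype = "ENSEMBL", multiVals = "list")
```

The result is named by the input keys, so it can be assigned directly to a rowData column of a SummarizedExperiment with the same row names.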
[1] "ENTREZID" "GO" "PATHID" "PATHNAME" "REACTOMEID"
ℹ️ gene identifiers: resource uses NCBI EntrezGene IDs.
ENTREZID PATHID
1 1 R-HSA-109582
2 1 R-HSA-114608
3 1 R-HSA-168249
4 1 R-HSA-168256
5 1 R-HSA-6798695
6 1 R-HSA-76002
list of genes per pathway.
$`R-BTA-1059683`
[1] "280826" "282081" "507359" "512484" "527418" "533590"
$`R-BTA-109581`
[1] "100101492" "100140945" "100296226" "280730" "280955" "281020"
[7] "281048" "281169" "282125" "282126" "282152" "282321"
[13] "282691" "286862" "286863" "287022" "327672" "404144"
[19] "404151" "407101" "407111" "408016" "493720" "493999"
[25] "504707" "504727" "506223" "507481" "507804" "507981"
[31] "508345" "508646" "510373" "510767" "512405" "514090"
[37] "516952" "517850" "528453" "529902" "531514" "533233"
[43] "533527" "533949" "538801" "539003" "539350" "539679"
[49] "539941" "540108" "540369" "540444" "540643" "540892"
[55] "541141" "614840" "616398" "768311" "785911"
EnrichmentBrowser::sbea() or others.
ℹ️ EnrichmentBrowser makes the mapping easier with its idMap() and getGenesets() functions.
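The pathway tables and the per-pathway gene lists shown above can be obtained from reactome.db roughly as follows. A sketch, assuming reactome.db is installed; the final restriction to human pathways is an addition for illustration, not part of the output above:

```r
library(reactome.db)   # attaches AnnotationDbi as well

## Fields available in the resource.
columns(reactome.db)

## Pathways for one gene; keys are NCBI Entrez Gene IDs.
pw <- select(reactome.db, keys = "1", columns = "PATHID",
             keytype = "ENTREZID")
head(pw)

## All genes per pathway as a named list (names are Reactome pathway IDs,
## across all species).
genes_per_pathway <- as.list(reactomePATHID2EXTID)
head(genes_per_pathway, 2)

## Keep only human pathways (IDs start with "R-HSA-").
human_sets <- genes_per_pathway[grep("^R-HSA-", names(genes_per_pathway))]
length(human_sets)
```

Because the gene sets use Entrez Gene IDs, expression data with Ensembl row names need to be mapped first, e.g. with mapIds().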
🎯 cell type recognition from single-cell RNA-seq data.
SingleR package:
👉 reference gene expression data for known (pure) cell types.
👉 unbiased annotation by comparing single-cell gene expression against reference.
class: SummarizedExperiment
dim: 19363 713
metadata(0):
assays(1): logcounts
rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(0):
colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
colData names(3): label.main label.fine label.ont
DataFrame with 100 rows and 4 columns
scores labels delta.next
<matrix> <character> <numeric>
1772122_301_C02 0.347652:0.139036:0.109547:... Neuroepithelial_cell 0.0833286
1772122_180_E05 0.361187:0.155395:0.134934:... Neurons 0.0728350
... ... ... ...
1772122_298_F09 0.332361:0.173357:0.141439:... Neuroepithelial_cell 0.1200606
1772122_302_A11 0.324928:0.127518:0.101609:... Astrocyte 0.0509478
pruned.labels
<character>
1772122_301_C02 Neuroepithelial_cell
1772122_180_E05 Neurons
... ...
1772122_298_F09 Neuroepithelial_cell
1772122_302_A11 Astrocyte
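A sketch of how such a cell type assignment can be run. The reference dimensions and column names above match celldex's Human Primary Cell Atlas data, and the cell names match the La Manno human ES cell data from scRNAseq; both are assumptions about the data used on the slides.

```r
library(SingleR)
library(celldex)
library(scRNAseq)

## Reference: expression profiles of pure cell types with known labels.
ref <- HumanPrimaryCellAtlasData()
ref

## Test data (assumption): first 100 cells of the La Manno human ES data set.
sce <- LaMannoBrainData("human-es")
sce <- sce[, 1:100]

## Assign to each cell the label of the best-matching reference cell type;
## assay.type.test = 1 uses the first (count) assay of the test data.
pred <- SingleR(test = sce, ref = ref, assay.type.test = 1,
                labels = ref$label.main)
pred

## Number of cells per assigned label.
table(pred$labels)
```

The pruned.labels column is NA for cells whose assignment SingleR considers unreliable, so comparing it with labels gives a quick quality check.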
Thank you for your attention 🙌