Version 1 (modified by jamverlouw, 9 months ago)


Reference and annotation

File naming policy

Include md5sum files by running '$ md5sum [filename] > [filename].md5sum'.
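As a rough sketch of this policy, the snippet below writes a checksum file next to a reference file and then verifies it. The working directory and the file name example_annotation.gtf are hypothetical and only serve the illustration; GNU md5sum is assumed.

```shell
# Hypothetical file, created only for this illustration.
workdir=$(mktemp -d)
printf '1\tensembl\texon\t100\t200\t.\t+\t.\tgene_id "G1";\n' > "$workdir/example_annotation.gtf"

# Write the checksum next to the file, following the naming policy above.
(cd "$workdir" && md5sum example_annotation.gtf > example_annotation.gtf.md5sum)

# Later, or after a transfer, verify the file against the stored checksum.
(cd "$workdir" && md5sum -c example_annotation.gtf.md5sum)
```

Running md5sum from inside the directory keeps the path stored in the .md5sum file relative, so the check still works after the files are copied elsewhere together.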

Annotation source: Ensembl v.71

We use Ensembl as our primary source of annotation and fix it at v.71 (link to archived v.71 BioMart). To get this version of Ensembl in the R package biomaRt:
library(biomaRt)
# host is left blank here; it should point to the archived v.71 BioMart
ensembl = useMart(biomart = "ENSEMBL_MART_ENSEMBL", host = "", path = "/biomart/martservice", dataset = "hsapiens_gene_ensembl")
G = getBM(c("chromosome_name", "start_position", "end_position", "ensembl_gene_id"), mart = ensembl)

This table describes the reference and annotation files.

Transcript GTF    srm://
Meta Exon GTF     srm://
Masked genome     srm://
STAR index        srm://

Reference and annotation files description.

1. Transcript annotation.

To create this transcript annotation, the human GTF annotation was downloaded from Ensembl v.71 (containing Gencode v.16). Only genes on chromosomes 1-22, X, Y and MT were retained. Then, for each chromosome, genes were sorted by their start position.
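The filtering and sorting described above can be sketched roughly as follows. This is a minimal illustration, not the exact commands that were used: the input rows, the file names raw.gtf and filtered_sorted.gtf, and the make_row helper are hypothetical, and Ensembl-style chromosome names without a "chr" prefix are assumed.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
tab=$(printf '\t')

# Hypothetical helper that prints one toy GTF row: chromosome, start, end, gene id.
make_row() {
  printf '%s\tprotein_coding\texon\t%s\t%s\t.\t+\t.\tgene_id "%s";\n' "$1" "$2" "$3" "$4"
}
{
  make_row 2 500 600 G3
  make_row HG1_PATCH 10 20 G9
  make_row 1 900 950 G2
  make_row 1 100 200 G1
  make_row MT 5 50 G4
} > raw.gtf

# Keep chromosomes 1-22, X, Y and MT, then sort by chromosome and start position.
awk -F'\t' '$1 ~ /^([1-9]|1[0-9]|2[0-2]|X|Y|MT)$/' raw.gtf \
  | LC_ALL=C sort -t "$tab" -k1,1 -k4,4n > filtered_sorted.gtf

cut -f1,4,5 filtered_sorted.gtf
```

Note that sorting column 1 as text places chromosome 10 before chromosome 2; within each chromosome the rows are still ordered by start position, which is what the description requires.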

2. Meta-exon annotation.

To create the meta-exon annotation we merged all overlapping exons from Ensembl v.71 (see the Transcript annotation section) using the mergeBed tool from the BEDTools suite. Overlapping exons belonging to different genes or to different strands were also merged into one meta-exon. See the Meta-exon annotation documentation for a detailed description of how the meta-exon annotation was created.

See Meta-exon annotation 05-06-13 for issues with this file.
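To illustrate what merging overlapping exons into meta-exons means, here is a small stand-in written in awk. It is not the mergeBed run that produced the actual file; the toy intervals and the file names exons.bed and meta_exons.bed are hypothetical. Like mergeBed without the -s option, it ignores strand and gene when deciding what to merge.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy exon intervals: chromosome, start, end.
printf '1\t100\t200\n1\t150\t300\n1\t400\t500\n2\t50\t80\n2\t60\t70\n' > exons.bed

# With BEDTools installed, the equivalent step would be roughly:
#   sort -k1,1 -k2,2n exons.bed | mergeBed -i stdin
LC_ALL=C sort -k1,1 -k2,2n exons.bed | awk 'BEGIN { FS = OFS = "\t" }
  NR == 1 { chr = $1; s = $2 + 0; e = $3 + 0; next }
  $1 == chr && $2 + 0 <= e {        # overlaps the current meta-exon: extend it
    if ($3 + 0 > e) e = $3 + 0
    next
  }
  { print chr, s, e; chr = $1; s = $2 + 0; e = $3 + 0 }   # gap: emit and start a new one
  END { if (NR > 0) print chr, s, e }' > meta_exons.bed

cat meta_exons.bed
```

On the toy input, the two overlapping intervals on chromosome 1 collapse into 100-300, and the interval nested inside 50-80 on chromosome 2 disappears into it.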

3. Masked genome.

To mask the genome we took all SNPs called in the GoNL project that had a MAF > 1% and replaced them with “N” in the genome fasta files using the maskFastaFromBed tool from the BEDTools suite.
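The masking step can be pictured with the toy example below. It is only a sketch under stated assumptions: the genome, the SNP table, its four-column layout (chromosome, 0-based start, end, MAF) and all file names are hypothetical, and the awk code merely imitates the effect of maskFastaFromBed on single-base SNPs.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy genome and SNP table. The fourth column (MAF) is a hypothetical layout.
printf '>1 toy chromosome\nACGTACGTAC\n' > genome.fa
printf '1\t2\t3\t0.25\n1\t5\t6\t0.005\n1\t7\t8\t0.02\n' > snps.bed

# With BEDTools installed, the masking itself would look like:
#   maskFastaFromBed -fi genome.fa -bed snps_maf_gt_1pct.bed -fo masked.fa
# The awk below imitates that on the toy data without needing BEDTools.
awk 'BEGIN { FS = "\t" }
  NR == FNR {                                  # first file: SNP table
    if ($4 + 0 > 0.01) mask[$1, $2 + 1] = 1    # keep MAF > 1%; BED start is 0-based
    next
  }
  /^>/ {                                       # FASTA header: name ends at first space
    split(substr($0, 2), h, " "); chr = h[1]; pos = 0
    print; next
  }
  {                                            # sequence line: put N at masked positions
    out = ""
    for (i = 1; i <= length($0); i++) {
      pos++
      out = out (((chr, pos) in mask) ? "N" : substr($0, i, 1))
    }
    print out
  }' snps.bed genome.fa > masked.fa

cat masked.fa
```

The SNP with MAF 0.005 falls below the 1% threshold, so its base is left unmasked.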

4. STAR genome index.

To make the masked genome index we ran STAR in genomeGenerate mode on the masked genome fasta files, setting the --sjdbOverhang parameter to 100 and using the transcript annotation from Ensembl v.71.
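A command along the following lines would match that description. The three paths are hypothetical placeholders, and any further options that were used (threads, memory limits) are not documented here, so treat this as a sketch rather than the exact invocation. The snippet only prints the command unless RUN_STAR=1 is set and STAR is on the PATH.

```shell
# Hypothetical paths; replace with the real masked genome and transcript GTF.
genome_dir="star_index_masked"
genome_fasta="masked_genome.fa"
transcript_gtf="transcripts_v71.gtf"

star_cmd="STAR --runMode genomeGenerate --genomeDir $genome_dir --genomeFastaFiles $genome_fasta --sjdbGTFfile $transcript_gtf --sjdbOverhang 100"

# Print the command; run it only when explicitly requested and STAR is available.
echo "$star_cmd"
if [ "${RUN_STAR:-0}" = "1" ] && command -v STAR >/dev/null 2>&1; then
  mkdir -p "$genome_dir"
  $star_cmd
fi
```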

GTF to BED conversion

The paste command below converts the meta-exon GTF into the BED format required for downstream quantification, with the columns in the correct order.

paste <(cut -f1,4,5 meta-exons_v71_cut_sorted_05-06-13.gtf)  <(cut -f9 meta-exons_v71_cut_sorted_05-06-13.gtf | cut -d';' -f2) | paste - <(cut -f9 meta-exons_v71_cut_sorted_05-06-13.gtf ) > meta-exons_v71_cut_sorted_05-06-13.bed
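To make the resulting column layout easier to see, here is a one-pass awk sketch that should give the same five columns on a typical row: chromosome, start, end, the second ';'-separated attribute, and the full attribute column. The sample row, its attribute names and the file names are hypothetical. As in the paste command, the coordinates are copied unchanged.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One hypothetical meta-exon GTF row; the attribute names are made up.
printf '1\tensembl\texon\t100\t200\t.\t+\t.\tgene_id "ENSG1"; meta_exon_id "ME1";\n' > sample.gtf

# Columns 1, 4, 5, then the second attribute, then the whole attribute column.
awk 'BEGIN { FS = OFS = "\t" }
  { split($9, attr, ";"); print $1, $4, $5, attr[2], $9 }' sample.gtf > sample.bed

cat sample.bed
```

The fourth column keeps the leading space that follows the ';' separator, exactly as cut -d';' -f2 does. One difference: if a row has no ';' in column 9, cut returns the whole field while this sketch leaves the fourth column empty.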