Summary Statistics Plot

Author: Peter-Bram 't Hoen
Date: 12-November-2013
Density plots of important QC statistics, separated by biobank, were created from this file: summary stats for different biobanks
The first 10 principal components from Lude's PCA on the gene counts (PCA Lude) were correlated with these summary statistics. Pearson correlations can be found here. Plots of PCA loadings against summary statistics can be found here.

Decision on samples to be repeated from run 1

On November 13, the BIOS management team decided to repeat samples that had fewer than 30M past-filter reads / mappable reads (15M paired-end reads). In total, 2240 samples were run once and 30 samples were run twice; the merged files for these 30 samples were also in the output database, giving 2330 run_ids (2240 + 2 x 30 + 30). The number of unique samples is 2270. Of these, 2154 passed the threshold before or after merging and 116 did not. The list of samples not passing the threshold can be found here.
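
The bookkeeping above can be checked with a few lines of Python. This is only a sketch: the counts are copied from the text, while the function name needs_repeat and the variable names are illustrative and not part of the BIOS pipeline. The threshold is assumed to apply to the total number of past-filter reads.

```python
# Counts taken from the paragraph above.
RUN_ONCE = 2240      # samples sequenced once
RUN_TWICE = 30       # samples sequenced twice
PASSED = 2154        # samples passing the threshold before or after merging

# Each twice-run sample contributes two run_ids plus one merged file.
run_ids = RUN_ONCE + 2 * RUN_TWICE + RUN_TWICE
unique_samples = RUN_ONCE + RUN_TWICE
failed = unique_samples - PASSED


def needs_repeat(n_reads, threshold=30_000_000):
    """Hypothetical helper: True if a sample has fewer reads than the threshold."""
    return n_reads < threshold


print(run_ids, unique_samples, failed)  # 2330 2270 116
```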

PCA on samples that are included after removal of clear sample mix-ups and all Amsterdam samples

Author: Lude Franke
Date: December 10
PCA conducted on 2188 samples: from the original 2270 samples, the sample mix-ups were removed (see:), as were the Amsterdam samples and the three samples with a very low number of reads (BD1NW4ACXX-3-27, AC1JV9ACXX-1-10, AD1NE2ACXX-5-22).
Principal components can be found in the attachment here.

Samples that are outliers (correlation with first principal component < 0.93):
Sample Comp1
AD2DATACXX-6-6 -0.893288
BD1NYRACXX-6-10 -0.894323
AD1NNNACXX-8-7_BD2D5MACXX-6-7 -0.896686
AD2CJPACXX-8-9 -0.912417
AD1NE2ACXX-1-18 -0.912594
BC1C8DACXX-7-15 -0.919944
AD1NE2ACXX-1-19 -0.922991
AD1NNNACXX-6-18 -0.9262
BD2CPRACXX-7-3 -0.927969
BD1NW4ACXX-2-10 -0.928871
AD1NP0ACXX-8-23 -0.929918
AD1NFNACXX-8-4 -0.930331
AC1C40ACXX-5-8 -0.933724
AD1NE2ACXX-5-4 -0.934033
AD1NNNACXX-5-7 -0.934122
BD1NRGACXX-8-23 -0.935606
AD10W1ACXX-8-9 -0.935622
AC1JV9ACXX-5-10 -0.935901
BC1C19ACXX-6-19 -0.935993
BD2CPRACXX-5-5 -0.936282
AD2CJPACXX-1-5 -0.936736
BC1C8DACXX-6-22 -0.937006
AD1NP0ACXX-2-10 -0.939364
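
A minimal sketch of how such a cutoff could be applied, assuming it is meant on the absolute value of the correlation (all listed values are negative). The cutoff is left as a parameter because the table contains absolute values both below and above 0.93. The function name flag_outliers is illustrative; the three correlations are copied from the table.

```python
def flag_outliers(pc1_corr, cutoff):
    """Return sample ids whose |correlation with PC1| is below cutoff, weakest first."""
    flagged = [sample for sample, r in pc1_corr.items() if abs(r) < cutoff]
    return sorted(flagged, key=lambda sample: abs(pc1_corr[sample]))


# Three rows from the table above (first, middle and last).
pc1_corr = {
    "AD2DATACXX-6-6": -0.893288,
    "AD1NP0ACXX-8-23": -0.929918,
    "AD1NP0ACXX-2-10": -0.939364,
}

print(flag_outliers(pc1_corr, 0.93))
print(flag_outliers(pc1_corr, 0.94))
```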

Summary Statistics Plots on 2188 samples

Author: Peter-Bram 't Hoen
Date: 12-December-2013

Seven outliers were manually flagged based on a relatively low percentage of mapped reads or relatively low exon or gene correlations. This is how they behave in Lude's principal component analysis: outlier behavior.
BD1NYRACXX-6-10 too low percentage of mapped reads, outlier on principal component 1,4,5,6
AD2CJPACXX-8-9 low exon correlation, outlier on principal component 1,11,14
BD1NYRACXX-4-20 low percentage of unique mappings, not an outlier in pca
BD24PGACXX-3-13 low percentage of mapped reads, not an outlier in pca
AC1C40ACXX-4-4 low percentage of exon mapping, not an outlier in pca
BC1C19ACXX-8-7 low percentage of mapped reads, not an outlier in pca
BD1NR9ACXX-7-27 low percentage of mapped reads, outlier on principal component 4

Proposal: exclude BD1NYRACXX-6-10, AD2CJPACXX-8-9 and BD1NR9ACXX-7-27 (degraded?). These have now been put on the blacklist. See:

Principal components were correlated with the available QC parameters (including 5' and 3' bias and gender-specific expression): correlations, scatter plots

PC1: number of reads, but also exon and gene correlations and the difference between exon and gene correlations. This possibly explains discrepancies between exon and gene correlations: discrepant samples are usually samples with a low number of reads.
PC2: percentage GC and biobank, but also number of duplicates. These all seem confounded.
PC3: percentage multiple mappings
PC4: gender and median 5' bias and possibly RNA degradation. Exon expression needs to be evaluated to confirm that.
PC5: XIST and Y-chromosomal expression, likely gender effect
PC6: percentage GC and multimappings
PC7+8: nothing obvious
PC9: ratio exon/genome mapped, can reflect genomic DNA contamination, perhaps also intronic and thus pre-RNA content
PC10: nothing obvious
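
The kind of analysis summarized above can be sketched as follows: compute the Pearson correlation between each principal component and each QC parameter, then report the QC parameter with the largest absolute correlation per component. This is an illustration only. The data are made-up toy values, and the names pearson, strongest_qc_per_pc, n_reads and pct_gc are hypothetical, not taken from the actual analysis.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def strongest_qc_per_pc(pc_scores, qc_params):
    """For each PC, return (QC parameter with largest |r|, r)."""
    result = {}
    for pc, scores in pc_scores.items():
        corrs = {name: pearson(scores, values) for name, values in qc_params.items()}
        best = max(corrs, key=lambda name: abs(corrs[name]))
        result[pc] = (best, corrs[best])
    return result


# Toy data for five samples (invented for illustration).
pc_scores = {
    "PC1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "PC2": [2.0, -1.0, 0.5, -2.0, 1.0],
}
qc_params = {
    "n_reads": [10.0, 20.0, 30.0, 40.0, 50.0],
    "pct_gc": [48.0, 45.0, 46.5, 44.0, 47.0],
}

result = strongest_qc_per_pc(pc_scores, qc_params)
for pc, (name, r) in result.items():
    print(pc, name, round(r, 3))
```

As PC2 in the list above shows, a high correlation with one parameter does not rule out confounding with another, so the full correlation table and the scatter plots remain the primary evidence.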

CODAM sample mixups

Author: Dasha Zhernakova
Date: 12-December-2013

In the CODAM dataset, MixupMapper identified one sample swap. The original sample id conversion table contained the following:

genotype id 2495 -> RNA-seq id AD10W1ACXX-5-18
genotype id 2345 -> RNA-seq id AD10W1ACXX-8-11

MixupMapper suggested that these samples were swapped and that the correct conversion table is:

genotype id 2495 -> RNA-seq id AD10W1ACXX-8-11
genotype id 2345 -> RNA-seq id AD10W1ACXX-5-18

If the sample ids are swapped in this way, the genotype concordance indeed increases from a low to a normal level.

Phenotype information says that:
2345 is female
2495 is male

XIST Expression:
AD10W1ACXX-8-11 doesn't have any reads mapping to XIST
AD10W1ACXX-5-18 (normalized) expression is 19.06149483

Mean chrY genes' expression (normalized):
AD10W1ACXX-8-11: 1.807537771
AD10W1ACXX-5-18: 0.467295688
(in AD10W1ACXX-5-18 the expressed genes are pseudogenes)

ChrX heterozygosity rate:
2345: 0.277410392
2495: 0.001561367

These results suggest that RNA-seq sample ids were swapped and the correct conversion table is:

genotype id 2495 -> RNA-seq id AD10W1ACXX-8-11
genotype id 2345 -> RNA-seq id AD10W1ACXX-5-18
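
The reasoning above can be sketched in Python. This is not how MixupMapper works; it only replays the sex check for this one pair. The values are copied from this section, except that XIST expression for AD10W1ACXX-8-11 is entered as 0.0 as a stand-in for "no reads mapping to XIST". The function name infer_sex_for_pair is hypothetical, and the rule is relative (it compares the two samples with each other), so it is only meaningful for a pair known to contain one male and one female.

```python
phenotype_sex = {"2345": "female", "2495": "male"}
chrx_het = {"2345": 0.277410392, "2495": 0.001561367}

expr = {
    "AD10W1ACXX-8-11": {"xist": 0.0, "chry": 1.807537771},  # xist: stand-in for no reads
    "AD10W1ACXX-5-18": {"xist": 19.06149483, "chry": 0.467295688},
}


def infer_sex_for_pair(expr):
    """Call the sample with the highest XIST female and the highest chrY male."""
    female = max(expr, key=lambda s: expr[s]["xist"])
    male = max(expr, key=lambda s: expr[s]["chry"])
    if female == male:
        raise ValueError("XIST and chrY evidence disagree")
    return {female: "female", male: "male"}


# Genotype side: the female is expected to have the higher chrX heterozygosity.
female_genotype = max(chrx_het, key=chrx_het.get)
genotype_agrees = phenotype_sex[female_genotype] == "female"

# RNA-seq side: match each genotype id to the RNA-seq id of the same sex.
inferred = infer_sex_for_pair(expr)
rna_by_sex = {sex: rna_id for rna_id, sex in inferred.items()}
corrected = {gid: rna_by_sex[sex] for gid, sex in phenotype_sex.items()}

print(genotype_agrees)
for gid, rna_id in sorted(corrected.items()):
    print("genotype id", gid, "-> RNA-seq id", rna_id)
```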

