Version 1 (modified by Rick, 11 months ago)

Metadatabase

The BIOS project has generated RNA-sequencing and DNA methylation data for over 4000 individuals. Apart from these data, GoNL-imputed genotypes were generated from existing genotypes, and several phenotypes and demographic variables were collected for the same set of samples. A highly flexible, sample-oriented metadatabase (MDb) was created to manage the dynamic generation of this large-scale multi-omic data set.

The MDb is a non-relational database (CouchDB, http://couchdb.apache.org/) that stores records as JSON and uses JavaScript for querying. It also has an HTTP API, which makes it suitable for programmatic access from the GRID, e.g. by the alignment pipeline.

Each record, or document, is a sample (individual) within the BIOS project and has a unique identifier. Each document has a predefined structure according to our database schema (https://git.lumc.nl/rp3/bios-schema). Custom Python scripts are used to update or modify the database (https://git.lumc.nl/rp3/bios-mdb).

Access to the metadatabase (MDb) is restricted; please contact Leon Mei or Maarten van Iterson.

Description of MDb content

The MDb contains as much meta-information as possible for all samples and data types: location of (raw) data on srm, md5 checksum verification, quality control information, links between the different identifiers used (person_id, dna_id, etc.) and phenotype information.

Every sample's meta-information is encoded in a CouchDB document. Each document has a unique identifier, the bios_id, which is the biobank name (CODAM, LL, LLS, NTR, RS or PAN) concatenated with the person_id, separated by a "-", e.g. CODAM-2001. The bios_id is not suitable for use in the public domain (e.g. an EGA upload), so a unique, non-identifying identifier has been created for each individual: the uuid.
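
As a minimal sketch of these identifier conventions (not the project's own code), the Python helpers below compose and split a bios_id and generate an identifier in the uuid format. All function names are hypothetical. The uuid recipe follows the 2014-12-03 entry under Updates (first 8 characters of a random UUID, upper-cased, with a BIOS prefix) and assumes Python's uuid module as a stand-in for uuidgen -r.

```python
import re
import uuid

# Biobank names as listed in the text above.
BIOBANKS = ("CODAM", "LL", "LLS", "NTR", "RS", "PAN")

# Shape of the public identifier, e.g. BIOS71A89511.
PUBLIC_ID = re.compile(r"BIOS[0-9A-F]{8}")


def make_bios_id(biobank, person_id):
    """Join biobank name and person_id with a '-' (hypothetical helper)."""
    if biobank not in BIOBANKS:
        raise ValueError("unknown biobank: %r" % (biobank,))
    return "%s-%s" % (biobank, person_id)


def split_bios_id(bios_id):
    """Split on the first '-' only, so any later '-' stays in the second part."""
    biobank, person_id = bios_id.split("-", 1)
    return biobank, person_id


def new_public_id():
    """Random (version 4) UUID: first 8 hex characters, upper case, 'BIOS' prefix."""
    return "BIOS" + uuid.uuid4().hex[:8].upper()


def is_public_id(value):
    """Check that a string has the BIOS + 8 upper-case hex characters shape."""
    return PUBLIC_ID.fullmatch(value) is not None


print(make_bios_id("CODAM", "2001"))    # CODAM-2001
print(split_bios_id("CODAM-2037"))      # ('CODAM', '2037')
print(is_public_id(new_public_id()))    # True
```

Eight hex characters allow about 4.3 billion distinct values, so a real implementation would still want to check each new identifier against those already issued.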

Every update of a sample in the database is recorded by incrementing a revision number, so it is always possible to undo a wrong update. The attachment of this page is a JSON file representing one sample's information in the metadatabase (the content of the file can be pasted into a JSON viewer, e.g. http://jsonviewer.stack.hu).

Description of available views

Views are the way to extract information from a CouchDB database. Views are organized into designs; each design contains a number of views related to a particular kind of information that can be extracted from the MDb. For example, the design “EGA” currently contains, among others, the views “freeze1RNASeq”, which extracts the samples for which RNAseq data has been uploaded to EGA, and “freeze1Methylation”, which does the same for the DNA methylation data.

The available designs and their views are:

design:        view:
EGA            freeze1RNASeq, freeze1Methylation, freeze2RNASeq, freeze2Methylation
Files          getFastq, getIdat
Identifiers    getIds
Phenotypes     allPhenotypes, cellCounts, minimalPhenotypes
Runs           getGenotypes, getMethylationRuns, getRNASeqRuns
Samplesheets   rnaseqSamplesheet, methylationSamplesheet
Verification   md5

Note: We can always add views if necessary; please contact Maarten van Iterson.

Accessing the MDb

Views can be downloaded as JSON documents by making a GET request. Most programming languages have utilities for making GET requests and for transforming JSON documents. Some programming languages, e.g. Java and Python, have an API for CouchDB. There are also several online tools for converting JSON documents to CSV files.

Please note that it is usually better to download a view once and then work on the downloaded file. This way you only have to enter your password once and you are not affected by network connectivity problems.
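
A minimal Python sketch of this download-once workflow, using only the standard library. The URL layout is taken from the curl examples further down this page; the function names, the use of HTTP basic authentication and the output file name are assumptions. Certificate verification is left enabled here, whereas the curl examples disable it with -k.

```python
import base64
import json
import urllib.parse
import urllib.request

# Base URL as used in the curl examples on this page.
MDB_URL = "https://metadatabase.bbmrirp3-lumc.vm.surfsara.nl:6984/bios"


def view_url(design, view, base=MDB_URL, reduce=False):
    """Build the URL of a CouchDB view: <db>/_design/<design>/_view/<view>."""
    query = urllib.parse.urlencode({"reduce": "true" if reduce else "false"})
    return "%s/_design/%s/_view/%s?%s" % (base, design, view, query)


def download_view(design, view, username, password, path):
    """Fetch a view once (basic auth assumed) and save the JSON to `path`."""
    request = urllib.request.Request(view_url(design, view))
    token = base64.b64encode(("%s:%s" % (username, password)).encode("utf-8"))
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    with open(path, "w") as handle:
        json.dump(data, handle)
    return data


# Building the URL needs no network access:
print(view_url("Identifiers", "getIds"))
```

Calling download_view("Identifiers", "getIds", username, password, "getIds.json") would then be expected to write the same kind of file as the curl example; later analysis can read that file without contacting the server again.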

Access the metadatabase using R

We have developed the R package BIOSRutils (https://git.lumc.nl/rp3/biosrutils) for easy access to the MDb and processed datasets. BIOSRutils is available on the VM for R version 3.2.0 (start R with the command R-3.2.0 from the command line). The current version, 0.0.1, is still a development version; several of the intended features are not yet fully implemented.

BIOSRutils uses a configuration file to read in your MDb username and password, so that you do not have to type it every time you use the MDb.

Create a file called .biosrutils, store it in your home directory on the VM (/home/username) and add as the first line:

usrpwd: 'username:password'

Start R-3.2.0 and load the library:

> library(BIOSRutils)

Several predefined variables are available, such as the URLs of the current MDb and RDb, as well as your provided username and password (USRPWD). All variable names are capitalized to minimize interference with your own code.

> ls()
[1] "BIOBANKS"     "DATASETS"     "MDb"          "PROXY"        "RDb"         
[6] "RP3DATADIR" "SRMBASE"      "USRPWD"       "VIEWS"   

The BIOSRutils package provides the function getView to extract a particular view from the MDb. All available views are stored in the global variable VIEWS. Use the regular way to get help in R, e.g.:

> ?getView

For example, to extract all phenotype information for all samples, use the allPhenotypes view from the design Phenotypes:

> phenotypes <- getView(view="allPhenotypes", design="Phenotypes")

Basic R manipulations can be used to select particular information, e.g.:

> LLSMalesAbove70 <- subset(phenotypes, grepl("LLS", ids) & Sex == 0 & DNA_BloodSampling_Age > 70)

Access the MDb using curl

The content of a view can be obtained with a curl GET request. For example, for the view getIds (substitute your own username):

$ curl -X GET 'https://metadatabase.bbmrirp3-lumc.vm.surfsara.nl:6984/bios/_design/Identifiers/_view/getIds?reduce=false' -u 'username' -k -g > getIds.json
$ cat getIds.json
{"total_rows":6379,"offset":0,"rows":[
{"id":"CODAM-2037","key":[false,"CODAM"],"value":{"bios_id":"CODAM-2037","uuid":"BIOS71A89511","biobank_id":"CODAM","person_id":"2037","pheno_id":"2037","gwas_id":"2037","dna_id":"2037","rna_id":"2037","rna_note":"library-prep: succeeded","gonl_id":null,"cg_id":null,"in_rp3":false}},
...
]}

The jq tool (installed on the cloud VM) can be used for quick processing of the JSON formatted result on the command line. For example, to get just the uuid values from that view:

$ curl -X GET 'https://metadatabase.bbmrirp3-lumc.vm.surfsara.nl:6984/bios/_design/Identifiers/_view/getIds?reduce=false' -k -u username | jq -r '.rows[].value.uuid // empty'
BIOS71A89511
BIOS78A709E9
BIOS700411C4
BIOS75EAD30E
...

The JSON file can be parsed into a Python data structure as follows:

>>> import json
>>> with open('getIds.json') as handle:
...     document = json.load(handle)
...
>>> document['rows'][0]
{u'value': {u'rna_note': u'library-prep: succeeded', u'biobank_id': u'CODAM', u'cg_id': None, u'in_rp3': False, u'uuid': u'BIOS71A89511', u'dna_id': u'2037', u'gwas_id': u'2037', u'gonl_id': None, u'pheno_id': u'2037', u'rna_id': u'2037', u'person_id': u'2037', u'bios_id': u'CODAM-2037'}, u'id': u'CODAM-2037', u'key': [False, u'CODAM']}
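
Continuing from a parsed document, here is a short sketch of two common follow-ups: a rough Python equivalent of the jq filter above, and flattening the view into CSV. The row layout is the one shown in the getIds output; the helper names are made up for illustration, and one sample row is inlined so that the block runs without access to the database.

```python
import csv
import io
import json

# One row copied from the getIds output shown above, so the sketch is self-contained.
SAMPLE = json.loads("""
{"total_rows":6379,"offset":0,"rows":[
{"id":"CODAM-2037","key":[false,"CODAM"],"value":{"bios_id":"CODAM-2037","uuid":"BIOS71A89511","biobank_id":"CODAM","person_id":"2037","pheno_id":"2037","gwas_id":"2037","dna_id":"2037","rna_id":"2037","rna_note":"library-prep: succeeded","gonl_id":null,"cg_id":null,"in_rp3":false}}
]}
""")


def uuids(document):
    """Roughly jq '.rows[].value.uuid // empty': skip missing or null uuids."""
    return [row["value"]["uuid"] for row in document["rows"]
            if row["value"].get("uuid")]


def rows_to_csv(document, handle):
    """Write each row's 'value' object as one CSV line; None becomes an empty cell."""
    values = [row["value"] for row in document["rows"]]
    fields = sorted({key for value in values for key in value})
    writer = csv.DictWriter(handle, fieldnames=fields)
    writer.writeheader()
    writer.writerows(values)


print(uuids(SAMPLE))            # ['BIOS71A89511']

buffer = io.StringIO()          # replace with open('getIds.csv', 'w', newline='') for a file
rows_to_csv(SAMPLE, buffer)
print(buffer.getvalue())
```

The column order is simply the sorted field names; a fixed, hand-written list of fields would work equally well if a particular layout is needed.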

Access the MDb using Firefox via BIOS VM

You can access the MDb by running Firefox on the BIOS VM with X forwarding enabled in your ssh session: "ssh -X bios-vm.bbmrirp3-lumc.vm.surfsara.nl".

Updates

2014-05-09: For NTR, set in_rp3 = TRUE for a set of unrelated samples that pass methylation QC and have GoNLv5 imputed genotypes.

2014-06-13: Checked cg_id of LL: all NA's.

2014-06-13: Remove rnaseq info for four samples that had duplicated rnaseq_run_id's.

2014-06-13: Added 80 LL rnaseq samples (1 sample could not be added because its rna_id did not occur in the sample sheet).

2014-08-14: Added 193 PAN samples to the metadatabase.

2014-08-14: Added 971 samples (LL=37, LLS=23, NTR=816, RS=95) to the metadatabase.

2014-08-18: Added 185 NTR samples to the metadatabase; fixed some issues with merged samples.

2014-09-18: Modified the location and name of BIOS database. (Now cloudcouchdb.bbmrirp3-lumc.cloudlet.sara.nl:6984/bios)

2014-10-02: More RNAseq and methylation data have been added to the metadatabase. It currently contains 6070 samples, of which 4096 have RNAseq data and 6031 have methylation data.

2014-11-05: Added rnaseqFreeze0 view.

2014-12-03: Methylation data freeze flag is set.

2014-12-03: Added three LLS methylation technical replicates (_key=BIOS-ID-Rep).

2014-12-03: Added uuid (universally unique identifier), generated with uuidgen -r, taking the first 8 characters converted to upper case and prefixed with BIOS, e.g. BIOS2AF124EB.

2016-02-10: Add freeze 2 flag for RNAseq

2016-03-01: Add RNAseq quality control field to all freeze 2 runs

2016-03-01: Set RNAseq quality control field of the first 10 bad quality runs

2016-05-12: Fixed flipped RNAseq plate

2016-05-12: Fixed 13 detected reciprocal swaps of RNA runs

2016-05-12: Added DNA methylation freeze 2 flag

2016-05-12: Added monozygotic twin pair indicator. If the last character of the NTR pheno_id is lower-case, the individual is a monozygotic twin.

2016-05-13: Linked genotype information from monozygotic twin pairs

2016-06-01: Added new gonl identifiers
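
The monozygotic twin convention from the 2016-05-12 entry above can be expressed as a small check. This is only an illustration of the stated rule (the last character of an NTR pheno_id is lower-case); the function name and the example identifiers are hypothetical and are not real sample identifiers.

```python
def is_mz_twin(biobank_id, pheno_id):
    """True if an NTR pheno_id ends in a lower-case letter (the MZ twin flag)."""
    if biobank_id != "NTR" or not pheno_id:
        return False
    return pheno_id[-1].islower()


# Made-up identifiers, for illustration only.
print(is_mz_twin("NTR", "12345a"))   # True
print(is_mz_twin("NTR", "12345A"))   # False
print(is_mz_twin("LLS", "12345a"))   # False: the rule is stated for NTR only
```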