====== MOBA study ======


This is a placeholder page to describe and discuss the data logistics of MoBa datasets and the relevant registry data available to researchers at NORMENT.

===== MoBa - Norwegian Mother, Father and Child Cohort Study =====

Official page of the MoBa study: https://www.fhi.no/en/studies/moba/

Internal MoBa wiki with detailed description of Questionnaires: 
https://mobawiki.fhi.no/mobawiki/index.php/Questionnaires

Description of MoBaGenetics:
https://docs.google.com/document/d/1s1fmRvgUDP8m8tyW1fL95VDtLBJuUNoTsaa9lHc-Fvg/edit

MoBaPsychGen Google Drive:
https://drive.google.com/drive/folders/1O7c5mq-3c6RtbLtgvSsxnRJ_Vi0O7I1M

The total MoBa sample size is 280K, but only 239K participants have genetic data.
TBD: break down the total numbers of children, mothers and fathers in MoBa.

===== Practicalities =====

  * MoBa genotypes are available in the TSD p33 project (what's the path within p33?) and in p697 (/tsd/p697/data/durable/deCODE_nov20/)
  * MoBa phenotypes are available on a dedicated TSD p697 project (/tsd/p697/data/durable/phenotypes/)
  * Key files connecting participant ids to sentrix ids can be found at /tsd/p697/data/durable/moba_participants_p697/

Overview of the data? (# subjects, phenotypes available, ....)


Shared folders:
  * p315sharedp697 (FROM p315 TO p697)
  * p697sharedp315 (FROM p697 TO p315)

===== TSD p697 project =====

Contact Lavinia Athanasiu (<lavinia.athanasiu@medisin.uio.no>) if you need access.

We are in the process of setting up and testing the p697 infrastructure. The goal is to have
  * 20 TB storage on durable
  * 5 TB storage on cluster
  * 100K CPU hours available on Colossus
  * Shared folders with p229 (MoBa genetics), p471 (FHI) and p33 (NORMENT) to import data to p697 (ToDo: list the paths to the shared folders here)
  * 2x Linux VMs 32GB RAM, 1x Linux webserver 64 GB (NoDa), 2x Windows VMs 16 GB RAM
  * tsd-s3api for large data import/export (that's in addition to standard TSD web loader)
  * sync SUMSTAT inventory into p697
  * double-check all relevant users have access
  * provide a list of users with export rights

Non-goals:
  * Currently we don't plan to install the Ricopili pipeline on p697. All heavy processing of MoBa genetic data should happen on p33, or a pre-processed "release" should be imported from p229.
  * No plans to expand Windows VM.
    
Linux VM software:
  * MATLAB
  * python2 and python3 - both with standard modules (numpy, scipy, pandas, matplotlib)
  * R with standard modules
  * Login from Windows: connect to p697-appn-norment01 with PuTTY to log in to the appn

Windows
  * SPSS
  * STATA
  * R-Studio
  * winscp, putty, notepad++, 7zip


===== TSD p697 current issues =====

  * Some users are not able to log in to the Linux VM (after typing the password, the user gets disconnected immediately). A workaround is to log in to a Windows machine, open PuTTY, and connect to p697-tl01-l.tsd.usit.no .


===== Useful codes =====

**Outdated?** A script to convert ICD-10 codes to diagnoses is located at /tsd/p697/data/durable/scripts/npr/NPR_count_diag_MoBa_CHILDREN_ExampleScript.R

===== Approach to MoBa-phenotypes in p697 =====

__Useful resources__

MoBa project overview: [[https://www.fhi.no/en/studies/moba/for-forskere-artikler/moba-project-overview/]]

Read this before use: [[https://mobawiki.fhi.no/mobawiki/index.php/Read_this_before_analyses]]

Questionnaire-overview: [[https://mobawiki.fhi.no/mobawiki/index.php/Questionnaires]]
[[https://www.fhi.no/en/studies/moba/for-forskere-artikler/questionnaires-from-moba/]]

Instrument documentation for Q1 (week 15 of gestation) to Q8yr (child 8 years old): [[https://www.fhi.no/globalassets/dokumenterfiler/studier/den-norske-mor-far-og-barn--undersokelsenmoba/instrumentdokumentasjon/instrument-documentation-synthese.pdf]]
This document gives an overview of the questionnaire versions (the questionnaires were changed during the inclusion periods) and of the assessment tools that some MoBa questions are based on.


Useful presentation (PP) on syntax in MoBa (in Norwegian): [[https://mobawiki.fhi.no/mobawiki/index.php/User_guides]]

__Registries that MoBa data can be used with__

Directories in p697: see the relevant folders under data/durable/phenotypes/

NPR = Norwegian Patient Registry (mostly secondary health care)\\ [[https://www.helsedirektoratet.no/tema/statistikk-registre-og-rapporter/helsedata-og-helseregistre/norsk-pasientregister-npr/innhold-og-kvalitet-i-npr]]

MBR = Medical Birth Registry of Norway (note: the folder and filename in p697 use MBRN).

KUHR = KUHR database (health reimbursement; also covers primary health care). KUHR is available in other projects; it is not yet confirmed whether it is available in p697.\\ [[https://www.helsedirektoratet.no/tema/statistikk-registre-og-rapporter/helsedata-og-helseregistre/kuhr]]

__Questionnaires and Registers in p697__

Questionnaires in p697:
  * location: data->durable->phenotypes->mobaQ->PDB2445_MoBa_V12
  * **The variables have names and labels.**
  * Variable names (columns in the SPSS file in p697) have the form xxnn and correspond to names in [[https://mobawiki.fhi.no/mobawiki/index.php/Questionnaires#Synthesis_Instrument_Documentation]], in the section annotated e.g. "FF12".\\ In the SPSS file, under "variable view", the column "label" shows which forms (SKJEMA) each variable is included in, as well as the question asked in the form.\\ Always check that the variable name and label correspond before analysis, as the annotations on the MoBa wiki can contain errors compared to the SPSS file in p697.


MBRN:
  * location: data->durable->phenotypes->mobaQ->PDB2445_MoBa_V12
  * The file PDB2445_MBRN_541_v12 holds information from the Medical Birth Registry of Norway linked to MoBa identifiers (not Sentrix IDs)

NPR-data in p697:
  * location: data->durable->phenotypes->npr
  * NPR compatible with "Phenotools": data->durable->phenotypes->npr->NPR_2020_phenotools


__Key info when analysing MoBa phenotypes__

Identifiers:
  * PREG_ID_xxxx: ID for pregnancy. 1 unique ID per pregnancy => twins share PREG_ID
  * BARN_NR: number for child within the same pregnancy => twins have different BARN_NR
  * M_ID: 1 unique ID per mother => siblings share M_ID and have different PREG_ID
  * F_ID: 1 unique ID per father
  * Sentrix ID (SENTRIXID): ID for the genetic sample.
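
The identifier rules above can be illustrated with a small sketch on toy data. All IDs below are invented, and the column is called PREG_ID here for brevity (in the real files it carries the PDB suffix, PREG_ID_xxxx). This is only an illustration of the logic, not a description of the actual file layout.

<code>
# Toy data only: all IDs are invented for illustration.
ids <- data.frame(
  PREG_ID = c("P1", "P1", "P2", "P3"),
  BARN_NR = c(1, 2, 1, 1),
  M_ID    = c("M1", "M1", "M1", "M2"),
  stringsAsFactors = FALSE
)

# More than one child per pregnancy => multiple birth (e.g. twins share PREG_ID).
children_per_preg <- table(ids$PREG_ID)
multiple_births <- names(children_per_preg)[children_per_preg > 1]

# More than one distinct pregnancy per mother => siblings share M_ID.
pregs_per_mother <- tapply(ids$PREG_ID, ids$M_ID, function(x) length(unique(x)))
mothers_with_siblings <- names(pregs_per_mother)[pregs_per_mother > 1]

# PREG_ID together with BARN_NR identifies one child uniquely.
ids$child_key <- paste(ids$PREG_ID, ids$BARN_NR, sep = "_")
</code>

In practice this means that a child-level analysis should be keyed on the pregnancy ID together with BARN_NR, not on the pregnancy ID alone.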

__Phenotools R package for analysing phenotype data in MoBa__

[[https://mobawiki.fhi.no/mobawiki/index.php/Phenotools]]

This package can be used to:
  * analyse scale-level data for items in MoBa questionnaires
  * combine NPR, MoBa questionnaire data, MBRN and generated variables from the MoBa wiki.

Installation and instructions for the package: [[https://github.com/psychgen/phenotools]]. Note: if you need to switch to a newer version of R (e.g. 4.0) to run the package, go to "Tools" > "Global Options" > "General" in RStudio and change the R version.

  * "PDB2445_full_linked_20.readme" explains "PDB2445_full_linked_20.sav" that is the NPR-file currently in use for 

**Note: specific adaptations to the curate_dataset function that must be used in p697 because the data are organised differently**:

<code>
library(phenotools)

# Pre-load the NPR file (needed because NPR is coded differently in p697 than in p471)
preprocd_npr <- preload_npr(
  npr_data_root_dir = "//tsd-evs/p697/data/durable/phenotypes/npr/NPR_2020_phenotools/",
  npr_filename = "PDB2445_full_linked_20.sav")

npr_and_mobaQ <- curate_dataset(
  variables_required = list(npr = c(
    # placeholders: xxnn = variable name in MoBa questionnaire, FFxx = ICD code in NPR (e.g. FF32);
    # add further variables as needed
    "xxnn", "FFxx"
  )),
  moba_data_root_dir = "//tsd-evs/p697/data/durable/phenotypes/mobaQ/PDB2445_MoBa_V12/", # change to your directory
  PDB = "2445",
  moba_data_version = 12,     # use 11 for v11 and 12 for v12 of the questionnaires
  completion_threshold = 0.5, # modify to your needs
  return_items = FALSE,       # TRUE also returns item-level scale data; note that this can become quite heavy
  consistent_items = FALSE,   # TRUE for repeated measures
  transformations = NULL,
  log = NULL,
  out_format = "merged_df",   # can be changed to list if you want
  npr_preprocessed = preprocd_npr, # the pre-loaded NPR data from above
  override_filenames = c("MBRN = PDB2445_MBRN_541_v12.sav"))
</code>
If using MoBa questionnaire data, replace npr in variables_required in the code above with moba.

===== MoBa-genotypes in p697 =====

__Key info when analysing Moba genotypes__

The SENTRIXID is unique to each genotype run, i.e. any individual who has been genotyped multiple times will have multiple SENTRIXIDs (but there should never be a duplicated SENTRIXID). The SENTRIXID is made up of 2 pieces of information: the Sentrix BeadChip ID, which essentially corresponds to the array BeadChip used to genotype the sample, and the plate positional information.

E.g. if the SENTRIXID was 1632405013_R01C01, then 1632405013 is the ID of the BeadChip used to genotype the sample, and R01C01 indicates that the sample was in row 1 and column 1 of the plate (usually 96-well plates are used).
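
As a minimal sketch, a SENTRIXID can be split into these two pieces in R as shown here. The first ID is the example from the text; the second ID is invented to show that the code works on a vector. The variable names are only illustrative.

<code>
# First ID is the example above; the second is invented for illustration.
sentrix_ids <- c("1632405013_R01C01", "1632405013_R02C01")

# There should never be a duplicated SENTRIXID.
stopifnot(anyDuplicated(sentrix_ids) == 0)

# Split on the underscore: BeadChip ID before it, position after it.
parts <- strsplit(sentrix_ids, "_", fixed = TRUE)
beadchip_id <- vapply(parts, `[`, character(1), 1)
position    <- vapply(parts, `[`, character(1), 2)

# Extract row and column numbers from a position such as "R01C01".
row_nr <- as.integer(sub("^R([0-9]+)C([0-9]+)$", "\\1", position))
col_nr <- as.integer(sub("^R([0-9]+)C([0-9]+)$", "\\2", position))
</code>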

Example of how families would be coded in the fam files (the real files only use the SENTRIXID as IID).
{{:moba_pedigree_fam_file.png?200|}}

The FID was originally generated with numerals to correspond to each trio or collection of families based on the reported mothers and fathers.

KING (https://www.kingrelatedness.com/) was used during the QC to confirm and build the pedigree. Where relatedness exists between the coded families, they are merged into a single family. For example, if there are full siblings (FS) within the parent generation of families 1 and 2, and of families 3 and 4, the families are updated with numerals that dummy-code the parents, thereby capturing the FS relationship. I’ve also created an example family 5_6 where there is an across-generation half-sibling (HS) relationship between individuals r and t:
{{:moba_pedigree_extended_fam_file.png?200|}}

Release 1 (98k) and release 2 fam files will have as much family information included as possible, but individuals and families not passing QC will not be included in these files. In the directory /tsd/p697/data/durable/projects/moba_qc_imputation/resources/, the file genotyped_pedigree has the initial pedigree based on the MoBa reported information. It also has the linkage between SENTRIXID and MoBa ID (F_ID, M_ID, and PREG_ID_BARN_NR).
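
A rough sketch of how such a linkage could be used to attach SENTRIXIDs to child phenotypes is given here. Everything in it is an assumption for illustration: the toy tables, the column layout, and the "PREG_ID underscore BARN_NR" key format are invented, so check the actual columns of genotyped_pedigree before adapting it.

<code>
# Hypothetical linkage table; the real genotyped_pedigree layout may differ.
linkage <- data.frame(
  SENTRIXID       = c("1632405013_R01C01", "1632405013_R02C01", "1632405013_R03C01"),
  PREG_ID_BARN_NR = c("P1_1", "P1_2", NA),   # children only
  M_ID            = c(NA, NA, "M1"),         # mothers only
  stringsAsFactors = FALSE
)

# Hypothetical child phenotypes keyed by pregnancy ID and child number.
pheno <- data.frame(
  PREG_ID = c("P1", "P1", "P2"),
  BARN_NR = c(1, 2, 1),
  score   = c(10, 12, 9),
  stringsAsFactors = FALSE
)

# Build the same child key in the phenotype table (assumed key format).
pheno$PREG_ID_BARN_NR <- paste(pheno$PREG_ID, pheno$BARN_NR, sep = "_")

# Keep only the child rows of the linkage, then left-join on the child key.
children <- linkage[!is.na(linkage$PREG_ID_BARN_NR), c("SENTRIXID", "PREG_ID_BARN_NR")]
linked <- merge(pheno, children, by = "PREG_ID_BARN_NR", all.x = TRUE)

# Children without a genotyped sample keep SENTRIXID = NA.
n_unlinked <- sum(is.na(linked$SENTRIXID))
</code>

A left join (all.x = TRUE) is used so that children without genotypes stay in the table and can be counted, rather than being dropped silently.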


==== REK  ====

2016/1226 MoBa