====== Genotype quality control and imputation protocol ======
For genotype data within the NORMENT Centre



===== Purpose =====

  * to help guard against genotyping error, population stratification, sample duplication or contamination and other confounders that can affect downstream analysis results
  * to have a standardized QC applied to all samples genotyped within the NORMENT centre

===== A brief description =====

The quality control protocol is based on PLINK version 1.9.(( Christopher C Chang, Carson C Chow, Laurent CAM Tellier, Shashaank Vattikuti, Shaun M Purcell, James J Lee, Second-generation PLINK: rising to the challenge of larger and richer datasets, GigaScience, Volume 4, Issue 1, December 2015, https://doi.org/10.1186/s13742-015-0047-8 ))<sup>,</sup>(( Shaun Purcell et al., PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses, American Journal of Human Genetics, Volume 81, Issue 3, September 2007, https://doi.org/10.1086/519795 ))
Chromosome-wide haplotypes are subsequently phased with Eagle2(( Loh, Po-Ru, et al. "Reference-based phasing using the Haplotype Reference Consortium panel." Nature genetics 48.11 (2016) ))<sup>,</sup>(( Loh, Po-Ru, Pier Francesco Palamara, and Alkes L. Price. "Fast and accurate long-range phasing in a UK Biobank cohort." Nature genetics 48.7 (2016) )) and missing variants are imputed with Minimac3(( Das S, Forer L, Schönherr S, Sidore C, Locke AE et al. Next-generation genotype imputation service and methods. Nature genetics 48, 1284–1287 (2016) )) using the first release of the [[reference_data_1kg_hrc#HRC|haplotype reference consortium (HRC) reference set]].

All program calls are wrapped in a [[https://github.com/scimerc/genimpute|bash script toolset]] for linux systems.


===== The protocol in short =====

==== Whitelist ====

  * Compile list of variants common to all batches (and reference)

==== Recoding ====

  * For every batch:
    * Extract variants from variant whitelist (if enabled)
    * Extract samples according to sample whitelist (if enabled)
    * Convert colocalized variant set to human readable format (tped) (for later QC)
    * Convert to binary plink (for easy use downstream)

==== Alignment ====

  * For every batch:
    * Create a blacklist with non-coherent co-localized variants (tped)
    * Append non reference-compatible variants to blacklist
    * Create a fliplist to align strand-flipped variants to reference
    * Purge batch file of blacklist
    * Align fliplist in batch file

==== Merge ====

  * Initialize global blacklist
  * While true
    * For every batch
      * Purge batchfile of global blacklist (if any)
    * Attempt merging resulting batchfiles
    * End on success
    * Abort entire procedure if blacklist is not empty       (if more than twice something is weird)
    * Append mismatching variants to global blacklist (derived from errors)

==== Variant HQC ====

  * Compile list of sex hq-variants
  * Compile list of non-sex hq-variants from input file
  * Extract all hq-variants from input file and make hq plink set
  * LD-prune hq-variants genotype data
  * Impute sex once with all standard hq-variants
  * If sex could be imputed for enough individuals, impute it once again after     sex-chromosome HWE tests

==== Individual QC ====

  * Exclude possibly contaminated (mean+5sd heterozygosity) and low-coverage (80%) individuals
  * Erase parent information for any dysfunctional families
  * Annotate relatedness information

==== Variant QC ====

  * Exclude variants with:
    * low coverage
    * Hardy-Weinberg disequilibrium (separating sex and non-sex chromosomes)
    * high rate of Mendel errors in eventual trios
    * more stringent control Hardy-Weinberg equilibrium
    * eventual batch effects (FDR=0.5) (if more than a single batch present)

==== Variant HQC ====

  * Compile list of sex hq-variants
  * Compile list of non-sex hq-variants from input file
  * Extract all hq-variants from input file and make hq plink set
  * LD-prune hq-variants genotype data
  * Impute sex once with all standard hq-variants
  * If sex could be imputed for enough individuals, impute it once again after     sex-chromosome HWE tests

==== PCA etc. ====

  * Compute genetic principal components
  * Compute final individual coverage statistics

==== Eagle phasing ====

==== MaCH 3 imputation ====


===== Example =====



================================================================================
genimpute.sh -- Mon May  6 17:04:49 CEST 2019
================================================================================

genotype files:

  Paris_dlb_cases.grch37_id_update.bed
  Paris_dlb_controls.grch37_id_update.bed

================================================================================

  bfdr = 0.5
  execommand = bash
  freqhq = 0.2
  freqstd = 0.01
  genomeblacklist = highLD_b37.bed
  genomebuild = b37
  genomemap = genetic_map_b37_withX.txt.gz
  hvm = 1
  hvmmax = 0.2
  hvmmin = 0.1
  hvmsdn = 5
  hweflag = midp include-nonctrl
  hweneglogp = 12
  hweneglogp_ctrl = 4
  igroupsize = 1000
  infosep = :
  intersect = true
  log_lvl = 0
  logfn = genimpute.log
  metrios = 0.05
  mevars = 0.1
  minindcount = 100
  minvarcount = 100
  phenotypes = 
  pihat = 0.9
  pihatrel = 0.1
  plinkctime = 12:00:00
  plinkmem = 
  pruneflags = --indep-pairphase 500 5 0.2
  refprefix = 
  samplemiss = 0.05
  samplewhitelist = 
  uid = 000UID
  varmiss = 0.05

================================================================================

==== Whitelist ====


==== Recoding ====


==== Alignment ====


==== Merge ====


==== Variant HQC ====


  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/d_hqset_tmp_draft 
                --not-chr 23,24 --exclude range lib/data/highLD_b37.bed 
                --geno 0.01
                --maf 0.2
                --hwe 1.E-4 midp include-nonctrl
                --make-just-bim
                --out Paris_dlb_testQQ_93ef90/.i/qc/d_hqset_tmp_nonsex
  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/d_hqset_tmp_draft 
                --chr 23,24 --exclude range lib/data/highLD_b37.bed 
                --geno 0.01
                --maf 0.2
                --make-just-bim
                --out Paris_dlb_testQQ_93ef90/.i/qc/d_hqset_tmp_sex
  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/d_hqset_tmp_draft 
                --not-chr 23,24 --exclude range lib/data/highLD_b37.bed 
                --geno 0.05
                --maf 0.2
                --hwe 1.E-4 midp include-nonctrl
                --make-just-bim
                --out Paris_dlb_testQQ_93ef90/.i/qc/d_hqset_tmp_nonsex
  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/d_hqset_tmp_draft 
                --chr 23,24 --exclude range lib/data/highLD_b37.bed 
                --geno 0.05
                --maf 0.2
                --make-just-bim
                --out Paris_dlb_testQQ_93ef90/.i/qc/d_hqset_tmp_sex
  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/d_hqset_tmp_draft 
                --not-chr 23,24 --exclude range lib/data/highLD_b37.bed 
                --geno 0.1
                --maf 0.2
                --hwe 1.E-4 midp include-nonctrl
                --make-just-bim
                --out Paris_dlb_testQQ_93ef90/.i/qc/d_hqset_tmp_nonsex
  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/d_hqset_tmp_draft 
                --chr 23,24 --exclude range lib/data/highLD_b37.bed 
                --geno 0.1
                --maf 0.2
                --make-just-bim
                --out Paris_dlb_testQQ_93ef90/.i/qc/d_hqset_tmp_sex
  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/d_hqset_tmp_hq  --indep-pairphase 500 5 0.2
                --out Paris_dlb_testQQ_93ef90/.i/qc/d_hqset_tmp_hq_LD
  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/d_hqset_tmp_hq_LDpruned 
                --impute-sex ycount
                --make-bed
                --out Paris_dlb_testQQ_93ef90/.i/qc/d_hqset_tmp_hq_LDpruned_isex

==== Individual QC ====


  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/c_proc --keep Paris_dlb_testQQ_93ef90/.i/qc/e_indqc_tmp_out.clean.id
                --geno 0.1
                --make-just-bim
                --out Paris_dlb_testQQ_93ef90/.i/qc/e_indqc_tmp_hcv
  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/c_proc --keep Paris_dlb_testQQ_93ef90/.i/qc/e_indqc_tmp_out.clean.id
                --extract Paris_dlb_testQQ_93ef90/.i/qc/e_indqc_tmp_hcv.mrk
                --mind 0.05
                --make-just-fam
                --out Paris_dlb_testQQ_93ef90/.i/qc/e_indqc_tmp_hci
==== Variant QC ====


  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/e_indqc
                --me 0.05 0.1
                --set-me-missing
                --make-bed
                --out Paris_dlb_testQQ_93ef90/.i/qc/f_varqc_tmp_nome
  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/f_varqc_tmp_nome --keep Paris_dlb_testQQ_93ef90.ids
                --not-chr 23,24
                --set-hh-missing
                --geno 0.1
                --hwe 1.E-4 midp include-nonctrl
                --make-just-bim
                --out Paris_dlb_testQQ_93ef90/.i/qc/f_varqc_tmp_nonsex
  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/f_varqc_tmp_nome --keep Paris_dlb_testQQ_93ef90.ids 
                --chr 23,24
                --set-hh-missing
                --geno 0.1
                --hwe 1.E-8 midp include-nonctrl
                --make-just-bim
                --out Paris_dlb_testQQ_93ef90/.i/qc/f_varqc_tmp_sex

==== Variant HQC ====

  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/h_hqset_tmp_draft --keep Paris_dlb_testQQ_93ef90.ids
                --not-chr 23,24 --exclude range lib/data/highLD_b37.bed 
                --geno 0.01
                --maf 0.2
                --hwe 1.E-4 midp include-nonctrl
                --make-just-bim
                --out Paris_dlb_testQQ_93ef90/.i/qc/h_hqset_tmp_nonsex
  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/h_hqset_tmp_draft --keep Paris_dlb_testQQ_93ef90.ids
                --chr 23,24 --exclude range lib/data/highLD_b37.bed 
                --geno 0.01
                --maf 0.2
                --make-just-bim
                --out Paris_dlb_testQQ_93ef90/.i/qc/h_hqset_tmp_sex
  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/h_hqset_tmp_draft --keep Paris_dlb_testQQ_93ef90.ids
                --not-chr 23,24 --exclude range lib/data/highLD_b37.bed 
                --geno 0.05
                --maf 0.2
                --hwe 1.E-4 midp include-nonctrl
                --make-just-bim
                --out Paris_dlb_testQQ_93ef90/.i/qc/h_hqset_tmp_nonsex
  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/h_hqset_tmp_draft --keep Paris_dlb_testQQ_93ef90.ids
                --chr 23,24 --exclude range lib/data/highLD_b37.bed 
                --geno 0.05
                --maf 0.2
                --make-just-bim
                --out Paris_dlb_testQQ_93ef90/.i/qc/h_hqset_tmp_sex
  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/h_hqset_tmp_draft --keep Paris_dlb_testQQ_93ef90.ids
                --not-chr 23,24 --exclude range lib/data/highLD_b37.bed 
                --geno 0.1
                --maf 0.2
                --hwe 1.E-4 midp include-nonctrl
                --make-just-bim
                --out Paris_dlb_testQQ_93ef90/.i/qc/h_hqset_tmp_nonsex
  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/h_hqset_tmp_draft --keep Paris_dlb_testQQ_93ef90.ids
                --chr 23,24 --exclude range lib/data/highLD_b37.bed 
                --geno 0.1
                --maf 0.2
                --make-just-bim
                --out Paris_dlb_testQQ_93ef90/.i/qc/h_hqset_tmp_sex
  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/h_hqset_tmp_hq --keep Paris_dlb_testQQ_93ef90.ids --indep-pairphase 500 5 0.2
                --out Paris_dlb_testQQ_93ef90/.i/qc/h_hqset_tmp_hq_LD
  lib/3rd/plink --allow-extra-chr --bfile Paris_dlb_testQQ_93ef90/.i/qc/h_hqset_tmp_hq_LDpruned 
                --impute-sex ycount
                --make-bed
                --out Paris_dlb_testQQ_93ef90/.i/qc/h_hqset_tmp_hq_LDpruned_isex

==== PCA etc. ====




====== Imputation protocol v2 ======

Due to small size of individual batches, and difference between genotyping chips, 
we now designed a new QC & imputation protocol.
As of June 2021, this is work in progress.


==== Module 0. Harmonize genotype data format (for each batch) ====

The goal is to harmonize each batch data format according to the following specification.
This does not have any sample-level QC steps, i.e. it retains all samples.
For SNPs we remove variants that could not be recognized as defined above.

  * Introduce a harmonized batch identifier (herein referred to as <BATCH>), in the following format YYMM_COMMENT (year - two digits, month - to digits, with no more than 10 characters in COMMENT, COMMENT being lower-case, and no underscore or other special  characters in the COMMENT). For example: 1403_batch1.
  * Save a copy of the .fam file as ".fam.original".
  * Generate news IID (individual's identifiers) which look as follows: 1403_batch1_id00001. The identifier starts with YYMM_COMMENT (batch name), followed by _idNNNNN where the index corresponds to unity-based index in the .fam file. Update .fam file with new IID. Also, set FID to the same value so that IID and FID are the same.
  * Make sure there is one .bim/.fam/.bed/.iid set for every batch (i.e. merged across chromosomes if original files were split)
  * Have a valid CHR and BP (no zeros), in h19 build. We could revise this later, but for now let's stay at hg19.
  * Have X chromosome available in .bim file, as well as sex information in the .fam files (if missing, try to bring this from clinical records / DB CRU spreadsheets)
  * Convert A1/A2 to A/G/T/G format
  * Update rs# from chip information (if needed)
  * Update information about each batch in [[https://docs.google.com/spreadsheets/d/1cJ1Ri7-FJisSQ3vBSJvgZNlmlx_z-xbB2x3KZ8mxQqA/edit#gid=0|the spreadsheet]]

As part of Module 0 it is often necessary to perform chip-specific steps.
The info below is copied from MoBa QC & imputation pipeline, but we need to expand it further.

  * For the HumanCoreExome batches only: 
    * update SNP alleles (convert the Illumina A/B alleles to the genetic A,C,T,G alleles)
    * exclude badly performing SNPs (black-listed SNPs)
  * For the Global Screening Array batches:
    * update SNP names (to rsIDs)
  * For OmniExpress and Global Screening Array batches in hg38: 
    * update the chromosomal positions to match the build GRCh37/hg19, using liftover
  * Other arrays?

==== Module 1. Pre-imputation QC (for each batch) ====

  * Exclude MoBa participants from each batch
  * SNP QC: If there is more than 100 individuals in a batch, apply with MAF ≥ 0.01 filter
  * SNP QC: call rate ≥ 0.95 
  * Sample QC: call rate ≥ 0.98   
  * SNP QC: call rate ≥ 0.98 
  * SNP QC: Hardy-Weinberg equilibrium (HWE) pvalue ≥ 1e-10    
  * SNP QC: If there is more than 100 individuals in a batch, apply heterozygosity check (FHET +/- 5 standard deviations around the mean)
  * Sample QC: Sex violations (excluded) - genetic sex does not match pedigree sex
  * SNP QC: Remove all duplicate variants (on rs#), all indels and all strand ambiguous SNPs (A/T and C/G)
  * SNP QC: call rate per chromosome ≥ 0.9

At this stage we do not filter based on ancestry, and only run the above procedure once.

==== Module 2. Phasing & imputation (for each batch) ====

For every batch, use PRE_PHASING_QC.job from MoBa pipeline, which does do the following:

  * Remove the SNPs not present in HRC 
  * Remove SNPs with alleles different to HRC (Oxford perl script can be used for this)
  * Flip the strand where needed (Oxford perl script can be used for this)
  * Update SNP IDs to match those in HRC (Oxford perl script can be used for this)
  * Change ref/alt allele assignments where needed (Oxford perl script can be used for this)
  * Remove duplicated SNPs

==== Module 3. phasing & imputation ====

For every batch separately:

  - Perform phasing with ''shapeit2 --no-mcmc''
  - Perform imputation with TOPMed server

==== Module 4. post-imputation QC and annotations ====

Per batch:
  - Decrypt the data
  - Convert to plink format with "plink2 --hard-call-threshold 0.3"
  - Filter on INFO score (find what is the default INFO threshold for  Minimac , which is NOT 0.8)
  - If there is more than 100 individuals in a batch, apply with MAF ≥ 0.01 filter
  - SNP QC: call rate ≥ 0.95 
  - Sample QC: call rate ≥ 0.98

Merge across batch.
  - Drop SNPs showing batch effects using Cochran-Mantel-Haenszel (as in MoBa plate effects)

========================================

TBD: establish standard genetic analysis:
  - Calculate principal components from HapMap3 SNPs  overlap between 1kG and imputed data (SNP filters: MAF > 5% (or 0.5% as MoBa), HWE > 1.0e-03, MISSING RATE < 2%, no AT/GC SNPs (strand ambiguous SNPs), Remove HighLD regions, prune LD R2 < .2, 200 SNPs window: plink --indep-pairwise 200 100 0.2  (or 3000 1500 0.1 as MoBa), repeatedly until there is less than 100.000 variants
  - Project PCs to 1kG, use clustering to annotate ancestry information (as in https://github.com/norment/moba_qc_imputation/tree/master/users/of/ukb_ancestry
  - Annotate duplicates and related individuals

==== Design summary & discussions ====

  - We aim to use TOPMed reference for imputation, instead of HRC
  - We phase all samples with ''shapeit2 --no-mcmc'' option, using only HRC referece to infer haplotype information, but not using other samples from the same genotyping batch. TBD: evaluate if phasing functionality of the TopMED server has a similar feature. If not, we will phase locally using HRC, and then upload pre-phased data to TopMED to proceed with imputation.
  - We do not perform ANY sample-level quality control before phasing. All sample-level QC will happen at post-imputation step.
  - For genetic variants, we perform a very basic QC procedure prior to phasing (force hg19 genomic build, exclude duplicated SNPs, exclude indels and non-ACTG alleles, and exclude ambiguous alleles). It is important to skip MAF, HWE, missingness and other QC filters which might be unreliable because we didn't QC samples. TBD: consider if there is any additional information from genotyping that we need to take into account when we filter out badly genotyped SNPs. I assume that in hard-call genotypes this is already taken care of, i.e. all genotype calls should have reasonable confidence.
  - All batches are then merged together, and passed through HRC_1000G_check_bim check with -n flag to specify that we do not wish to exclude any SNPs based on allele frequency difference. 
  - Phasing: ''shapeit2 --no-mcmc'' (or a similar functionality in TOPMed)
  - Imputation: TOPmed reference (by uploading to https://imputation.biodatacatalyst.nhlbi.nih.gov/ )
  - Post-imputation QC procedure: this is work in progress; procedure can be customized for each project.

Things to address in post-imputation QC procedure (TBD: the list will be expanded to match protocol we use in MoBa):
- sample-level QC (call rate, HET check, etc)
- variant-level QC (MAF, HWE, INFO)
- infer relatedness (king) and ancestry information (by projecting to 1kG principal components)

