====== A workflow for genotype extraction and QC from NORMENT genotype data ======

<code>

# AD_DemGene workflow

# File rquired:
# 1. hist_miss.R
# 2. AD_sumstats_Jansenetal.txt.gz
# 3. AD_clinical.txt
# 4. Manhattan_plot.R 
# 5. QQ_plot.R 
# 6. MAF_check.R
# 7. hwe.R
# 8. check_heterozygosity_rate.R
# 9. heterozygosity_outliers_list.R
# 10. Relatedness.R
# 11. gender_check.R 
# 12. QQ_plot_assoc.R
# 13.inversion.txt




### Step 1 ### 

# Investigate missingness per individual and per SNP and make histograms.
plink --bfile /net/p01-c2io-nfs/projects/p33/users/osmanga/ClzTOP_new/NORMENT_allnew/NORMENT_allnew-merge --keep keep_subjid.txt --missing    
# output: plink.imiss and plink.lmiss, these files show respectively the proportion of missing SNPs per individual and the proportion of missing individuals per SNP.


# Generate plots to visualize the missingness results.
Rscript --no-save ./inputFiles/hist_miss.R

# Delete SNPs and individuals with high levels of missingness, explanation of this and all following steps can be found in box 1 and table 1 of the article mentioned in the comments of this script.
# The following two QC commands will not remove any SNPs or individuals. However, it is good practice to start the QC with these non-stringent thresholds.  
# Delete SNPs with missingness >0.2.
plink --bfile /net/p01-c2io-nfs/projects/p33/users/osmanga/ClzTOP_new/NORMENT_allnew/NORMENT_allnew-merge  --keep keep_subjid.txt --geno 0.2 --make-bed --out DemGene19_2

# Delete individuals with missingness >0.2.
plink --bfile DemGene19_2 --mind 0.2 --make-bed --out DemGene19_3

# Delete SNPs with missingness >0.02.
plink --bfile DemGene19_3 --geno 0.02 --make-bed --out DemGene19_4

# Delete individuals with missingness >0.02.
plink --bfile DemGene19_4 --mind 0.02 --make-bed --out DemGene19_5

###################################################################
### Step2 #### No sex check performed


###################################################
### Step 3 ### 

# Generate a bfile with autosomal SNPs only and delete SNPs with a low minor allele frequency (MAF).

# Select autosomal SNPs only (i.e., from chromosomes 1 to 22).
awk '{ if ($1 >= 1 && $1 <= 22) print $2 }' DemGene19_5.bim > snp_1_22.txt
# DemGene19_5 yields DemGene19_7 (missing DemGene19_6, becuase no sex check step)
plink --bfile DemGene19_5 --extract snp_1_22.txt --make-bed --out DemGene19_7

# Generate a plot of the MAF distribution.
plink --bfile DemGene19_7 --freq --out MAF_check
Rscript --no-save ./inputFiles/MAF_check.R

# Remove SNPs with a low MAF frequency.
plink --bfile DemGene19_7 --maf 0.05 --make-bed --out DemGene19_8
# 1073226 SNPs are left
# A conventional MAF threshold for a regular GWAS is between 0.01 or 0.05, depending on sample size.


####################################################
### Step 4 ###

# Delete SNPs which are not in Hardy-Weinberg equilibrium (HWE).
# Check the distribution of HWE p-values of all SNPs.

plink --bfile DemGene19_8 --hardy
# Selecting SNPs with HWE p-value below 0.00001, required for one of the two plot generated by the next Rscript, allows to zoom in on strongly deviating SNPs. 
awk '{ if ($9 <0.00001) print $0 }' plink.hwe>plinkzoomhwe.hwe
Rscript --no-save ./inputFiles/hwe.R

# By default the --hwe option in plink only filters for controls.
# Therefore, we use two steps, first we use a stringent HWE threshold for controls, followed by a less stringent threshold for the case data.
plink --bfile DemGene19_8 --hwe 1e-6 --make-bed --out DemGene_hwe_filter_step1

# The HWE threshold for the cases filters out only SNPs which deviate extremely from HWE. 
# This second HWE step only focusses on cases because in the controls all SNPs with a HWE p-value < hwe 1e-6 were already removed
plink --bfile DemGene_hwe_filter_step1 --hwe 1e-10 --hwe-all --make-bed --out DemGene19_9

# Theoretical background for this step is given in our accompanying article:https://www.ncbi.nlm.nih.gov/pubmed/29484742 .

############################################################
### step 5 ###

# Generate a plot of the distribution of the heterozygosity rate of your subjects.
# And remove individuals with a heterozygosity rate deviating more than 3 sd from the mean.

# Checks for heterozygosity are performed on a set of SNPs which are not highly correlated.
# Therefore, to generate a list of non-(highly)correlated SNPs, we exclude high inversion regions (inversion.txt [High LD regions]) and prune the SNPs using the command --indep-pairwise’.
# The parameters ‘50 5 0.2’ stand respectively for: the window size, the number of SNPs to shift the window at each step, and the multiple correlation coefficient for a SNP being regressed on all other SNPs simultaneously.

plink --bfile DemGene19_9 --exclude ./inputFiles/inversion.txt --range --indep-pairwise 50 5 0.2 --out indepSNP
# Note, don't delete the file indepSNP.prune.in, we will use this file in later steps of the tutorial.

plink --bfile DemGene19_9 --extract indepSNP.prune.in --het --out R_check
# This file contains your pruned data set.

# Plot of the heterozygosity rate distribution
Rscript --no-save ./inputFiles/check_heterozygosity_rate.R

# The following code generates a list of individuals who deviate more than 3 standard deviations from the heterozygosity rate mean.
# For data manipulation we recommend using UNIX. However, when performing statistical calculations R might be more convenient, hence the use of the Rscript for this step:
Rscript --no-save ./inputFiles/heterozygosity_outliers_list.R

# Output of the command above: fail-het-qc.txt .
# When using our example data/the DemGene data this list contains 2 individuals (i.e., two individuals have a heterozygosity rate deviating more than 3 SD's from the mean).
# Adapt this file to make it compatible for PLINK, by removing all quotation marks from the file and selecting only the first two columns.
sed 's/"// g' fail-het-qc.txt | awk '{print$1, $2}'> het_fail_ind.txt

# Remove heterozygosity rate outliers.
plink --bfile DemGene19_9 --remove het_fail_ind.txt --make-bed --out DemGene19_10


############################################################
### step 6 ###

# It is essential to check datasets you analyse for cryptic relatedness.
# Assuming a random population sample we are going to exclude all individuals above the pihat threshold of 0.2 in this tutorial.

# Check for relationships between individuals with a pihat > 0.2.
plink --bfile DemGene19_10 --extract indepSNP.prune.in --genome --min 0.2 --out pihat_min0.2


# The DemGene dataset is NOT known to contain parent-offspring relations. 
# The following commands will visualize specifically these parent-offspring relations, using the z values. 
awk '{ if ($8 >0.9) print $0 }' pihat_min0.2.genome>zoom_pihat.genome

# Generate a plot to assess the type of relationship.
Rscript --no-save ./inputFiles/Relatedness.R

# The generated plots show a considerable amount of related individuals (explentation plot; PO = parent-offspring, UN = unrelated individuals) in the Hapmap data, this is expected since the dataset was constructed as such.
# Normally, family based data should be analyzed using specific family based methods. In this tutorial, for demonstrative purposes, we treat the relatedness as cryptic relatedness in a random population sample.
# In this tutorial, we aim to remove all 'relatedness' from our dataset.
# To demonstrate that the majority of the relatedness was due to parent-offspring we only include founders (individuals without parents in the dataset).

plink --bfile DemGene19_10 --filter-founders --make-bed --out DemGene19_11

# Now we will look again for individuals with a pihat >0.2.
plink --bfile DemGene19_11 --extract indepSNP.prune.in --genome --min 0.2 --out pihat_min0.2_in_founders
# The file 'pihat_min0.2_in_founders.genome' shows that, after exclusion of all non-founders, only 1 individual pair with a pihat greater than 0.2 remains in the DemGene data.
# This is likely to be a full sib or DZ twin pair based on the Z values. Noteworthy, they were not given the same family identity (FID) in the DemGene data.

# For each pair of 'related' individuals with a pihat > 0.2, we recommend to remove the individual with the lowest call rate. 
plink --bfile DemGene19_11 --missing
# Use an UNIX text editor (e.g., vi(m) ) to check which individual has the highest call rate in the 'related pair'. 

# Generate a list of FID and IID of the individual(s) with a Pihat above 0.2, to check who had the lower call rate of the pair.
# Get the pair list with more than 0.2 pihat value
awk '{print $1, $3}' pihat_min0.2_in_founders.genome > pihat_pair_0.2.txt

# Get the NMISS list from missingness data from plink.imiss (in plink --bfile DemGene19_11 --missing )

awk '{print $1, $4}' plink.imiss > DemGene19_11_missing.txt
awk '{print $1, $3}' pihat_min0.2_in_founders.genome > pihat_pair_0.2.txt
awk '{print $1}' pihat_pair_0.2.txt > pihat_pair_0.2_SINGLE.txt
awk '{print $2}' pihat_pair_0.2.txt >> pihat_pair_0.2_SINGLE.txt
sort -k1 DemGene19_11_missing.txt > DemGene19_11_missing2.txt 
sort -k1 pihat_pair_0.2_SINGLE.txt > pihat_pair_0.2_SINGLE2.txt
join -1 1 -2 1 -o 2.1 1.2  DemGene19_11_missing2.txt pihat_pair_0.2_SINGLE2.txt > pihat_FINAL.txt
#join -1 1 -2 1 -o 2.1 1.2  <(sort -k1 DemGene19_11_missing.txt) <(sort -k1 pihat_pair_0.2_SINGLE.txt) > pihat_FINAL.txt
# sort -u > pihat_FINAL.txt
sort -k2 pihat_FINAL.txt | head -95 | awk '{print $1, $1}' > 0.2_low_call_rate_pihat.txt


# Press enter on keyboard
# In case of multiple 'related' pairs, the list generated above can be extended using the same method as for our lone 'related' pair.

# Delete the individuals with the lowest call rate in 'related' pairs with a pihat > 0.2 
plink --bfile DemGene19_11 --remove 0.2_low_call_rate_pihat.txt --make-bed --out DemGene19_12


# CONGRATULATIONS!! You've just succesfully completed the first tutorial! You are now able to conduct a proper genetic QC. 

# For the next tutorial, using the script: 2_Main_script_MDS.txt, you need the following files:
# - The bfile DemGene19_12 (i.e., DemGene19_12.fam,DemGene19_12.bed, and DemGene19_12.bim
# - indepSNP.prune.in

############## MDS: covariates ##################

## Create covariates based on MDS.
# Perform an MDS ONLY on DemGene data without ethnic outliers. The values of the 10 MDS dimensions are subsequently used as covariates in the association analysis in the third tutorial.
plink --bfile DemGene19_12 --extract indepSNP.prune.in --genome --out DemGene19_12
plink --bfile DemGene19_12 --read-genome DemGene19_12.genome --cluster --mds-plot 10 --out DemGene19_12_mds

# Change the format of the .mds file into a plink covariate file.
awk '{print$1, $2, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13}' DemGene19_12_mds.mds > covar_mds.txt

# The values in covar_mds.txt will be used as covariates, to adjust for remaining population stratification, in the third tutorial where we will perform a genome-wide association analysis.


############## GWAS Association ##################
# For binary traits.

# assoc
# make the pheno file
awk '{print $1,$1, $3}' clinical_MATLAB.txt | awk '{if ($3==1)$3=2; else $3=1; print $0}'> clinical.pheno

plink --bfile DemGene19_12 --assoc  --allow-no-sex  --pheno clinical.pheno --out assoc_results
# Note, the --assoc option does not allow to correct covariates such as principal components (PC's)/ MDS components, which makes it less suited for association analyses.

# logistic 
# We will be using 10 principal components as covariates in this logistic analysis. We use the MDS components calculated from the previous tutorial: covar_mds.txt.
plink --bfile DemGene19_12 --covar covar_mds.txt --logistic --allow-no-sex  --pheno clinical.pheno  --hide-covar --out logistic_results
# Note, we use the option -–hide-covar to only show the additive results of the SNPs in the output file.

# Remove NA values, those might give problems generating plots in later steps.
awk '!/'NA'/' logistic_results.assoc.logistic > logistic_results.assoc_2.logistic

# The results obtained from these GWAS analyses will be visualized in the last step. This will also show if the data set contains any genome-wide significant SNPs.

# Note, in case of a quantitative outcome measure the option --logistic should be replaced by --linear. The use of the --assoc option is also possible for quantitative outcome measures (as metioned previously, this option does not allow the use of covariates).

#################################################################

# Multiple testing
# There are various way to deal with multiple testing outside of the conventional genome-wide significance threshold of 5.0E-8, below we present a couple. 

#adjust
plink --bfile DemGene19_12 -assoc --allow-no-sex  --pheno clinical.pheno --adjust --out adjusted_assoc_results
plink --bfile DemGene19_12 --covar covar_mds.txt --logistic --allow-no-sex  --pheno clinical.pheno  --hide-covar --adjust --out adjusted_logistic_results

# This file gives a Bonferroni corrected p-value, along with FDR and others.

#####################################################################

# Generate Manhattan and QQ plots.

# These scripts assume R >= 3.0.0.
# If you changed the name of the .assoc file or to the assoc.logistic file, please assign those names also to the Rscripts for the Manhattan and QQ plot, otherwise the scripts will not run.

# The following Rscripts require the R package qqman, the scripts provided will automatically download this R package and install it in /home/{user}/ . Additionally, the scripts load the qqman library and can therefore, similar to all other Rscript on this GitHub page, be executed from the command line.
# This location can be changed to your desired directory

Rscript --no-save ./inputFiles/Manhattan_plot.R
Rscript --no-save ./inputFiles/QQ_plot.R





</code>