======= PleioFDR Analysis =======

The following page provides a basic instruction of the main steps required to conduct a complete pleioFDR analysis.

====== TL;DR ======

**Input data**
  * Ensure no sample overlap. Consider excluding cohorts if overlap is present and sub-study data available
  * Perform literature search to ensure maximum sample size achieved for phenotypes of interest. Using METAL if need to meta-analyse multiple datasets.
  * Generate sumstats using python convert or access through the inventory
  * Make sure you are handling data appropriately according to the privacy/security requirements of the dataset (e.g. MVP, 23andMe, early access PGC etc.)

**Run pleioFDR**
  * Follow steps described here ([[https://github.com/precimed/pleiofdr]])
  * Choose appropriate FDR thresholds (or trial both 0.01 or 0.05)
  * Choose appropriate exclusion regions, always excluding MHC + 8p23.1 if using psychiatric phenotypes
  * Choose number of pruning iterations, n=100 as standard

**Evaluating model fit**
  * Check conditional QQ plots for evidence of enrichment in the primary trait (leftward deflection of blue line) and cross-trait enrichment (successive leftward deflections) 
  * If inadequate model fit, do not proceed

**Locus definitions**
  * Use clumping (or FUMA) to define loci, lead SNPs and candidate SNPs from cond/conjFDR output

**Post-pleioFDR analyses:**
  * Determine effect direction for lead SNPs
  * Identify novel loci using novelty checking database and/or GWAScatalog and consider identifying loci overlapping across multiple analyses within your project if applicable
  * Functionally annotate all candidate SNPs using FUMA SNP2GENE
  * Identify mapped genes for all candidate SNPs (and possibly lead SNPs) using FUMA SNP2GENE
  * Identify gene-sets significantly enriched and pathways significantly overrepresented with mapped genes using FUMA GENE2FUNC **(N.B. genes from regions with complex LD need to be excluded from this step)**
  * Identify significant eQTLs using FUMA SNP2GENE

**Generate conditional QQ plots + Manhattan plots for final manuscript**

====== Input data ======
The input data for pleioFDR analysis consist of GWAS summary statistics.
 
There are several prerequisites to ensure a valid pleioFDR analysis:

  * **Non-overlapping GWAS participant across studies** - We can’t ensure perfectly non-overlapping datasets, and there always will be a possibility that some individuals contributed data into two studies. But it’s important to avoid systematic overlap, i.e. it’s not acceptable to run pleioFDR on two UK Biobank GWAS studies. If your GWAS is a meta-analysis of several substudies, it might be reasonable to approach the authors of the GWAS and ask to share summary stats with specific cohorts excluded. For PGC summary stats we often have per-cohort summary stats and thus can re-run the meta-analysis ourselves. This can be done with the METAL tool, typically with variance-based meta-analysis.
  * **Sufficiently powered** - pleioFDR is to large extent depending on power. The analysis typically requires sample sizes >30-50,000 and the larger the sample the better. It is therefore important to check for new GWAS publications available for relevant phenotypes to maximise the sample size, and consider including new data in our inventory (see below).
 
Finding, processing and using the data:

  * **GWAS inventory** - We maintain a large inventory of summary statistics, where all files are harmonized, and also converted to MATLAB format compatible with pleioFDR. For the overview of the data available in SUMSTAT inventory check the following table . For further information check the following wiki page . 
  * **Using new data not included in the inventory** - If you find new data that might be relevant for other projects, it’s always great to submit it to the inventory via the following google form. Before doing so, please research how to download the actual data - it can be public, available by request from authors, or require some additional application process. Note that, if you want to use the data straight away, you don’t need to submit the data to the inventory just for the purpose of converting the input data to the MATLAB format. All tools needed for this conversion are available here https://github.com/precimed/python_convert , and the process of converting is described in the pleioFDR readme file: https://github.com/precimed/pleiofdr .
  * **Data with extra security requirements** - Note that some data in SUMSTAT inventory has restricted access. It’s important to clarify with the team what are the extra steps needed to include such data in your projects. For example:
      * 23andMe data (and all revised data labels “...._with23andMe”)
      * MVP (Million Veterans Programm) data
      * Early access to PGC data (not yet published) 

====== Running pleioFDR ======  

All pleioFDR code and instructions on how to run the code are available on github at [[https://github.com/precimed/pleiofdr]].
 
When running the pleioFDR, there are three key parameters to consider.

===== Significance thresholds =====

Both 0.05 and 0.01 significance thresholds are commonly used in statistical analyses. In the context of False Discovery Rate (FDR) it implies that 5%/1% of findings are false positives. For many types of analyses these significance levels are considered to be reliable enough while not being overly conservative. Of note, this is not a general rule. For some types of analyses you might want to have higher confidence, while in other cases allowing more false positives to reduce the number of false negatives might be reasonable.

Following this convention, **we use either 0.01 or 0.05 significance thresholds** in condFDR/conjFDR analyses. Usually 0.01 is used for condFDR and 0.05 threshold is used for conjFDR analysis. However, since these thresholds are merely a convention with no “physical” rationale behind them, depending on the study you might want to use 0.05 threshold for cond FDR or to use 0.01 threshold for conj FDR.

Practically, the threshold is set using the following parameter in the config.txt file:

<code>
fdrthresh=
</code>
 
**Reasons to use a higher threshold include:**
  * Threshold 0.01 is too strict to identify adequent number of loci, especially for datasets of smaller sample sizes
  * Compensate the power inbalance due to prominent sample size difference between the pair of traits

**Reasons to use a lower threshold include:**
  * Improved confidence in findings reported
  * Reduce the number of loci identified to enable a more focused/detailed downstream analysis and discussion on the most robust findings

**In general, it might be reasonable to run both setups at the early stage of the study before deciding which setup is more sensible for the project.**

===== Exclusion regions =====
There are several regions in the genome with **“complex LD structure”**, which are typically large regions with strong LD. 

If one or both traits in a cond/conjFDR analysis are associated with a region with complex LD, it can inflate the test statistics leading to over positive output. To avoid this, **any region with complex LD known to be associated with the traits included in your analysis need to be excluded from the result report.** This means that SNPs within these regions are excluded from visulization of the condQQ plots for judgement of enrichment of associations, and functional mapping and annotation analysis.
 
Specifically, **this will not exclude these regions from your results.mat file, or from the loci file if the region is associated with your trait/traits using cond/conjFDR.** This is because each SNP is assigned a cond/conjFDR value at the SNP level based on their p-values, and so the attribution of a cond/conjFDR value is not biased by the complex LD. Nonetheless, this has implications for gene-mapping and gene-set analyses (see sections below).
 
In pleioFDR, regions are excluded for condQQ plots by adding chromosome and BP coordinates to the pleioFDR config.txt file set under the following parameter:

<code>
exclude_chr_pos= 
</code>

**The following is a list of the most important regions with complex LD and associated phenotypes. It is strongly advised to discuss with a more experienced member of the group before deciding which regions to exclude to avoid having to redo the:**
  * MHC - 6 25119106 33854733; all traits. Given the size of the MHC it is probably sensible to exclude for all analyses
  * 8p23 inversion - 8 7200000 12500000; psychiatric disorders and related traits
  * MAPT region - 17 40000000 47000000; neurological disorders e.g. Parkinson’s disease.
  * APOE region - 19 42000000 47000000; Alzheimer’s disease and autism spectrum disorder

===== Random pruning =====

Conditional FDR for a given variant with association p-values p1 and p2 in the primary and conditional traits correspondingly can be (conservatively) approximated as:

   condFDR(p1,p2) = p1/F(p1|p2)

where F(p1|p2) is the conditional cumulative distribution function (CDF) for the first phenotype given that p-values for the second phenotype are p2 or smaller. F(p1|p2) is approximated empirically.

In general, **variants in large LD blocks tend to have lower p-values than variants in smaller LD blocks** (the larger LD block is the likelier it contains a true causal variant). Thus, using all available variants to approximate F(p1|p2) will inflate condFDR estimates. 

**To mitigate this inflation, several iterations of random pruning are performed.** Each iteration of random pruning produces a subset of LD independent variants (r2<0.1 in the current implementation) where one variant is taken from each LD block at random. This subset is then used to approximate F(p1|p2) conditional CDF. Finally, mean F(p1|p2) conditional CDF is estimated taking the average over all iterations of the random pruning. 

**The more iterations you take, the closer you approach to the "actual underlying" conditional CDF** (this procedure is similar to tossing a coin, while with 10 tosses the probability of heads/tails might differ from 0.5 quite a bit, it becomes very close to 0.5 if the number of tosses is large enough). Using too few iterations of random pruning may result in unstable results (i.e. repeating the analysis with the same settings will produce slightly different sets of significant variants).

If you want to be conservative, you can repeat your analysis with an increasing number of iterations until you achieve a “stable result”, i.e. a consistent number (and identity) of loci identified. **There is, however, a pay off between time taken for each run and reliability of the results.**
 
**In practice having 100 iterations of random pruning should be enough to get robust results**, while 500 iterations is enough to essentially guarantee robust results. It may also be reasonable to use fewer iterations for the preliminary runs as it will run faster, and use more iterations for the final run as it will produce better looking conditional QQ plots and more robust findings. 

The number of iterations is set using the following paramater in the config.txt file

<code>
randprune_n= 
</code>

====== Evaluating model fit ======

It is possible that pleioFDR will identify significantly associated loci even when there is an absence of cross-trait enrichment. **It is therefore essential that plots are checked before drawing conclusions from the loci discovered.**

===== Conditional QQ plots =====

Conditional Q-Q plots are the best way to determine whether there is adequate model fit. The pleioFDR code will generate Q-Q plots as standard (format: png, fig, jpg). When evaluating a conditional QQ plot, the key elements to check are:
  * **Is there enrichment of associations between all SNPs and the primary phenotype?** All SNPs are represented by the darker blue line. Enrichment is indicated by leftward deflection from the dotted line. If there is not clear enrichment, conjFDR results will not be valid. PLOT A indicates an analysis with inadequate enrichment in the primary trait, likely due to a lack of power in the primary GWAS.
  * If there is adequate enrichment in the primary trait, **is there evidence for cross-trait enrichment?** This is indicated by incremental enrichment in the primary trait as the threshold for statistical significance with the secondary trait decreases, illustrated by successive leftward deflection of the lines. It is important to check the legend in the bottom right corner of the plot to check which line represents which threshold. PLOT B is an example of no cross-trait enrichment (invalid results), PLOT C is an example of good cross-trait enrichment (valid results)
  * It is possible that at lower p-values (both in the primary trait - i.e. higher on the y-axis, and in the secondary trait - i.e. lower thresholds), the number of SNPs decreases such that the analysis is no longer sufficiently powered. This can lead to disruption of the smooth line with irregular, stepped lines deviating from the curve. This is not a problem if there is clear leftwards deflection at higher p-values (PLOT D, valid results). 

**N.B. For conditional FDR it is only necessary to have adequate cross-trait enrichment in one direction (i.e. for the primary trait). For conjunctional FDR it is necessary to have evidence of cross-trait enrichment in both directions.**

{{:biostat:cond_qq_plots_example.png?800|}}

===== Other plots =====
Look up plots and fold enrichment plots can also be used to evaluate model fit and are automatically generated by the pleioFDR code. A detailed description of these plots is beyond the scope of this page, but can be found here in the following review article [[https://doi.org/10.1007/s00439-019-02060-2]]

====== Locus definitions ======
You may want to use different genomic locus definitions depending on the study. In most of our studies we define loci according to FUMA protocol (same definition is also used in recent studies by PGC consortium). The protocol can be summarized as the following:

  - Independent significant genetic variants are identified as variants with condFDR/conjFDR<0.01/0.05 and linkage disequilibrium (LD) r2<0.6 with each other.
  - A subset of these independent significant variants with LD r2<0.1 are then selected as lead variants.
  - For each independent significant variant all candidate variants are identified as variants with LD r2≥0.6 with the lead variant.
  - For a given lead variant the borders of the genomic locus are defined as min/max positional coordinates over all corresponding candidate variants.
  - Loci are then merged if they are separated by less than 250kb.


In practice this can be done with the following two-step procedure:

//**First**// convert ''result.mat'' file produced by [[https://github.com/precimed/pleiofdr|pleiofdr]] into plain text summary statistics file. This can be done with ''fdrmat2csv.py'' script, which is located in [[https://github.com/precimed/python_convert|python_convert]] github repository. Example of command:
<code>
python fdrmat2csv.py path/to/result.mat path/to/9545380.ref
</code>
This will create ''result.mat.csv'' file in the directory where ''result.mat'' is located. Of note, in this example ''9545380.ref'' reference file is used, which is currently the default choice for condFDR/conjFDR analysis. This file can be found on SAGA (''/nird/projects/nird/NS9114K/MMIL/SUMSTAT/misc/9545380_ref/9545380.ref'') and on TSD (''/tsd/p33/data/durable/s3-api/mmil/SUMSTAT/misc/9545380_ref/9545380.ref'').

//**Then**// loci can be identified using ''sumstats.py clump'' script, which is also in [[https://github.com/precimed/python_convert|python_convert]] github repository. Example of command:
<code>
python sumstats.py clump --sumstats path/to/result.mat.csv \
    --clump-field FDR \
    --clump-p1 0.05 \
    --bfile-chr path/to/chr@ \
    --plink path/to/plink \
    --out path/to/output_prefix \
    --exclude-ranges ['6:25119106-33854733', '8:7200000-12500000']
</code>
Choose <nowiki>--clump-p1</nowiki> depending on your threshold (usually 0.01 or 0.05). ''chr@'' files contain 1KG genotypes corresponding to ''9545380.ref'' template. These files can be found on SAGA (''/nird/projects/nird/NS9114K/MMIL/SUMSTAT/misc/9545380_ref/ref9545380_bfile.tar.gz'') and on TSD (''/tsd/p33/data/durable/s3-api/mmil/SUMSTAT/misc/9545380_ref/ref9545380_bfile.tar.gz''). ''plink'' should be plink 1.9 executable, which can be downloaded from [[https://www.cog-genomics.org/plink2/|here]]. Additionally, in the example above two regions (MHC and chr8 inversion) are excluded from the processing using <nowiki>--exclude-ranges</nowiki> parameter. You might want to change this behavior depending on the context of the study. Four files will be created: ''<OUT>.loci.csv'' (containing loci coordinates), ''<OUT>.lead.csv'' (lead variants), ''<OUT>.indep.csv'' (independent significant variants), ''<OUT>.snps.csv'' (candidate variants), where ''<OUT>'' is the value of <nowiki>--out</nowiki> argument.

Also see complete list of available options with descriptions:
<code>
python sumstats.py clump --help
</code>

//**Alternatively**//, instead of using ''sumstats.py clump'', you can upload your condFDR/conjFDR sumstats to FUMA and get loci coordinates and lead variants from the ''GenomicRiskLoci.txt'' file. Due to FUMA limitations (the threshold for significance can not be > 1E-5), you must divide your condFDR/conjFDR values in ''result.mat.csv'' file by 2E5 if your FDR threshold is 0.01 (0.01/2E5 = 5E-8) or by 1E6 if your FDR threshold is 0.05 (0.05/1E6 = 5E-8). Then you can run FUMA with default settings. Of note, p-value column name should be specified correctly (e.g. "FDR"), you should also fill sample size field (the actual value is irrelevant for loci identification, so you may just use "1"). With this approach you will also get annotations (nearest gene, CADD and RDB scores etc.) for all candidate variants (''snps.txt'' file).

====== Post-pleioFDR ======

Before moving onto the next stages, **please check the output of clumping is correct**:
  - It is possible that some merged loci are still present within the "loci" or "lead_snp" tables (indicated in the "is_locus_lead" column). Any loci coded "false" should be deleted.
  - It is possible SNPs which are not true candidate SNPs are included in the snps file. Please filter out all SNPs which do not meet the definition of candidate SNPs (cFDR<0.1, LD>0.6 with independent significant SNP) 

===== Effect direction =====
Before determining the concordance of effect direction, it is essential to ensure that the alleles are aligned i.e. the measure of effect direction (z score/beta/OR) from each primary GWAS refers to the same allele as A1 and the same allele as A2. This cannot be assumed as often A1 and A2 are attributed inconsistently from GWAS to GWAS. Failure to ensure alignment will result in spurious results, often centring around 50% concordance. 

Pre-aligned z-scores can be extracted from either:
  * .mat pleiofdr output file (all SNPs)
  * “xxx_xxx_zcore_conjFDR_xxxx_all” pleiofdr output file (only independent significant SNPs)

In order to present this data in the supplementary tables, z-scores and beta values/OR must be extracted from the original GWAS sumstats. In this case, **it is essential to ensure alleles and effect directions are aligned**. The process for doing this is:
  - Identify which is the effect allele in the original sumstats (n.b. this might not be A1 - check the readme file for your sumstats!)
  - Cross-reference A1 from clumping against effect allele in original sumstats in TRAIT 1
  - If A1 from clumping IS NOT the same as effect allele in original sumstats, flip the effect direction in z score and beta/OR value
  - Repeat process for Trait 2

Once allele alignment has been confirmed, concordance of effect direction (either same or opposite) is determined **in lead SNPs**. This is because, while we cannot be certain that a lead SNP is a causal SNP, it is the SNP which is most likely to have an effect direction which is representative of the causal SNP. In comparison, assessing effect direction in candidate SNPs is inappropriate since:
  * It is noisier
  * Biased towards loci with more candidate SNPs (the number can vary substantially from locus to locus) 

If the proportion of lead SNPs with same/opposite effect directions is discrepant with what would be expected from genetic correlation and MiXeR results it may be suitable to:
  * Double check allele alignment
  * Check the alignment of effect direction within the candidate SNPs to see how consistent the effect directions are
  * Check effect direction in independent significant SNPs since this may be indicative of multiple causal SNPs within a locus.

===== Novelty checking =====

Novelty checking is the process of systematically cross-referencing your identified loci against previously reported loci/SNP associations to identify the loci among your findings which are novel for each phenotype.

It is very difficult to ensure you have checked all possible studies and so a degree of pragmatism is required when approaching this task. A minimum aim should be to check against all major GWASs, **all major secondary GWAS analyses (e.g. STAR, MTAG), and all previous cond/conjFDR studies** (to ensure we are not reporting novelty twice within our group). It is also possible to use **GWASCatalog** (see below).

==== Novelty checking database ====
To help with the process we have built a novelty checking database which aims to list all loci previously reported from either **major GWASs, previous cond/conjFDR studies, and other major studies** for phenotypes regularly used within the group. Please refer to [[biostat:novelty_checking_database|Novelty checking database]] for a detailed description of the database. 

The main steps are:
  - Identify all relevant studies to be included in the check. **N.B. if you are using the database do not assume that all studies are present.** Please check the description of what has been included (as some phenotypes are incomplete), conduct an updated literature search from the date of the last update to the present day (to ensure there are not any new studies that should be included) and double check list at the top for any obvious missing studies.
  - Extract the list of loci reported from each study and collate on one list (or add to relevant tab in novelty checking database). How findings have been reported may differ from paper to paper. Please use the following logic when extracting the data:
    - If LD informed locus is reported, with min and max base pairs, extract as reported
    - If only lead SNP is reported → create min and max base pairs by adding and subtracting 1,000,000 to lead SNP base pair position
    - If alternative genome builds have been used, then use lift-over function using one of the following tools:
      * [[http://www.ensembl.org/Homo_sapiens/Tools/AssemblyConverter]] (web-based) 
      * [[https://genome.ucsc.edu/cgi-bin/hgLiftOver]]
    - If only rsIDs of lead SNPs are provided → use the following resource to generate BP position + generate min/maxBP (BP position +/- 1million BPs) 
      * [[https://www.ncbi.nlm.nih.gov/genome/tools/cyto_convert/]]
    - If a scenario is encountered that is not described above → please discuss with the  group.

Please help to maintain the database by updating literature searches and adding new phenotypes which have not been included before. This involves very little extra work in addition to what is required anyway, will save everyone time in the future, and ensure we can check for novelty to a high standard!

==== GWAS Catalog ====

To check your results against GWAS catalog, you need to take into account that GWAS catalogue only reports significant SNPs, not loci. There are therefore two possible approaches to using GWAS Catalog:
  - Extract all relevant SNPs associated with your phenotypes of interest and +/- 1,000,000 BPs to create a conservative BP range and use the script above to check for overlap
  - Use candidate SNPs i.e. any locus with a candidate SNP that is already present in the GWAS Catalog is defined non-novel, and any locus without any candidate SNPs that are present in GWAS Catalog are novel. 

**Be aware that, while FUMA have a GWAS Catalog output, this does not use the most up to date version of GWAS Catalog so it is advisable to perform this step through GWAS Catalog itself.** 

**Once you have extracted the loci and GWAS Catalog SNPs accordingly, these two steps can be run using the following python script, as part of the process of compiling the tables:**

https://github.com/precimed/pleiofdr/tree/master/fuma (please contact Zillur with any script related questions - zillurbmb51@gmail.com)

===== Overlapping loci across analyses =====

If you have run several conditional/conjunctional analyses as part of the same project, **it may be of interest to identify loci which are overlapping across analyses.** This can be done:
  * Manually - i.e. check min/maxBPs across analyses, or 
  * Using the novelty checking code, taking the “overlapping” output file to be the overlapping loci and disregarding the nove_loci output file. 

[[https://github.com/precimed/yunhoop/tree/master/tools/novelty]]

While this may be interesting to point out when describing the results, it is important to note that this only tells you that the loci are physically overlapping, **it does not mean the locus is associated with all three phenotypes.** It is possible to formally test for association across three phenotypes using **three-way conjunctional FDR analysis.** Following the logic of conjFDR, this works by performing conjFDR across the 3 permutations. The maximum conjFDR value from all 3 is interpreted as a conservative estimate of the joint association between the SNP and all three phenotypes. Note that this results in a loss of power and so the number of loci identified is likely to be substantially lower than those which are simply overlapping. 

A less rigorous method to identify overlapping loci which are likely to influence all three phenotypes is by highlighting overlapping loci which share lead SNPs. 

===== Replication of conjFDR results =====
Some form of replication is increasingly expected from reviewers and helps to improve the quality of our papers. 
In order to perform a replication analysis, **you must first identify a sample for your phenotype of interest which is independent to BOTH YOUR CONJFDR SAMPLES**. I.e. there cannot be any sample overlap with either sample included in your cond/conjFDR analysis. Useful resources for replication samples include:
  * MVP
  * FinnGen
  * UKB
  * 23andMe

The gold standard for replication is a p-value based procedure:

  - Extract all p-values for lead-snps from the replication GWAS
  - Identify all lead-SNPs with a p-value <0.05 after bonferonni correction for the number of lead SNPs tested
  - If a lead SNP is not present in your replication sample, it is possible to identify a surrogate lead SNP which is in LD with the LD SNP (i.e. next most significant candidate SNP)

Since our replication cohorts are often very small, this approach is unlikely to provide very convincing evidence of replication. It is therefore often necessary to **test for en masse replication using sign concordance**. The procedure is as follows:

  - Extract effect size and pval for each lead SNP from replication dataset
  - If lead SNP is missing in replication data, replace lead SNPs with next most significant candidate SNP
  - QC that alleles + effect directions are correctly aligned in discovery and replication dataset
  - Count no. of lead SNPs with concordant effects
  - Run exact binomial test using the following script in r:
 
binom.test(x, y, 0.5, alternative="less")$p.value
 
Where x =  no. NON-concordant lead SNPs and y = total no. of lead SNPs.
 
An example of how this can be described in the text is as follows (please ref. Bahrami et al. 10.1093/brain/awab267)

**Replication of condFDR and conjFDR significant loci in independent samples**

//Since the probability of replicating individual loci at genome-wide significance is low due to weak genetic effects, we first tested for en masse sign concordance of effect direction in primary and replication samples, in line with previous literature.48–50  For condFDR analyses, there was sign concordance if the lead SNP had concordant effect directions in the primary and replication migraine samples. For conjFDR, there was sign concordance if the lead SNP possessed concordant effect directions for both phenotypes. If lead SNPs were missing in the independent datasets for conjFDR analyses, we replaced them with the next most significant candidate SNP which was present in the independent datasets. We employed a one-sided exact binomial test of significance under the null hypothesis that sign concordance was randomly distributed.48–50 We also identified lead SNPs (or next most significant candidate SNPs) which were nominally significant at p<0.05 in each independent sample.48,50//


===== FUMA functional annotations =====

In order to functionally characterise loci and identify putative causal SNPs, we functionally annotate all candidate SNPs using FUMA’s SNP2GENE function which can be found here [[https://fuma.ctglab.nl/.]] 

**Rationale:** Although the lead SNP is the most strongly associated SNP, this is not proof of causation. Any SNP within LD of the causal SNP (r2>0.6) and associated with both traits (cond/conjFDR<0.1) could be the causal SNP. By having a low threshold for candidate SNPs (i.e. cFDR<0.1), we are trying to ensure that we do not miss putative causal SNPs, but this means we often have large numbers of candidate SNPs for each locus.

==== Running FUMA ====
 
In order to prepare your file for functional annotation you must:
  - Select “conj.result.clump_x.xx.snps.csv” file (where x.xx indicates significance threshold selected)
  - Convert file to a tab-separated .txt file
  - Filter out all SNPs that do not meet the definition of candidate SNP (i.e. cFDR<0.1 and LD>0.6 with independent significant SNP)
  - FUMA will only annotate SNPs with p-values <5*10-8, as it is built to annotate primary GWAS data (not conjFDR results). An “artificial p-value” must therefore be generated in your “.snps.csv” file. The value is not important so you can either insert “5*10-8” for all rows or generate a new value by multiplying the FDR value by 10-9 
  - Select SNP2GENE in the top panel and upload file under “upload input files” sub-heading
  - Insert relevant columns names for “Chromosome” and “Position” (it is also possible to use “rsID” although this will result in fewer annotations), as well as “P-value” (i.e. artificial p-value). It is not necessary to insert the other values but it is advisable to include “effect allele” if present in your input file. 
  - Insert number into “Total sample size” box. This does not affect the results as we do not use MAGMA analysis, so any number e.g. 50,000 can be used. 
  - Leave standard settings for “Parameters for lead SNPs and candidate SNPs” - this is not relevant as we will use the clumping output for lead SNP/candidate SNP definitions
  - Please see Gene mapping section for gene-mapping settings
  - Click run

==== FUMA Output ====
FUMA will output several different files. **The “snps.txt” file contains relevant functional annotations for all the candidate SNPs provided as inputs.** 

There is no “gold standard” for how functional annotations should be presented in any given manuscript, and might depend on other factors such as how many loci have been discovered. I.e. if there are few loci, it may be easier to describe the relevant annotations for each one. If you have many loci, it may only be possible to present an overview. 

Typically, the following annotations are commented on in the main manuscript:
  * **func** is short for functional category. This categorises each SNP according to 12 different functional categories e.g. exonic, intronic, intergenic etc. This is often presented as a bar chart demonstrating the distribution of all candidate SNPs across each category. This can be informative in demonstrating the relative proportion of SNPs in non-coding vs coding regions. Candidate SNPs within exons can also be selected for further evaluation and may warrant discussion in the text.
  * **CADD** predicts how deleterious the SNP effect is on protein structure/function ([[https://doi.org/10.1093/nar/gky1016]]). Score >12.37 considered deleterious (although also possible to use >10 and >20 as thresholds). Candidate SNPs with high CADD scores can be selected for further characterisation/discussion       	
  * **RDB:** The scores relate to the following features related to the probability of the SNP having a regulatory function (TF = transcription factor). Lower scores, particularly 1F and below, indicate SNPs with probable regulatory functions and may warrant further evaluation/discussion ([[https://doi.org/10.1101/gr.137323.112]])
  * **MinChrState:** Minimum chromatin state across 127 tissues at a given SNP locus. Scores of 1-7 are suggestive of open chromatin state, and may warrant further evaluation/discussion (DOI:10.1038/nature14248)
  * **Non-synonomous exonic SNPs:** Identifies SNPs for which allele substitution causes a change in the amino acid sequence of the protein. Functional consequence of the SNP on the gene obtained from ANNOVAR. This information is provided in the “annov.txt” file under the column “exonic_func”. Synonymous means no change in amino acid sequence, non-synonymous means change in sequence.




===== FUMA gene mapping =====

The gold standard for gene-mapping is now to use candidate SNPs as this increases the probability that the “causal” gene(s) for each locus is captured. It is important to note, however, that this leads to increased false positives which is relevant for gene-set and pathway analyses. If you have identified many loci, it may therefore be advisable to rerun with lead SNPs or independent significant SNPs. 

SNPs are mapped to putative “causal” genes by three strategies within FUMA’s SNP2GENE functionality:
 
==== Positional mapping ====
Positional mapping maps each SNP to all genes within a certain number of base pairs from the SNP. FUMA sets the maximum distance from each SNP at 10kb as standard, which we have applied for all analyses.

==== eQTL mapping ====
eQTL-mapping maps SNPs to genes whose expression is associated with the SNPs allelic variation. There are two important features to eQTL mapping: 
  - Since eQTL mapping is only based on expression level, it is possible that SNPs will map to genes which are geographically distant from the SNP, including on other chromosomes. All other settings under FUMA are left as standard.
  - **eQTL-mapping is dependent on which eQTL database is selected.** This allows you to restrict eQTL mapping to biological tissues/cell-types which are relevant to your phenotypes of interest. This is done in FUMA under the “Tissue Types” tab.  The databases you select are highly phenotype dependent, so some care should be taken when selecting them. Please discuss with a senior group member if you are unsure. For standard psychiatric analyses the following are commonly chosen:
    * BRAINEAC (ALL)
    * GTEx v8 Brain (ALL)
    * CommonMind Consortium (ALL)

==== Chromatin interaction mapping ====
Chromatin interaction mapping maps SNPs to genes which are predicted to interact due to chromatin’s 3D structure and the ensuing DNA-DNA interactions. This is because a SNP might influence a gene’s expression that appears distant to the gene according to the primary DNA sequencing, but is in fact nearby when DNA forms a 3D structure. As with eQTL mapping, this means geographically distant genes can be mapped, and **this is dependent on tissue/cell type since chromatin structure is dependent on the epigenetic profile of the cell which is cell type-specific.** 

As with eQTL-mapping, it is therefore possible to select cell-types/tissue types within FUMA under the “Builtin chromatin interaction data” tab which are relevant for your analysis. The databases you select are also highly phenotype dependent. Please discuss with a senior group member if you are unsure. For standard psychiatric analyses you may choose to select some or all of the following:
  * HiC/GSE87112/Dorsolateral_Prefrontal_Cortex
  * HiC/GSE87112/Hippocampus
  * PsychENCODE EP links (one way)
  * PsychENCODE Promoter anchored loops
  * HiC(Giusti-Rodriguez et al. 2019) Adult Cortex
  * HiC(Giusti-Rodriguez et al. 2019) Fetal Cortex


===== FUMA genesets =====

FUMA is applied to perform Functional enrichment analysis. Related informations are listed below:

- Function: GENE2FUNC https://fuma.ctglab.nl/gene2func. Please note that since certain mapped genes may need to be excluded manually, this cannot be run directly from the SNP2GENE output page. Instead, you need to click on the "GENE2FUNC" tab at the top of the page. 

- Genes of interest (input): 
  * mapped genes from SNP2GENE function
  * genes located in a region with complex LD pattern should be removed from input. As for regions with complex LD pattern, please see more details in "//Exclusion regions//" part. You should choose exclusion regions based on your target phenotype. 
  * Genes should then be copy and pasted into the "Paste Genes" box

- Background genes:
  * background genes (all, protein coding, etc.) need to match the gene type in SNP2GENE annotation (i.e., the parameter you choose in “4.Gene type” in SNP2GENE).

- Other optional parameters
  * Ensembl version: v92 (edit: Oct.2020)
  * Gene expression datasets: GTEx v8
  * Exclude the MHC region: genes located in regions with complex LD pattern should be removed before input.
  * multiple test correction method: BH FDR correction (default)
  * Maximum adjusted P-value: 0.05 (default)
  * Minimum overlapping genes with gene-sets (≥): 2 (default)
===== Compiling data in supplementary tables =====

This is a list of tables that are typically included as supplementary data, with a description of the composite data. The order and content is by no means fixed and should be adapted to each manuscript.

  - **Conditional FDR loci (in each direction)**
    * Loci + lead SNP information from clumping output “cond.result.clump_x.xx.loci.csv”
    * For each lead SNP:
      * Z-scores from pleioFDR output file “…_zscore_conjfdr_x.xx_all.csv” or primary GWAS sumstats file
      * Functional annotations from FUMA output “snps.txt”
      * Effect sizes (ORs/beta), SDs + p-values for each input phenotype from primary GWAS sumstats file. As described in “Effect direction” it is essential that alleles are aligned when performing this step. 
    * Loci overlapping within project (see “Overlapping loci across analyses”)
    * Novel loci (see “Novelty checking”)
  - **Conjunctional FDR loci table**
    * Same as above with addition of "effect direction" and possibly "overlapping loci across analysis"
  - **Conjunctional FDR candidate SNPs table** (N.B. we typically do not include conditional FDR candidate SNPs, but this may be preferable if conditional FDR analyses are the focus of the paper)
    * SNP information + conjFDR values from clumping output file “conj.result.clump_x.xx.snps.txt”
    * Z-scores for each input phenotype from pleiFDR output  “…_zscore_conjfdr_x.xx_all.csv” or primary GWAS sumstats file.
    * Effect sizes (ORs/beta), SDs + p-values for each input phenotype from primary GWAS sumstats file.  As described in “Effect direction” it is essential that alleles are aligned when performing this step
    * Functional annotations from FUMA output file “snps.txt”
  - **Mapped genes** - From FUMA SNP2GENE output file genes.txt
  - **Gene ontology gene-sets** - From FUMA GENE2FUNC output file “GSt.xt”
  - **Pathways** - From FUMA GENE2FUNC output file “GS.txt”
  - **Significant brain eQTLs** - From FUMA SNP2GENE output file “eqtl.txt”

It is possible to compile some of these tables using the following script and relevant input files:
   
https://github.com/precimed/pleiofdr/tree/master/fuma (please contact Zillur with any script related questions - zillurbmb51@gmail.com)

**N.B.** This is a brief overview and the script may not work flawlessly depending on differences in the input data. It is therefore important to carefully quality control these tables before submission.
 
Some journals do not accept .xls file formats. In this case, it may be necessary to copy the tables into a word document or print as .pdf files. This is a very boring and time-consuming job. Work-arounds include excluding candidate snp files, explaining that they are available on request from the authors, or asking editors to make exceptions for inclusion of .xls files. Shahram was successful with JAMA Psychiatry on one occasion!

===== Figures for final manuscript =====

When presenting the analysis for publication, the following plots are typically included:
  * **Conditional Q-Q plot.** - either in one or both direction. It is helpful to present the QQ plots as they demonstrate the principles underlying the analysis and can be used as evidence of “cross-trait enrichment”. It is also reasonable to include in the supplementary information if required. 
  * **Conditional/conjunctional Manhattan plots** - it is possible to generate Manhattan plots of either conditional or conjunctional FDR results. These plot FDR values against chromosomal position for each SNP, emphasising lead SNPs (larger circles with black outline) and independent significant SNPs (larger circles), thus illustrating the chromosomal distribution of significantly associated loci.  Manhattan plots are based on clumping output after cond or conjFDR analysis following steps:

  - Transform result.mat from fdr analysis to csv format (results.mat.csv ) 

<code>
python fdrmat2csv.py result.mat 9545380.ref
</code>  
  
  * Clump the FDR results with threshold condFDR 0.01 and conjFDR 0.05. 
  * Make Manhattan plots using the following code.

<code>
python manhattan.py result.mat.csv --lead conj.result.clump.lead.csv --indep conj.result.clump.indep.csv --p FDR --y-label conjFDR --legend-label ‘Trait1 & Trait2’ --legend-location ‘upper right’ --color-list 1 --p-thresh 0.05 --out fdr_manhattan.png
</code>  
  
  * A more detailed description of available input arguments can be obtained with

<code> python manhattan.py --help
</code>

It is also possible to label loci in Manhattan plots with mapped genes/genes of interest. This	must be done manually using a vector graphics software such as inkscape. In order to edit figures manually, the files should be saved in .svg format in matlab.

====== LAVA analysis ======

LAVA analysis ( https://github.com/josefin-werme/LAVA ) is an emerging technique, we haven't yet compared how aligned are the results between LAVA, MiXeR and conjFDR - and how to interpret the differences. But just in case you want to run LAVA here is some info.

1. LAVA is included in CoMorMent containers. All codes and data are already copied to TSD, both p33 and p697, in /cluster/projects/pNN/github/comorment/containers  . 

You may execute LAVA's tutorial by stepping into /cluster/projects/pNN/github/comorment/containers/reference/examples/lava, and executing the steps:

<code>
export COMORMENT=/cluster/projects/pNN/github/comorment
singularity shell --bind /cluster/projets/pNN --home $PWD:/home $COMORMENT/containers/singularity/r.sif 

Rscript lava_script.R "vignettes/data/g1000_test" "vignettes/data/test.loci" "vignettes/data/input.info.txt" "vignettes/data/sample.overlap.txt" "depression;bmi" "depression.bmi"
</code>

NB! lava_script.R has non-default argument 1e-4 in run.univ.bivar() step. It might be bad idea - instead it might be more reasonable to changed this to 0.05. This threshold is for the univariate test for heritability. It should be fine that all loci deemed nominally significantly heritable (p=0.05) can be included in the bivariate analyses. This maintains a good amount of loci for the local genetic correlation. 

This is described here: https://github.com/comorment/containers/tree/main/reference/examples/lava

Remember that any data you create within /cluster/projects/pNN/github will be overwritten by an automated sync job that keeps those folders in sync with github . Please figure out how to run the analysis outside of the github folder.

2. To run a real analysis you need

(1) create an input.info.txt which may look like this

<code>
henotype       cases   controls        filename
SCZ     53386   77258   /cluster/projects/p697/projects/SUMSTATv2/nomhc/PGC_SCZ_0518_EUR.sumstats.gz
BIP     41917   371549  /cluster/projects/p697/projects/SUMSTATv2/nomhc/PGC_BIP_2019_wave3.sumstats.gz
MDD     135458  344901  /cluster/projects/p697/projects/SUMSTATv2/nomhc/PGC_MDD_2018_with23andMe.sumstats.gz
ADHD    19099   34194   /cluster/projects/p697/projects/SUMSTATv2/nomhc/PGC_ADHD_2017_EUR.sumstats.gz
COG     1       0       /cluster/projects/p697/projects/SUMSTATv2/nomhc/CTG_COG_2018.sumstats.gz
EDU     1       0       /cluster/projects/p697/projects/SUMSTATv2/nomhc/SSGAC_EDU_2018_no23andMe.sumstats.gz
NEUR    1       0       /cluster/projects/p697/projects/SUMSTATv2/nomhc/CTG_NEUR_2018_no23andMe.sumstats.gz
SWB     1       0       /cluster/projects/p697/projects/SUMSTATv2/nomhc/23andMe_SWB_2018.sumstats.gz
HEIGHT  1       0       /cluster/projects/p697/projects/SUMSTATv2/nomhc/GIANT_HEIGHT_2018_UKB.sumstats.gz
</code>

(2) prepare sample.overlap.txt file, by putting together relevant set of LDSR intercepts (assuming you already ran those analyses) using a python script that may look like this :

<code>
import pandas as pd
import numpy as np

df=pd.read_csv('/cluster/projects/p697/projects/SUMSTATv2/ANALYSIS/ldsr_rg.csv', delim_whitespace=True)
df['p1'] = [x.split('/')[-1].replace('.sumstats.gz', '') for x in df['p1']]
df['p2'] = [x.split('/')[-1].replace('.sumstats.gz', '') for x in df['p2']]
traits='PGC_SCZ_0518_EUR:SCZ PGC_BIP_2019_wave3:BIP PGC_MDD_2018_with23andMe:MDD PGC_ADHD_2017_EUR:ADHD CTG_COG_2018:COG SSGAC_EDU_2018_no23andMe:EDU CTG_NEUR_2018_no23andMe:NEUR 23andMe_SWB_2018:SWB GIANT_HEIGHT_2018_UKB:HEIGHT'.split()
traits=dict([t.split(':') for t in traits])
df=df[df['p1'].isin(traits.keys()) & df['p2'].isin(traits.keys())].copy()
df['p1']=df['p1'].map(traits)
df['p2']=df['p2'].map(traits)
df=pd.concat([df, df.rename(columns={'p1':'p2', 'p2':'p1'})])
pd.pivot_table(df[['p1', 'p2', 'gcov_int']], index='p1', columns='p2', values='gcov_int').reset_index(drop=False).fillna(1).rename(columns={'p1':''}).to_csv('/cluster/projects/p697/users/ofrei/lava/sample.overlap.txt',sep=' ', index=False)
</code>

(3) a script to trigger a bunch of jobs, e.g. like this
Note that each LAVA analysis takes a day or slightly more.

<code>
template = """#!/bin/bash
#SBATCH --job-name=lava
#SBATCH --account=p697
#SBATCH --time=96:00:00
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=8000M
#SBATCH --cpus-per-task=4
##SBATCH --array=1-1

source /cluster/bin/jobsetup

module load singularity/3.7.1
export COMORMENT=/cluster/projects/p697/github/comorment
export SINGULARITY_BIND="/cluster/projects/p697"
export SIF=$COMORMENT/containers/singularity
export Rscript="singularity exec --home $PWD:/home $SIF/r.sif Rscript"

export g100eur="/cluster/projects/p697/github/comorment/containers/reference/magma/g1000_eur/g1000_eur"
export locfile="/cluster/projects/p697/github/comorment/containers/reference/lava/blocks_s2500_m25_f1_w200.GRCh37_hg19.locfile"

set -o errexit

$Rscript lava_script.R $g100eur $locfile "input.info.txt" "sample.overlap.txt" "{t1};{t2}" "{t1}.{t2}"
"""

import itertools
runs=list(itertools.combinations(traits.values(), 2))
for t1, t2 in runs:
    with open('/cluster/projects/p697/users/ofrei/lava/jobs/{}_vs_{}.job'.format(t1, t2), 'w')  as f:
        f.write(template.format(t1=t1, t2=t2))
with open('/cluster/projects/p697/users/ofrei/lava/jobs.list', 'w')  as f:
    for t1, t2 in runs:
        f.write('sbatch jobs/{}_vs_{}.job\n'.format(t1,t2))
print(f'{len(runs)} analyses in total. Trygger the analyses by running the following command:')
print(f'cat jobs.list | bash')
</code>


====== mtCOJO ======
Multi-Trait-Based Conditional and Joint Analysis: estimate the effect of an exposure(s) on a trait conditioning on (adjusting for) the exposure(s). Potentially useful addition to pleioFDR analyses. Download and instructions found here: [[https://yanglab.westlake.edu.cn/software/gcta/#mtCOJO]]. Some basic slides on method and implementation [[https://drive.google.com/file/d/1ljXeeQ_z7TU2wjoxOTiZfMXbxb-kbqjm/view?usp=sharing]]
