====== Combine pleiofdr and FUMA annotations ======

Pleiotropy-informed conditional and conjunctional false discovery rate ([[https://github.com/precimed/pleiofdr|pleiofdr]]) is used quite intensively to allow loci discovery in related phenotypes. Once we run the code in MATLAB for any two phenotypes, we obtain a loci file, along with other files. We have additional way of clumping to get a loci file from the result of MATLAB run.\\

Then we use the loci file for annotations of genes, and other related analyses from online services like [[https://fuma.ctglab.nl/|FUMA]]. The results are usually shown in the supplementary tables in the article as a spreadsheet.\\

Supplementary tables in cond/conjFDR papers come in various forms. Mostly they are in Microsoft Excel formats, 10's of files even for comparison of 2-3 traits. Each of the files contains 100's of rows and 10's of columns. Loci associated with 10's of traits are becoming usual these days.\\

Majority of the files are meant to show number of loci associated with a primary trait at condFDR <0.01 given association with a secondary trait, or number of shared loci between two traits at a certain conjFDR level (usually <0.05).  Following Python script is //an example of how to calculate associated loci between two traits (LONELY and SCZ), that can be extended to other traits//.  

** Load only 2 Python libraries**

<code>
import numpy as np
import pandas as pd
</code>

**Load the clumped loci file with FDR values**

Once one gets the output put (result.mat) from MATLAB cond/conjFDR run, [[https://github.com/precimed/python_convert| python_convert ]] is used to convert it to a corresponding csv file. Then clumping is performed (procedure described [[https://github.com/precimed/python_convert| here ]])

<code>
df1= pd.read_csv('results_LONELY_SCZ_condFDR_clump.loci.csv', sep='\t')
</code>
This is the file that was used as input in FUMA analysis,right? For example, I used cond.BMI.result.clump_0.05.lead.csv as FUMA input, so, my df1 would be pd.read_csv('cond.BMI.result.clump_0.05.lead.csv',sep='\t') ?

**Load [[https://fuma.ctglab.nl/|FUMA]] annotated files **

<code>
# Load snps.txt from annotated folder, this file contains most if the columns
df2= pd.read_csv('Loneli_SCZ_condFDR/snps.txt', sep='\t')

# Take only important columns from snps.txt
df2_cols= ['uniqID','rsID', 'nearestGene', 'dist','CADD', 'RDB',  'minChrState','commonChrState','effect_allele','non_effect_allele']
df2= df2[df2_cols]

# load ANNOVAR file, it contains gene annotation
df3= pd.read_csv('Loneli_SCZ_condFDR/annov.txt', sep='\t')

# Take only 2 columns from ANNOV
df3_cols=['chr', 'pos', 'annot']
df3 = df3[df3_cols]

</code>

**Load the STD summary statistics files**

<code>
df_lonely=pd.read_csv('UKB_LONELY_2018_Loneliness.sumstats.gz', compression='gzip', sep='\t', header=0)
df_SCZ = pd.read_csv('PGC_SCZ_2014_EUR.sumstats.gz', compression='gzip', header=0, sep="\t")
colsp= ['SNP','PVAL', 'Z']
df_lonely_p= df_lonely[colsp]
df_SCZ_p= df_SCZ[colsp]

</code>

** Combinations**

<code>
df1_df2 = pd.merge(df1, df2, how= 'left', left_on='LEAD_SNP', right_on='rsID')
df1_df2_df3 = pd.merge(df1_df2, df3,  how='left', left_on=['CHR','LEAD_BP'], right_on = ['chr','pos'])
df4= df1_df2_df3.drop_duplicates(subset=[ 'CHR','LEAD_BP'])
df4= df4.drop(['rsID', 'chr', 'pos'], axis = 1) 

</code>

** Merge the with sumstats to get pval and effect sizes**

<code>
df5= pd.merge(df4,df_lonely_p, how='left', left_on='LEAD_SNP', right_on='SNP' )
df6= pd.merge(df5,df_SCZ_p, how='left', left_on='LEAD_SNP', right_on='SNP', suffixes=('_LONELY', '_SCZ') )

df7= df6.drop([ 'SNP_SCZ', 'SNP_LONELY', 'uniqID'], axis=1)
df7.rename(columns={'locusnum':'GenomicLocus',
                          'FDR':'condFDR',
                          'PVAL_LONELY':'LONELY_pval',
                           'Z_LONELY': 'LONELY_Z',
                        'PVAL_SCZ':'SCZ_pval',
                            'Z_SCZ': 'SCZ_Z',
                           'annot': 'Function',
                           'dist': 'Distance'}, 
                 inplace=True)

df8_cols= ['GenomicLocus','CHR','LEAD_SNP', 'LEAD_BP', 'MinBP', 'MaxBP', 'condFDR' ,
'effect_allele', 'non_effect_allele','LONELY_pval','LONELY_Z', 'SCZ_pval','SCZ_Z',
'nearestGene', 'Distance', 'Function', 'CADD', 'RDB', 'minChrState', 'commonChrState']

df8= df7[df8_cols]

</code>

** Processing for LONELINESS loci **

This file is downloaded from the original paper on [[https://www.nature.com/articles/s41467-018-04930-1|LONELINESS GWAS]] 

<code>
df8.to_excel('Supplement_CondFDR/LONELY_SCZ_condFDR.xlsx', index=False)

# define the loci as BP=250K, both up- and downstream 
min250=df8.MinBP-250000
max250=df8.MaxBP+250000


df_loneli_loci=df_loneli['MTAG (3 Strata Loneliness composite)']
df_loneli_loci= df_loneli_loci[ [ 'chr', 'bpos']]

loneli_chr= df_loneli_loci.chr
loneli_bp= df_loneli_loci.bpos

</code>

** Here is the Secondary (SCZ) trait loci **
<code>
cond_chr= list(df8.CHR)
cond_range = list(zip(min250, max250 ))


loci_col=['Loci in LONELY GWAS', 'MinBP']
dataM=[]
for i in range(len(loneli_bp)):
    for j in range(len(cond_range)):
        if (cond_range[j][0] <= loneli_bp[i] <= cond_range[j][1]) & (cond_chr[j]==loneli_chr[i]):
            loci= str(loneli_chr[i])+":"+str(loneli_bp[i])
            minBP= cond_range[j][0]+250000
            maxBP = cond_range[j][1]-250000
            dataM.append([loci,minBP])
            
df_loci = pd.DataFrame(dataM, columns=loci_col)
</code>

** Merge the dataframes and output in an Excel file **

<code>
df_final= pd.merge(df8,df_loci, how='left', on='MinBP' )
df_final.to_excel('LONELY_SCZ_condFDR.xlsx', index=False)
</code>