====== How to use the genetic data ======

This page describes best practices for using the in-house genetic data in statistical analyses. All information relevant to the topics on this page can be accessed through the TSD-database or directly at

  /cluster/projects/p33/groups/biostat/genetics/imputed/cur/all.bio.txt

This file (dubbed "biofile" hereafter) also contains the IDs (currently ''000UID'') under which individual genetic data can be retrieved in the [[tsd_directory_structure#biostat_details|genotype files]].


===== Extracting individuals =====

  * The first step for extracting individual data from the genotype files is to find the correct identifiers. Using your favourite spreadsheet viewer/editor, map your individual IDs to their corresponding genetic IDs in the TSD-database or using the biofile.
  * The list of genetic IDs thus obtained can then be used with bcftools/plink to extract individual genotypes from the bcf/vcf/bed-bim-fam genotype files. Consult the bcftools/plink documentation to get usage instructions and additional options.
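
The ID mapping step above can also be scripted. The following is a minimal sketch, assuming the biofile is tab-separated and has a header row; the column name ''study_id'' is hypothetical, and ''000UID'' is assumed to be the name of the genetic ID column. Adjust both to the actual file.

```python
import csv
import io


def map_to_genetic_ids(biofile, study_ids, id_col="study_id", gen_col="000UID"):
    """Map study IDs to genetic IDs using an open biofile handle.

    id_col and gen_col are assumed column names, not confirmed ones.
    Returns the genetic IDs found and the study IDs that could not be mapped.
    """
    reader = csv.DictReader(biofile, delimiter="\t")
    # Skip rows without a genetic ID (individuals that were not genotyped).
    lookup = {row[id_col]: row[gen_col] for row in reader if row[gen_col]}
    found = [lookup[s] for s in study_ids if s in lookup]
    missing = [s for s in study_ids if s not in lookup]
    return found, missing


# Toy stand-in for the biofile; in practice open the real file instead.
demo = io.StringIO("study_id\t000UID\nA1\tG001\nA2\tG002\nA3\t\n")
genetic_ids, missing = map_to_genetic_ids(demo, ["A1", "A3", "A9"])

# One genetic ID per line, ready to be written to a sample list file.
sample_list = "\n".join(genetic_ids) + "\n"
```

A file with one ID per line is the format expected by the ''bcftools view'' option ''-S''/''--samples-file''. The plink ''--keep'' option expects two columns (family ID and individual ID) instead, so the list needs to be adapted accordingly.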


===== Genotyping batches =====

The genetic data come in separate batches. Even though considerable effort is put into homogenization during the quality control procedures, slight but systematic deviations across batches may still occur in the finished product (the imputed genetic data) and propagate from there to the polygenic scores. The genotyping batch must therefore be accounted for whenever feasible, for example as a covariate in the analyses.
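
One common way to account for batch is to include it as a categorical covariate in a regression model. The sketch below shows dummy (treatment) coding of a batch variable using only the standard library; the batch labels are made up, and the choice of reference level (first in alphabetical order) is an assumption made for illustration.

```python
def batch_dummies(batches):
    """Dummy-code a list of batch labels.

    Returns the reference level, the remaining levels, and one row of
    0/1 indicators per individual (one column per non-reference level).
    """
    levels = sorted(set(batches))
    reference, others = levels[0], levels[1:]
    rows = [[1 if b == lvl else 0 for lvl in others] for b in batches]
    return reference, others, rows


# Hypothetical batch labels for four individuals.
reference, columns, design = batch_dummies(["b1", "b2", "b3", "b1"])
```

The indicator columns can then be appended to the covariate matrix of the model. Most statistical packages perform this coding automatically when the batch variable is declared as a factor.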


===== Sex issues =====

The QC pipeline uses the genetic data to infer each individual's sex. The inferred sex is reported in the biofile (see above) and can be compared with the sex declared/annotated when the individual was included. Any individual with a discrepancy between annotated and inferred sex should be excluded from all analyses.
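
The exclusion rule above can be sketched as a simple filter. The column names ''sex_annotated'' and ''sex_inferred'' are hypothetical, as is the coding of the values; missing values are not treated specially here and should be handled according to how they are coded in the actual file.

```python
def split_by_sex_concordance(rows, annotated="sex_annotated", inferred="sex_inferred"):
    """Split records into those with matching and mismatching sex.

    rows is a list of dicts, e.g. as produced by csv.DictReader.
    The column names are assumptions, not the confirmed biofile headers.
    """
    keep, discordant = [], []
    for row in rows:
        if row[annotated] == row[inferred]:
            keep.append(row)
        else:
            discordant.append(row)
    return keep, discordant


# Toy records with made-up IDs and sex codes.
records = [
    {"id": "G001", "sex_annotated": "F", "sex_inferred": "F"},
    {"id": "G002", "sex_annotated": "M", "sex_inferred": "F"},
    {"id": "G003", "sex_annotated": "M", "sex_inferred": "M"},
]
keep, discordant = split_by_sex_concordance(records)
kept_ids = [r["id"] for r in keep]
excluded_ids = [r["id"] for r in discordant]
```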

===== Relatedness =====

The QC pipeline annotates any detected relationship of roughly third degree or closer (pi-hat ≥ 0.1). This information is reported in the biofile (see above) alongside any declared/annotated relationships. All individuals whose detected relationships differ from the annotated ones should be removed from all analyses. Also, whenever the analyses require a sample of unrelated individuals, an effort should be made to retain only one individual from each group of relatives. The same applies to identical individuals (pi-hat = 1). In such cases, the genotype coverage statistics, also reported in the biofile, can help to choose which individual to retain.
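
Retaining one individual per group of relatives can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's own procedure: related pairs are assumed to be available as ''(id_a, id_b, pi_hat)'' tuples, coverage as a dict from ID to a number where higher is better, and all IDs and values below are made up. Relatives are grouped into connected components, and the individual with the highest coverage is kept from each.

```python
def prune_relatives(pairs, coverage, threshold=0.1):
    """Keep one individual per group of relatives.

    pairs: iterable of (id_a, id_b, pi_hat); both IDs must be keys of coverage.
    coverage: dict mapping each individual ID to its genotype coverage.
    Individuals linked by pi_hat >= threshold, directly or through a chain
    of relatives, end up in the same group.
    """
    parent = {iid: iid for iid in coverage}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, pi_hat in pairs:
        if pi_hat >= threshold:
            parent[find(a)] = find(b)

    # Within each group, retain the individual with the highest coverage.
    best = {}
    for iid in coverage:
        root = find(iid)
        if root not in best or coverage[iid] > coverage[best[root]]:
            best[root] = iid
    return sorted(best.values())


coverage = {"A": 0.99, "B": 0.97, "C": 0.98, "D": 0.95}
pairs = [("A", "B", 0.5), ("B", "C", 0.25), ("C", "D", 0.05)]
unrelated = prune_relatives(pairs, coverage)
```

Keeping a single individual per connected component is conservative: in a chain such as A-B-C, A and C may themselves be unrelated, so a more elaborate selection could retain more individuals.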

===== Ethnicity and population substructure =====

Confounding effects may occur as a consequence of genetic differences across population subsamples.
The relevant genetic principal components should therefore be accounted for whenever feasible. 

The QC pipeline annotates each individual's first 20 genetic principal components based on the Haplotype Reference Consortium (HRC) trans-ethnic population data. Efforts are made to ensure that the genetic principal components are equivalent across genotyping batches. Only a subsample of unrelated HRC individuals, i.e. **none of the individuals for whom principal components are computed**, is used to determine the eigenspace onto which the individuals are projected. The genetic principal components can be found in the biofile (see above).

Different strategies can be applied depending on your sample. Two exemplary (constructed) scenarios are depicted below.

{{pcb.png?320}} {{pca.png?320}}

On the left-hand side is a case in which two clearly distinguishable clusters are present. In this case, it is advisable to treat the participants in the two clusters as members of two distinct groups. More than two groups may of course be present.

On the right-hand side is a case in which a few potential outliers are present. Depending on the size and the separation of the bulk of the data, it may be best either to exclude the potential outliers or to keep them so as not to lose too much power in the analyses.

Despite the efforts put into harmonizing the principal components across genotyping batches, unmanageable deviations may persist on occasion. In such cases, it may be preferable to [[https://www.sciencedirect.com/science/article/pii/S0749596X13001368|residualize]] the principal components against genotyping batch ahead of the analyses.
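
For a categorical predictor such as genotyping batch, the residuals of an ordinary least squares regression of a principal component on batch (with intercept) are simply the values minus their batch means. The sketch below relies on this equivalence and uses only the standard library; the values and batch labels are made up.

```python
from collections import defaultdict


def residualize_on_batch(values, batches):
    """Residualize one principal component against genotyping batch.

    Equivalent to taking the residuals of an OLS regression of the values
    on batch indicator variables: each value minus the mean of its batch.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, b in zip(values, batches):
        sums[b] += v
        counts[b] += 1
    means = {b: sums[b] / counts[b] for b in sums}
    return [v - means[b] for v, b in zip(values, batches)]


# Toy principal component values for two batches.
pc1 = [1.0, 3.0, 10.0, 14.0]
batch = ["a", "a", "b", "b"]
pc1_resid = residualize_on_batch(pc1, batch)
```

The same function would be applied to each principal component in turn. Note that this removes all between-batch differences in the component means, including any that reflect genuine ancestry differences between batches.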


==== GWAS and polygenic scores ====

The polygenic scores are very often based on European GWAS summary statistics. Whenever this is the case, the use of polygenic scores for individuals deviating too far from the European cluster is generally discouraged. Consult the individual GWAS publications in the [[https://docs.google.com/spreadsheets/d/19cURugXQQyLgfLU-gwCReuWK99DcpODOpSyDkaR-bow/|GWAS Inventory]] for more information on the samples they involved.

----

==== Contact ====

For more information, contact <thomas.bjella@medisin.uio.no> or <francesco.bettella@medisin.uio.no>.
