
==== Introduction ====

NORMENT (UiO) and MMIL (UCSD) have a joint application (Application ID number 27412) to use data from the UK Biobank (n=500k). The application focuses on mental disorders but is quite broad and we have been granted access to (almost) the full UKB cohort. The UK Biobank data include genotype data, magnetic resonance imaging (MRI) raw data (multiple brain measures, abdomen, heart), registry data (hospital diagnoses, cancer registry, death registry), physical measures (blood pressure, pulse, ECG etc) and self-assessment questionnaires (cognitive tests, personality, environmental exposures etc). 

For an overview of the publications and ongoing projects linked to Application 27412, please see [[https://docs.google.com/document/d/1sluVlmhka-6-ULmdupmznSIUF46CVOixLoTONxxKdZQ/edit|here]].
If your project is missing from the table, please add its details.

On this page you will find information about:
  * links to web pages describing the UK Biobank's organization and what data are available from it
  * info on how to access the data at NORMENT (TSD) and on the MMIL servers (ip24.ucsd.edu)

Please do not hesitate to ask the UKB data access team (Oleksandr Frei <oleksandr.frei@medisin.uio.no>, Dennis van der Meer <d.van.der.meer@medisin.uio.no>, Chun Chieh Fan <chunchiehfan@gmail.com>, Olav Bjerkehagen Smeland <o.b.smeland@medisin.uio.no>) if you have any questions. In turn, we hope you'll do your part by contributing to this wiki by updating it with whatever information you think is missing.

==== Getting started ====

UKB Showcase ( http://biobank.ctsu.ox.ac.uk/crystal ) is a publicly available resource that gives an overview of UK Biobank data. You may use this resource even if you are not registered for access to UKB data. If you are new to UK Biobank data, please take some time to familiarise yourself with UKB Showcase, as this resource is extremely useful.

UK Biobank Access Management System, or AMS ( https://bbams.ndph.ox.ac.uk/ams/resProjects ), is where the NORMENT project is registered. Every researcher interested in using UK Biobank resources must register here. Remember to use your UiO or UCSD e-mail address. Approval typically takes 1-2 weeks. After approval, send an e-mail to Olav Smeland with a request to be invited as a collaborator. The NORMENT application ID is 27412. As a side note, people sometimes refer to the NORMENT application as 2741, because the last digit is a checksum, but the full application ID is 27412.

We organize an in-house copy of UKB data both at MMIL servers (which serve as a "master" copy), and on TSD (which works as a replica of the MMIL data). The original data downloaded from UKB are stored in the UKBLAKE folder (see below). Any derived data, including unpacked phenotype spreadsheets, are stored in the UKBDATA folder (see below). For more details see "[[#Folders structure]]" further below on this page.

==== Data access policy and MTA ====

Every analyst using UK Biobank data must comply with the UK Biobank data usage policy.

Please read the "Executed MTA" (30-Jan-2018), which is available at the AMS portal.
To access the document, log in to your AMS account, select "projects" in the left-hand menu,
and click the "Documents" button for project 27412. Among other things, the MTA covers the following points:

  * Every analyst must have an AMS account approved for access to our UKB application
  * Please respect data security (keep the data on TSD, do not attempt to identify participants)
  * Exclude participants who withdraw from UK Biobank (see below for technical details)
  * Submit derived data back to the UK Biobank
  * Using external resources to screen UK Biobank participants is not allowed
  * For full details, refer to "Executed MTA document".

When conducting a study, please ensure that you follow the UK Biobank researcher requirements:
  * Include the acknowledgement, “This research has been conducted using data from UK Biobank, a major biomedical database” and where appropriate, include a link to the UK Biobank website: www.ukbiobank.ac.uk
  * Include your UK Biobank project ID number in research papers and presentations (NORMENT application ID is 27412).
  * Notify UK Biobank of any research outputs resulting from the project at least 2 weeks in advance of their publication or release. More details [[https://www.ukbiobank.ac.uk/media/c4yefr4t/210527-uk-biobank-communications-guidelines.pdf|here]]
  * Inform UK Biobank of any press releases associated with a UK Biobank project
  * Return the results of UK Biobank projects within 6 months of public release, including a copy of the manuscript or other publicly available results, any derived variables and the code used to generate the results, so that they can be shared with other researchers. More details [[https://biobank.ndph.ox.ac.uk/showcase/exinfo.cgi?src=returning_results|here]]

Violating these requirements may result in [[https://retractionwatch.com/2021/06/14/a-costly-mistake-prompts-retraction-of-paper-on-hair-loss/|retraction of the publication]].

==== Data organization ====

The UK Biobank data are organized into five large categories:

  - Phenotypes
  - Participant withdrawal lists
  - Genetics data
  - Bulk data
  - Derived data (produced at NORMENT or MMIL)

Phenotypes (or "Phenotype spreadsheets") is simply a giant table with rows corresponding to ~500,000 individuals, and columns corresponding to different features ("Data Fields", according to UKB terminology). As the NORMENT UKB application evolves over time, all "Data Fields" are organized into "Baskets", which receive new "Refreshes" (i.e. get updated) once UKB data change, for example due to follow up interviews with participants. See below on this page for the latest status of baskets refreshes.
Note that some data fields have multiple values. For example, there are 40 [[http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=22009|genetic principal components]].

Participants are free to withdraw from UK Biobank at any time and request that their data no longer be used.
The latest withdrawal list is available in ''<UKBLAKE>/participant_withdrawal''.
It is your responsibility to apply the participant withdrawal list to your analysis. Please respect this.

We recommend using the ''ukb_helper.py'' script from https://github.com/precimed/ukb to extract a list of data fields spread across multiple baskets and data refreshes. ''ukb_helper.py'' lets you pass multiple exclusion lists to filter out certain individuals, which is a convenient way to apply the participant withdrawal lists. For more info run ''python ukb_helper.py pheno --help''.
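
If you filter a spreadsheet by hand rather than through ''ukb_helper.py'', the idea is simply to drop every row whose ID is on the withdrawal list. Below is a minimal sketch of that step, not the code used by ''ukb_helper.py''. It assumes that the withdrawal list holds one participant ID per line without a header, and that the phenotype spreadsheet is a CSV whose ID column is called ''eid''. The function names and sample values are made up for illustration.

<code python>
import csv
import io


def read_withdrawn_ids(lines):
    """Collect IDs from a withdrawal list (assumed: one ID per line, no header)."""
    return {line.strip() for line in lines if line.strip()}


def drop_withdrawn(rows, withdrawn, id_column="eid"):
    """Yield only the rows whose ID is not on the withdrawal list."""
    for row in rows:
        if row[id_column] not in withdrawn:
            yield row


# Tiny in-memory stand-ins for the real files (made-up IDs and values).
withdrawal_file = io.StringIO("1000002\n1000004\n")
phenotype_file = io.StringIO(
    "eid,31-0.0\n1000001,0\n1000002,1\n1000003,1\n1000004,0\n"
)

withdrawn = read_withdrawn_ids(withdrawal_file)
kept = list(drop_withdrawn(csv.DictReader(phenotype_file), withdrawn))
kept_ids = [row["eid"] for row in kept]
print(kept_ids)
</code>

With real data, replace the ''io.StringIO'' objects with ''open(...)'' on the withdrawal file and your phenotype CSV.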

Genetics data are described [[http://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=263|here]] and copied in the ''<UKBLAKE>/genetics'' data folders. 

Bulk data are mainly imaging measures. The NORMENT application has access to most of these data; however, due to their size, we handle access to the bulk data on a case-by-case basis. Please contact Dennis van der Meer <d.van.der.meer@medisin.uio.no> if you need to download some of the UKB bulk data to MMIL or TSD.

Derived data, produced by NORMENT or MMIL, are organized in the ''<UKBDATA>'' folder, as described below.

==== Baskets refreshes status  ====

If you would like to include additional data fields in our application, or if you would like to request a refresh of certain data fields, please contact Olav Smeland, and CC Dennis van der Meer and Oleksandr Frei. If your refresh request has been processed by the UKB team but the data haven't been downloaded yet, please contact Oleksandr Frei.

To find the latest phenotype spreadsheet containing a specific data field you need to know (1) what basket contains your data field, and (2) what was the latest refresh of that basket. Unfortunately, this information is only available through AMS for those researchers who have the right to download the data. For convenience, we report it below. The ID column of the following table gives you the name of the file in the ''<UKBDATA>/phenotypes/Baskets'' folder. The Field IDs available in each Basket and the Field names/titles those IDs correspond to are accessible through AMS:
  * Log in to AMS
  * Choose "projects" in the left panel and click the blue "View/update" button
  * You will get a page describing the NORMENT application to UKB. You need to scroll quite far down to see the basket descriptions, where you will find the mappings.

Status as of Nov 2nd, 2021:

<code>

ID	Basket ID	Description	Size (MB)	Release Date
49082	2005317	Refresh 37809	4.4e+03	22 Oct 2021
48842	2014918	Additional data	18.1	13 Oct 2021
48431	39315	Refresh 37331	2.41e+03	17 Sep 2021
48198	2013195	Additional data	128	3 Sep 2021
47910	2008608	Refresh 36940	3.16	18 Aug 2021
47785	2001809	Refresh 36723	2.83	6 Aug 2021
47474	39315	Refresh 36377	2.4e+03	2 Jul 2021
47200	2007375	Refresh 36034	554	17 Jun 2021
46836	2012523	Additional data	7.81	12 May 2021
46753	2000151	Refresh 35801	42.4	13 May 2021
45686	39315	Refresh 35168	2.4e+03	14 Mar 2021
45345	2011195	Additional data	1.64	7 Feb 2021
* 43789	2009691	Additional data	2.78	17 Sep 2020
* 42438	2008608	Additional data	2.22	20 Jun 2020
* 42012	2008063	Additional data	3.66	12 May 2020
* 41296	2007375	Additional data	494	18 Mar 2020
* 40545	11385	Refresh 30718	86	19 Feb 2020
* 40544	2000151	Refresh 30717	27.1	19 Feb 2020
* 40543	2000742	Refresh 30716	279	19 Feb 2020
* 40542	2001394	Refresh 30715	20.2	19 Feb 2020
* 40541	2001809	Refresh 30714	2.9	19 Feb 2020
* 40540	2004436	Refresh 30713	565	19 Feb 2020
* 40539	2005317	Refresh 30712	4.33e+03	19 Feb 2020
* 40502	39315	Refresh 30668	2.39e+03	12 Mar 2020
* 37384	2005317	Additional data	4.29e+03	20 Sep 2019
* 37115	2004436	Refresh 28595	556	28 Sep 2019
* 37113	2001809	Refresh 28593	3	28 Sep 2019
* 37112	2001394	Refresh 28592	20	28 Sep 2019
* 37111	2000742	Refresh 28589	279	28 Sep 2019
* 37110	39315	Refresh 28588	2.33e+03	27 Sep 2019
* 37109	2000151	Refresh 28587	27.1	27 Sep 2019
* 37108	11385	Refresh 28586	87.6	27 Sep 2019
* 37107	10643	Refresh 28585	4.88	27 Sep 2019
* 32537	2004436	Additional data	545	14 Jun 2019
* 29266	39315	Refresh	2.33e+03	19 Apr 2019
* 29060	10643	Refresh 26850	4.4	18 Apr 2019
* 28289	2003462	Automated exome 50K release	2.47	29 Mar 2019
* 27125	2001394	Refresh 24844	19.8	3 Apr 2019
* 27107	39315	Refresh 24810	2.33e+03	27 Mar 2019
* 26346	2001809	Refresh 24188	2.88	8 Feb 2019
* 26124	10643	Refresh 24022	4.41	15 Jan 2019
* 26123	11385	Refresh 24021	85	15 Jan 2019
* 26122	2000151	Refresh 24019	26.3	15 Jan 2019
* 26121	2000742	Refresh 24018	279	15 Jan 2019
* 24843	39315	Re-issuing minus sensitive field	2.3e+03	2 Nov 2018
* 24434	2001394	Additional data	19.7	2 Nov 2018
* 24184	2000742	Refresh 22710	279	2 Nov 2018
* 23669	2000742	Additional data	279	3 Sep 2018
* 23605	2000742	Additional data	279	28 Aug 2018
* 23402	2000151	Refresh 22306	25.8	14 Aug 2018
* 23401	11385	Refresh 22305	83.6	14 Aug 2018
* 23400	10643	Refresh 22304	3.67	14 Aug 2018
* 22124	10335	Refresh 21359	2.24e+03	15 May 2018
* 22123	10643	Refresh 21358	3.67	15 May 2018
* 22122	11385	Refresh 21357	83.6	15 May 2018
* 21700	2000151	Refresh 20949	25.7	16 Apr 2018
* 20713	11385	Refresh 20015	83.4	19 Mar 2018
* 20712	10643	Refresh 20014	3.67	19 Mar 2018
* 20711	10335	Refresh 20013	2.24e+03	19 Mar 2018
* 11386	11385	Dataset R11386, released 08-Dec-2017	82.5	8 Dec 2017
* 10644	10643	Dataset R10644, released 15-Sep-2017	3.67	15 Sep 2017
* 10336	10335	Dataset R10336, released 23-Aug-2017	2.22e+03	23 Aug 2017
</code>

The 50 expired datasets (marked with *) are no longer available for download via UK Biobank Showcase, but they are still in place in the DATALAKE / UKBDATA folders at MMIL / TSD.

NB! The above table is not properly formatted - sorry if it's hard to read. Just to give you an idea of how it looks in UKB Showcase, here is an older screenshot taken in August 2019:
{{:ukb_buckets_jun2019.png?700|}}
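
To look up the most recent dataset for each basket without scanning the status table by eye, the tab-separated rows can be parsed with a few lines of Python. This is only a sketch. It assumes that, within one basket, a higher dataset ID means a newer release, which appears to hold for the rows listed above. The function name is made up, and only a few rows of the table are copied in.

<code python>
import csv
import io

# A few rows copied from the status table above (tab-separated).
TABLE = (
    "ID\tBasket ID\tDescription\tSize (MB)\tRelease Date\n"
    "49082\t2005317\tRefresh 37809\t4.4e+03\t22 Oct 2021\n"
    "48431\t39315\tRefresh 37331\t2.41e+03\t17 Sep 2021\n"
    "47474\t39315\tRefresh 36377\t2.4e+03\t2 Jul 2021\n"
    "* 40539\t2005317\tRefresh 30712\t4.33e+03\t19 Feb 2020\n"
)


def latest_dataset_per_basket(table_text):
    """Map each basket ID to its highest dataset ID (assumed to be the newest)."""
    latest = {}
    for row in csv.DictReader(io.StringIO(table_text), delimiter="\t"):
        # Expired datasets carry a leading "* " marker in the ID column.
        dataset_id = int(row["ID"].lstrip("* "))
        basket = row["Basket ID"]
        if dataset_id > latest.get(basket, -1):
            latest[basket] = dataset_id
    return latest


latest = latest_dataset_per_basket(TABLE)
print(latest)
</code>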


==== Subject IDs in UK Biobank ====

Note that participant IDs are application-specific (e.g. specific to 27412).
Within the NORMENT application, subject IDs are consistent across phenotype (across refreshes), genetics and bulk data.

Negative IDs represent individuals who withdrew their consent to participate in the UKB study. See [[https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=664|here]] for more information.

For genetics data, participant withdrawals are periodically applied to the ''.sample'' and ''.fam'' files. You may look at the latest "link" file in the genetics folder (i.e. either a ''.fam'' file or a ''.sample'' file). The "s488285" suffix (or similar) indicates the version, which gets updated as participants withdraw. However, the ''.sample'' and ''.fam'' files are updated less often than the withdrawal lists, and the latest withdrawal files are available in the ''<UKBLAKE>/participant_withdrawal'' folder.
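
As an illustration of both rules (negative IDs and the separate withdrawal list), here is a minimal sketch that filters the lines of a PLINK ''.fam'' file. It relies on the standard ''.fam'' layout, where the individual ID is the second whitespace-separated column. The function name and the sample lines are made up.

<code python>
def is_active_participant(fam_line, withdrawn_ids=frozenset()):
    """True if the individual ID is positive and not on the withdrawal list."""
    individual_id = fam_line.split()[1]  # columns: FID IID father mother sex phenotype
    return int(individual_id) > 0 and individual_id not in withdrawn_ids


# Made-up .fam lines: one regular, one with a negative ID, one withdrawn later.
fam_lines = [
    "1000001 1000001 0 0 1 -9",
    "-1 -1 0 0 2 -9",
    "1000003 1000003 0 0 2 -9",
]
latest_withdrawals = {"1000003"}

kept = [line for line in fam_lines
        if is_active_participant(line, latest_withdrawals)]
print(kept)
</code>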

==== Folders structure ====

Master paths at MMIL:
<code>
# raw data imported from UKB (this is the key dataset)
UKBLAKE=/space/gwas-ds1/1/data/GWAS/UKBLAKE     

# derived and processed data (perhaps less relevant for you)
UKBDATA=/space/gwas-syn1/1/data/GWAS/UKBioBank  
</code>

Replicas at Oslo TSD, maintained by regular sync with tsd-s3api:
<code>
/tsd/p33/data/durable/s3-api/ukblake - a replica of <UKBLAKE>/phenotypes
                                      and <UKBLAKE>/genetics;
/tsd/p33/data/durable/s3-api/ukbdata - a replica of <UKBDATA>/{phenotypes,
                                      curated, covariates, subject_list}
</code>

Data lake:
<code>
<UKBLAKE>/phenotypes/ukbNNNNN.enc - raw phenotype spreadsheets
                                    (all buckets and refreshes)
                                    including enc_ukb, csv, r, tab, ... 
                                    (ukbconv output, which previously was
                                    in <UKBDATA>/phenotypes/Baskets)
<UKBLAKE>/genetics                - genotypes and imputed data
                                    (vcf from whole-exome may fit here too)
<UKBLAKE>/bulk/<FIELD>/           - bulk data, one folder per field
<UKBLAKE>/exome                   - raw whole-exome data (in the future)

(!) Everything else goes to <UKBDATA>
</code>

Processed data:
<code>
<UKBDATA>/users/<USERNAME>   - personal stuff, one folder per user
<UKBDATA>/phenotypes/<other> - project-specific curated spreadsheets that are
                               intended to be shared, such as imaging output
                               spreadsheets
<UKBDATA>/genotypes/         - genetics data in plink format, named as follows
                               <batch>_<postfix>_<chunk>.{bim/bed/fam}
                               where
                               * batch is something like UKB33k or UKB500k,
                                 number indicates the number of subjects 
                               * postfix - qualifier, such as QCed or not
                               * chunk - positional chunk (chrX, chrX_XXX_XXX)
<UKBDATA>/processed_bulk/    - output of the FreeSurfer pipeline from Oslo
                               (recon-all for each subject), kept in its
                               original structure: one folder per subject,
                               as in the original FreeSurfer output, unless
                               the pipeline has been changed
<UKBDATA>/covariates/        - standard covariates applicable to all GWAS,
                               including: sex, age, all genetic PCs,
                               genetic ancestry, genotyping plate numbers
<UKBDATA>/subject_list/      - consensus subject lists that everyone is
                               expected to use, such as the list of
                               unrelated European individuals
</code>

The curated spreadsheets in phenotypes, covariates, and subject_list must be versioned with a time stamp, using the following naming convention:
''UKB_<type>_<postfix>_<TIME STAMP>.<ext>'',
where
''type={phenotype, covariate, or subject_list}'',
''postfix={postfix for the recognition purpose}''
and
''ext={csv, txt, etc}'' is any analyzable file extension.
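
A small helper can keep file names consistent with this convention. The sketch below is only an illustration: the time stamp format is not specified on this page, so ''YYYYMMDD'' is an assumption, and the function name is made up.

<code python>
from datetime import date

ALLOWED_TYPES = ("phenotype", "covariate", "subject_list")


def curated_name(kind, postfix, ext, stamp=None):
    """Build UKB_<type>_<postfix>_<TIME STAMP>.<ext> (time stamp assumed YYYYMMDD)."""
    if kind not in ALLOWED_TYPES:
        raise ValueError(f"type must be one of {ALLOWED_TYPES}, got {kind!r}")
    stamp = stamp or date.today()
    return f"UKB_{kind}_{postfix}_{stamp:%Y%m%d}.{ext}"


example = curated_name("covariate", "basic", "csv", date(2021, 11, 2))
print(example)
</code>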

Misc notes:
  * Folders ''<UKBLAKE>'' and ''<UKBDATA>'' at MMIL are jointly maintained by Chun, Alex, Dennis and Tobias - all of us should feel free to put new stuff there as it arrives.
  * In Oslo we have some UK Biobank data on the NIRD system in ''/projects/NS9114K/UKBio'', but it's mainly for Dennis and Tobias to manage bulk data and import/export to/from TSD.
  * For now we don't replicate bulk data from MMIL to Oslo, but we can if it becomes useful for someone in Oslo.
  * Dennis, Tobias - after we have ''<UKBLAKE>'' and ''<UKBDATA>'' replicas on TSD we should clean all other duplicates from TSD and NIRD (such as ''/tsd/p33/data/durable/external/ukbio'').


==== Extract data fields ====

Phenotype spreadsheets (''ukbNNNNN.csv'') from the UK Biobank are very large and might be hard to work with. 
One option is to use the ukbconv utility, provided by the UK Biobank, to extract a smaller subset of data fields that are relevant for you. The input data for ukbconv are the files named ''ukbNNNNN.enc_ukb'' under ''/tsd/p33/data/durable/s3-api/ukbdata/phenotypes/Baskets''. Instructions for ukbconv should be available in [[https://biobank.ndph.ox.ac.uk/~bbdatan/Accessing_UKB_data_v2.1.pdf|this document]].
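
If you already have a converted ''ukbNNNNN.csv'', another option is to keep only the columns you need. In these CSV files the column names follow the pattern ''<field>-<instance>.<array>'' (e.g. ''22009-0.1''), with the participant ID in the ''eid'' column. The sketch below selects all columns belonging to a set of field IDs. It is a plain-Python illustration, not a replacement for ukbconv; the function names and the sample data are made up.

<code python>
import csv
import io


def columns_for_fields(header, field_ids):
    """Return 'eid' plus every column whose field ID is in field_ids."""
    wanted = {str(field_id) for field_id in field_ids}
    return [name for name in header
            if name == "eid" or name.split("-", 1)[0] in wanted]


def extract_fields(csv_file, field_ids):
    """Yield the reduced header first, then one reduced row per participant."""
    reader = csv.DictReader(csv_file)
    keep = columns_for_fields(reader.fieldnames, field_ids)
    yield keep
    for row in reader:
        yield [row[name] for name in keep]


# Made-up miniature spreadsheet; real files have thousands of columns.
sample = io.StringIO(
    "eid,31-0.0,22009-0.1,22009-0.2,21022-0.0\n"
    "1000001,1,0.5,-0.2,60\n"
)
subset = list(extract_fields(sample, [22009]))
print(subset)
</code>

Because the rows are streamed one at a time, the full spreadsheet never has to fit in memory.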

==== Old download location ====

These locations were used before, but now are deprecated. See above for the new folders structure.

<code>
/space/gwas-ds2/1/data/GWAS/UKBio            # at MMIL, ip24.ucsd.edu
/tsd/p33/data/durable/external/ukbio         # at TSD, project p-33
/usr/local/syn01/UKBio                       # at Ullevaal 48
</code>

The following excel spreadsheet is outdated, but at some point it described all fields that we had access to. Now there are more, please see AMS.

{{:ukb_oslo_sandiego_27412.xlsx|}}


==== Download process and other technicalities ====

To download phenotype spreadsheets we use the e-mails from UK Biobank, which contain a link to a file such as ''ukb10336.enc'' that represents our application (ca. 2 GB, downloaded via browser). The file contains the packed and encrypted phenotype spreadsheet, and also makes it possible to download bulk and genetics data. We also received a decryption key, ''k2741.key'', specific to our application. The ''10336'' in the file name is the "run" ID. Applying for more data results in a new refresh and a new ''.enc'' file.

Then we use:
  * the ''ukbunpack'' utility (http://biobank.ndph.ox.ac.uk/showcase/download.cgi) to unpack ''ukb10336.enc'' into ''ukb10336.enc_ukb''
  * the ''ukbconv'' utility to extract the phenotype spreadsheet from ''ukb10336.enc_ukb''
  * the ''ukbgene'' utility to download genetics data, raw and imputed, more info https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=664
  * the ''ukbfetch'' utility to download bulk data, more info https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=644

For information about UKB utilities see [[http://biobank.ctsu.ox.ac.uk/crystal/docs/UsingUKBData.pdf|this document]].

We did not use https://ega-archive.org - The European Genome-phenome Archive. This resource is quite confusing. UK Biobank released their genotype data (raw and imputed) as two datasets via EGA. Neither imaging nor phenotype data is released via EGA. Genotype data is also available for download from showcase (without EGA).

  * MD5 sums for genotype data - http://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=997 (it would be nice to find similar resources for imaging data).

Previously, we used the following tunnel solution for faster upload from NIRD to TSD via Abel, but now we download/upload data to TSD with the [[https://github.com/unioslo/tsd-api-client|tsd-api client]].

<code>
# Create tunnel on abel. 
# Accept connections on port 2233 on localhost. 
# Tell abel to forward connections to tsd-fx01.tsd.usit.no:22
ssh abel.uio.no -L2233:tsd-fx01.tsd.usit.no:22

# Connect to port 2233 on localhost. 
# Connection will be forwarded to tsd-fx01.tsd.usit.no:22
sftp -P 2233 p33-ofrei@localhost
</code>

Example of using ukbconv to extract the list of bulk items for a given field (the resulting list is the input for ukbfetch):
<code>
./ukbconv ukb10336.enc_ukb bulk -s20227 -ofield.20227.ids
</code>

==== Other resources ====

  * http://imaging.ukbiobank.ac.uk/ - number of participants scanned as of today
  * http://big.stats.ox.ac.uk - Oxford Brain Imaging Genetics (BIG) Server
  * https://cran.r-project.org/web/packages/ukbtools/ukbtools.pdf - useful tools to work with phenotype spreadsheets
  * www.nealelab.is/blog/2017/7/19/rapid-gwas-of-thousands-of-phenotypes-for-337000-samples-in-the-uk-biobank - rapid UKB GWAS on many phenotypes.
  * https://doi.org/10.1192/bjo.2019.100 - Mental Health Questionnaire in UKB and relevant phenotypes

==== Genetic data QC ====

  * The https://github.com/Nealelab/UK_Biobank_GWAS/ resource has very useful info about QC of UK Biobank genetic data. We reproduced similar results in ''<UKBDATA>/projects/plsa_mixer/ukb_genetics_qc'', containing data for 337145 individuals and 12926669 variants. The ''ukb_bed'' folder contains hard calls, and ''ukb_pgen'' contains dosage files.
  * This reference doesn't aim to be a universal solution suitable for all projects, but it may be a good start for doing a quick GWAS to explore the results. Note that subjects with imaging data were not prioritized, i.e. the exclusion filter for unrelated individuals could have been applied differently to give priority to participants with MRI data. As of now this hasn't been done.
  * Samples QC is based on the ''ukb_sqc_v2.txt'' file: keep unrelated individuals (''used.in.pca.calculation''), keep the white British ancestry subset (''in.white.British.ancestry.subset''), remove sex chromosome aneuploidy (''putative.sex.chromosome.aneuploidy''), and exclude withdrawn participants.
  * Variants QC is based on the ''ukb_mfi_chr@_v3.txt'' files (provided by UKB) and the afreq/hardy/vmiss files calculated by us for the subset of 337145 individuals. We keep autosomes only and apply PLINK's ''--maf 0.001 --geno 0.1 --hwe 1e-10'', in addition to filtering out variants with INFO below 0.8 and excluding SNPs with a duplicated SNP ID (all such SNPs were excluded, i.e. without keeping the first or last occurrence). Some numbers: TOTAL SNPs - 93095623; MAF below 0.001 - 76572747; VMISS above 0.1 - 487462; HWE below 1e-10 - 217482; INFO below 0.8 - 64154249; duplicated SNP label - 637921; remaining after filtering - 12926669. (Of note, the number without dropping duplicated SNP labels would be 12992936.) The numbers above are for the 22 autosomes; in addition we have 3917799 SNPs (in total) on chrX, and 45906 on chrXY.

<code>
  # assumes UKBLAKE and UKBDATA are set as shown in "Folders structure"
  plink2 --bgen $UKBLAKE/genetics/imp/ukb_imp_chr1_v3.bgen \
         --sample $UKBLAKE/genetics/imp/ukb27412_imp_chr1_v3_s487317.sample \
         --keep $UKBDATA/subject_lists/UKB337145_181219.txt \
         --out ukb_imp_chr1_v3 --freq  --hardy  --missing
</code>
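
To make the variant filter concrete, here is a minimal Python sketch of the same rules applied to per-variant statistics. It is an illustration only, not the code that produced the numbers above. It assumes each variant is a dict with the hypothetical keys ''id'', ''maf'', ''missing'', ''hwe_p'' and ''info'', and that duplicated IDs are counted over the full input before any threshold is applied.

<code python>
from collections import Counter


def passes_thresholds(variant):
    """Mirror --maf 0.001 --geno 0.1 --hwe 1e-10 plus the INFO >= 0.8 rule."""
    return (variant["maf"] >= 0.001
            and variant["missing"] <= 0.1
            and variant["hwe_p"] >= 1e-10
            and variant["info"] >= 0.8)


def filter_variants(variants):
    """Drop every variant with a duplicated ID, then apply the thresholds."""
    id_counts = Counter(variant["id"] for variant in variants)
    return [variant for variant in variants
            if id_counts[variant["id"]] == 1 and passes_thresholds(variant)]


# Made-up statistics for a handful of variants.
good = {"maf": 0.2, "missing": 0.01, "hwe_p": 0.5, "info": 0.99}
variants = [
    {**good, "id": "rs1"},
    {**good, "id": "rs2", "maf": 0.0005},   # too rare
    {**good, "id": "rs3"},                  # duplicated ID, both copies dropped
    {**good, "id": "rs3"},
    {**good, "id": "rs4", "info": 0.5},     # poorly imputed
    {**good, "id": "rs5", "hwe_p": 1e-12},  # fails HWE
    {**good, "id": "rs6"},
]
kept_ids = [variant["id"] for variant in filter_variants(variants)]
print(kept_ids)
</code>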

==== Keeping the data in sync between MMIL and Oslo ====

(!) On MMIL servers the default shell is not bash. It's a (very) good idea to type ''bash'' in each new shell.

<code>
cd /space/gwas-ds1/1/data/GWAS/UKBLAKE && tsd-s3cmd sync --no-check-md5 --delete-removed phenotypes s3://ukblake
cd /space/gwas-ds1/1/data/GWAS/UKBLAKE && tsd-s3cmd sync --no-check-md5 --delete-removed participant_withdrawal s3://ukblake
cd /space/gwas-ds1/1/data/GWAS/UKBLAKE && tsd-s3cmd sync --no-check-md5 --delete-removed genetics s3://ukblake
</code>

For new phenotype data: download the file in the browser using UK Biobank Showcase (NB - remember to log in to AMS in Showcase). This gives a ''ukbNNNNN.enc'' file. Copy over the key from the e-mail. Then:

<code>
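cd /space/gwas-ds1/1/data/GWAS/UKBLAKE/phenotypes
# unpack and decrypt; produces ukbNNNNN.enc_ukb
./ukbunpack ukbNNNNN.enc <the key>
# convert to each output format; "|| true" ignores a non-zero exit status
./ukbconv ukbNNNNN.enc_ukb csv || true
./ukbconv ukbNNNNN.enc_ukb r || true
./ukbconv ukbNNNNN.enc_ukb docs || true
./ukbconv ukbNNNNN.enc_ukb bulk || true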
</code>

For more examples see ''/space/gwas-ds1/1/data/GWAS/UKBLAKE/phenotypes/Makefile'' (but use it wisely - don't just copy those commands :))
