CINECA Synthetic Cohort EUROPE UK1

Dataset managers: Coline Thomas, Isuru Liyanage & Dylan Spalding

This dataset consists of 2504 samples, each with genetic data based on 1000 Genomes data (Phase 3 and Geuvadis) and 76 synthetic subject attributes and phenotypes derived from UK Biobank. The UK Biobank is a very large and detailed prospective study with over 500,000 participants aged 40–69 years when recruited in 2006–2010. The study has collected, and continues to collect, extensive phenotypic and genotypic detail about its participants, including data from questionnaires, physical measures, sample assays, accelerometry, multimodal imaging, genome-wide genotyping and longitudinal follow-up for a wide range of health-related outcomes.


Subject attributes and phenotypic data

The sample attributes and phenotypic data were selected to cover the minimal metadata model and the three main CINECA use cases, plus additional general and demographic attributes, resulting in a total of 76 sample attributes or phenotypes. The phenotypic data were initially derived using the TOFU tool, which generates random values based on the UK Biobank data dictionary. Categorical values were randomly generated based on the data dictionary, continuous variables were generated based on the distribution of values reported by the UK Biobank showcase, and date/time values were random. Some related phenotype variables were regrouped by category, such as 'Nervous system disorders' or 'Respiratory system disorders'. Once the initial set of phenotypes and attributes was generated, the data were checked for consistency and, where possible, dependent attributes were calculated from the independent variables generated by TOFU. For example, BMI was calculated from height and weight data, and age at death was calculated from date of death and date of birth.
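The two derived attributes mentioned above can be illustrated with a minimal Python sketch. It is not the code used to build the dataset: the function names, the units (height in cm, weight in kg) and the choice to report age at death in completed years are assumptions made for illustration only.

```python
from datetime import date


def bmi(height_cm: float, weight_kg: float) -> float:
    """Body mass index in kg/m^2 (assumes height in cm and weight in kg)."""
    if height_cm <= 0:
        raise ValueError("height must be positive")
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def age_at_death(date_of_birth: date, date_of_death: date) -> int:
    """Age at death in completed years (an assumed convention)."""
    if date_of_death < date_of_birth:
        raise ValueError("date of death precedes date of birth")
    years = date_of_death.year - date_of_birth.year
    # Subtract one year if the birthday had not been reached in the year of death.
    if (date_of_death.month, date_of_death.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years
```

Deriving dependent values this way, rather than generating them independently, keeps each synthetic record internally consistent: a randomly drawn BMI would generally not agree with a randomly drawn height and weight.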

 
Genetic data for quality control pipelines

The genetic data are derived from the 1000 Genomes Phase 3 release (v5a Phase 3 VCFs, across all autosomes and chromosome X), using all 2504 samples. The genotype data consist of a single joint-called VCF file with called genotypes for all 2504 samples, plus bed, bim, fam, and nosex files generated via plink for these samples and genotypes. A variety of errors have been introduced into the genotype data to mimic real data and to serve as a test for quality control pipelines. These include gender mismatches, ethnic background mislabelling and low call rates for a randomly chosen subset of sample data, as well as deviations from Hardy-Weinberg equilibrium.
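As a rough illustration of two of the checks a quality control pipeline might run against these introduced errors, the sketch below computes a call rate and a chi-square statistic for deviation from Hardy-Weinberg proportions at a biallelic site. The function names and the in-memory input format (a list of calls with None for missing, and three genotype counts) are hypothetical; a real pipeline would read the VCF or plink files and would typically rely on an established tool.

```python
def call_rate(genotypes):
    """Fraction of non-missing genotype calls; None marks a missing call."""
    if not genotypes:
        raise ValueError("no genotypes supplied")
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 degree of freedom) comparing observed genotype
    counts at a biallelic site with Hardy-Weinberg expectations."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotype counts supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic sites) to avoid dividing by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

With one degree of freedom, a statistic above about 3.84 corresponds to p < 0.05. Any thresholds applied to the call rate or to the Hardy-Weinberg test are a choice left to the pipeline being tested and are not specified by this dataset.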

Additionally, 40 samples have raw genetic data available in the form of both bam and cram files, including unmapped data. The gender of the samples in the 1000 Genomes data has been matched to the synthetic phenotypic data generated for these samples. However, no other genotype/phenotype matching was done on this dataset.


RNAseq data for the CINECA eQTL use-case

RNAseq data from the Geuvadis collection were already available for 445 individuals in the 1000 Genomes Phase 3 data and can be used for eQTL analysis. These files were added to the dataset and linked to a set of corresponding synthetic samples.

The phenotypic sample data are available in the development version of BioSamples and are licensed under the Creative Commons Attribution licence (CC-BY). The genetic data were linked to the synthetic phenotype data in BioSamples and submitted to EGA as dataset UK1 (dataset ID EGAD00001006673). The 1000 Genomes data are licensed under the Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) license.