The CINECA project has produced a set of synthetic cohort datasets generated from the phenotypic data of four of our participating cohorts: UK Biobank, CoLaus, H3Africa, and the CHILD Cohort Study. The datasets contain no identifiable data and cannot be used to make any inference about the ‘parent’ cohort data or results. All of the synthetic datasets are open access and fully accessible under the Creative Commons licences specified with each dataset. They were developed by CINECA to increase the accessibility of cohort data for standards development, whilst mitigating the ethical and legal privacy concerns that arise with sharing cohort data, including pseudonymised data. Each synthetic dataset has been given a designation that indicates the reference source of the phenotypic data, while making clear that the actual values in the dataset are entirely synthesised, having been derived from the parameters of the source dataset. All sample identifiers are prefixed with 'fake' to avoid confusion with real datasets. Each cohort was consulted repeatedly throughout the process to ensure that the synthetic dataset was appropriate, suitably reflective of the cohort, and free of identifiable information (particularly in free-text data). All cohorts reviewed and approved the first public version and can be consulted on any subsequent versions, which are versioned by date. The description of each dataset below includes the number of samples, how the variables were selected, the tools used to generate attribute values, and where the dataset can be accessed.

The phenotypic variables selected for each dataset were chosen to reflect a selection of the existing variables for the respective parent cohort, with particular reference to select use cases and a minimal metadata model that was developed by CINECA to support querying across jurisdictions for suitable dataset discovery. The reference version of the metadata model was based on a manual review of common fields in cohorts participating in the Maelstrom Catalogue. The model was then converted to Genomics Cohorts Knowledge Ontology (GECKO) format, and further expanded with additional terms to cover the three main CINECA use cases (cancer, cardiovascular disease, and diabetes mellitus), plus COVID-19 search terms. 

The genotypic variables attributed to the datasets were all derived from the 1000 Genomes project Phase 3 release. As the 1000 Genomes data is fully public, it was used to minimise privacy intrusion while maximising the utility of the synthesised datasets by including a range of file and data types. The 1000 Genomes data is licensed under the Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) license. 

A paper describing the development of the synthetic datasets in detail is in preparation.



CINECA synthetic cohort EUROPE UK1

Dataset managers - Coline Thomas, Isuru Liyanage & Dylan Spalding

This dataset consists of 2504 samples which have genetic data based on 1000 Genomes data (Phase 3 and Geuvadis), and 76 synthetic subject attributes and phenotypic data derived from UK Biobank. The UK Biobank is a very large and detailed prospective study with over 500,000 participants aged 40–69 years when recruited in 2006–2010. The study has collected and continues to collect extensive phenotypic and genotypic detail about its participants, including data from questionnaires, physical measures, sample assays, accelerometry, multimodal imaging, genome-wide genotyping and longitudinal follow-up for a wide range of health-related outcomes.


Subject attributes and phenotypic data

The sample attributes and phenotypic data were selected to cover the minimal metadata model and the three main CINECA use cases, plus additional general and demographic attributes, resulting in a total of 76 sample attributes or phenotypes. The phenotypic data were initially derived using the TOFU tool, which generates random values based on the UK Biobank data dictionary. Categorical values were drawn at random from the data dictionary, continuous variables were generated from the distribution of values reported by the UK Biobank showcase, and date/time values were random. Some related phenotype variables were grouped by category, such as 'Nervous system disorders' or 'Respiratory system disorders'. Once the initial set of phenotypes and attributes had been generated, the data were checked for consistency and, where possible, dependent attributes were calculated from the independent variables generated by TOFU. For example, BMI was calculated from height and weight, and age at death was calculated from date of death and date of birth.
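The consistency step described above can be illustrated with a minimal Python sketch. It is not the code used for the dataset: the field names, units (kilograms, centimetres) and example values are assumptions chosen purely for illustration.

```python
from datetime import date


def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kg divided by the square of height in metres."""
    height_m = height_cm / 100.0
    return round(weight_kg / (height_m ** 2), 1)


def age_at(event: date, birth: date) -> int:
    """Completed years between a date of birth and a later event."""
    if event < birth:
        raise ValueError("event precedes date of birth")
    # Subtract one year if the birthday has not yet occurred in the event year.
    before_birthday = (event.month, event.day) < (birth.month, birth.day)
    return event.year - birth.year - int(before_birthday)


# Hypothetical record with independently generated values (assumed field names).
record = {
    "height_cm": 170.0,
    "weight_kg": 65.0,
    "date_of_birth": date(1950, 6, 15),
    "date_of_death": date(2015, 3, 1),
}

# Dependent attributes are computed from the independent ones, not drawn at random.
record["bmi"] = bmi(record["weight_kg"], record["height_cm"])
record["age_at_death"] = age_at(record["date_of_death"], record["date_of_birth"])
```

Deriving dependent values in this way avoids contradictions such as a randomly drawn BMI that does not match the height and weight of the same fake subject.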

 
Genetic data for quality control pipelines

The genetic data are derived from the 1000 Genomes Phase 3 release (v5a Phase 3 VCFs, across all autosomes and chromosome X), using all 2504 samples. The genotype data consists of a single joint-call VCF file with called genotypes for all 2504 samples, plus bed, bim, fam, and nosex files generated via PLINK for these samples and genotypes. A variety of errors have been introduced into the genotype data to mimic real data and to serve as a test for quality control pipelines. These include gender mismatches, ethnic background mislabelling, and low call rates for a randomly chosen subset of samples, as well as deviations from Hardy-Weinberg equilibrium.
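Two of the checks that such a quality control pipeline would typically run, sample call rate and deviation from Hardy-Weinberg equilibrium, can be sketched in a few lines of Python. This is an illustrative sketch only: the genotype encoding (0, 1 or 2 copies of the alternate allele, None for a missing call) is an assumption, and the simple chi-square statistic shown here stands in for the exact tests that dedicated tools commonly use.

```python
def call_rate(genotypes):
    """Fraction of genotypes that are not missing (None marks a missing call)."""
    if not genotypes:
        raise ValueError("no genotypes supplied")
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing observed genotype counts with
    Hardy-Weinberg expectations (p^2, 2pq, q^2) for a biallelic variant."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotype counts supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# A sample with one missing call out of four.
rate = call_rate([0, 1, None, 2])

# Counts in perfect equilibrium versus counts with no heterozygotes at all.
balanced = hwe_chi_square(25, 50, 25)
no_hets = hwe_chi_square(50, 0, 50)
```

With one degree of freedom, a statistic above roughly 3.84 corresponds to p < 0.05, so the second variant would be flagged while the first would not.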

Additionally, 40 samples have raw genetic data available as both BAM and CRAM files, including unmapped data. The gender of the samples in the 1000 Genomes data has been matched to the synthetic phenotypic data generated for these samples. However, no other genotype/phenotype matching was done on this dataset.


RNAseq data for the CINECA eQTL use-case

445 individuals from the 1000 Genomes Phase 3 data also have RNAseq data from the Geuvadis collection, which can be used for eQTL analysis. These files were added to the dataset and linked to a set of corresponding fake samples.


The phenotypic sample data are available in the development version of BioSamples and licensed under the Creative Commons Licence (CC-BY). The genetic data was linked to the synthetic phenotype data in BioSamples and submitted to EGA as dataset UK1 (dataset ID EGAD00001006673). The 1000 Genomes data is licensed under the Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) license.


CINECA synthetic cohort Europe CH SIB

Dataset managers - Nona Naderi, Romain Tanzer, Douglas Teodoro, Jenny Copara, Luc Mottin & Patrick Ruch

This dataset consists of 6733 synthetic samples containing both phenotypic and genotypic information. The phenotypic data were derived from CoLaus and PsyCoLaus, cohorts which include data from over 6000 Caucasian individuals aged 35 to 75 years living in Lausanne, Switzerland. Although CoLaus focuses on cardiovascular disorders, and PsyCoLaus focuses on psychiatric disorders, both cohorts collect demographic, socio-economic, life-style, and clinical information from enrolled patients. 


To generate the synthetic phenotypic data, the DataSynthesizer tool was used. This tool is designed specifically for producing privacy-preserving datasets and can generate synthetic data with either random or statistically correlated distributions. To minimise the data used, 20 variables out of the 191 available in the original dataset were selected based on their relevance to the CINECA minimal metadata model. The variables included in this dataset encode the following information: age (numeric), gender (categorical numeric, woman 0, man 1), birthplace (categorical string), residence (categorical string), job type (categorical string), family and household structure (categorical numeric, alone 0, couple 1); tobacco (categorical string), alcohol use (categorical string), and physical activity (categorical string); weight (numeric), height (numeric), blood pressure (numeric) and heart rate (numeric); and diagnoses (free text, string) and prescriptions (ATC codes, string). The values were generated using a random distribution. This phenotypic synthetic dataset is open access and is fully accessible under the Creative Commons Licence (CC-BY).


The synthetic genetic data in this dataset reuses 100 samples extracted from the 1000 Genomes (release 20130502) VCF file, restricted to variants whose position is lower than 16070000. The genetic data was then randomly linked to the phenotypic data. The 1000 Genomes data is licensed under the Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) license.
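The position cut-off can be illustrated with a short Python sketch that filters the text lines of an uncompressed VCF. This is not the extraction code used for the dataset; the function name and the toy records (including the chromosome shown in them) are assumptions for illustration, and only the threshold 16070000 comes from the description above.

```python
MAX_POS = 16_070_000  # cut-off quoted in the dataset description


def filter_vcf_by_position(lines, max_pos=MAX_POS):
    """Yield header lines unchanged and keep only records whose POS < max_pos."""
    for line in lines:
        if line.startswith("#"):
            yield line  # meta-information and column header lines
            continue
        # Column 2 of a VCF record is the 1-based position.
        pos = int(line.split("\t", 2)[1])
        if pos < max_pos:
            yield line


# Toy input, made up for illustration only.
vcf_lines = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "22\t16050075\t.\tA\tG\n",
    "22\t16070000\t.\tC\tT\n",
    "22\t16080000\t.\tG\tA\n",
]
kept = list(filter_vcf_by_position(vcf_lines))
```

Because the comparison is strict, a variant located exactly at position 16070000 is excluded, matching the wording "lower than".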

The dataset can be found here.

Figure 1: The DataSynthesizer system architecture (Ping et al, DataSynthesizer: Privacy-Preserving Synthetic Datasets, 2017, Proceedings of the 29th International Conference on Scientific and Statistical Database Management)



CINECA synthetic cohort Africa H3ABioNet v1

Dataset managers - Mamana Mbiyavanga & Nicola Mulder

This dataset consists of 100 samples which have synthetic subject attributes and phenotypic data based on the Human Heredity and Health in Africa (H3Africa) consortium core phenotype model (H3Africa Core phenotype). The H3Africa initiative consists of 51 African projects, with over 70,000 participants, including population-based genomic studies of common, non-communicable disorders such as heart and renal disease, as well as communicable diseases such as tuberculosis. 

We used a modified version of the TOFU tool to generate the metadata for 100 samples. We constructed a database of fields and values using the H3Africa Core phenotype, a set of recommended questions or variables that H3Africa projects should consider when designing their data collection forms. We selected a group of 47 variables out of the 255 H3Africa Core phenotypes to provide a good overlap with the CINECA minimal metadata model. Categorical values were chosen randomly from field choices in the H3Africa Core phenotype data dictionary, and continuous values, such as age and date of birth, were randomly selected from the field ranges. For this initial version of the H3ABioNet synthetic dataset, we used the mean and median values from UK Biobank provided with TOFU to model the distribution of continuous values. More detailed information on the phenotypic data variables (description, data type, example) can be found in this document that describes the coverage with the CINECA minimal metadata model. Variables which have been harmonised with the CINECA minimal metadata model include age/birthdate, gender, ethnicity/race, height, weight, and tobacco usage. This phenotypic synthetic dataset is open access, and is fully accessible under the Creative Commons Attribution 4.0 International licence.
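The general idea of drawing values from a data dictionary can be sketched as follows. This is not the TOFU tool or its modified version: the dictionary layout, field names, choices and ranges below are hypothetical, and continuous values are drawn uniformly from the range as a simplifying assumption.

```python
import random

# Hypothetical data dictionary; fields, choices and ranges are made up.
DATA_DICTIONARY = {
    "sex": {"type": "categorical", "choices": ["female", "male"]},
    "tobacco_use": {"type": "categorical", "choices": ["never", "former", "current"]},
    "age": {"type": "continuous", "min": 18, "max": 90},
    "height_cm": {"type": "continuous", "min": 140, "max": 200},
}


def synth_record(dictionary, rng, index):
    """Build one fake record by sampling each field independently."""
    record = {"sample_id": f"fake{index}"}  # identifiers carry the 'fake' prefix
    for field, spec in dictionary.items():
        if spec["type"] == "categorical":
            record[field] = rng.choice(spec["choices"])
        else:
            record[field] = round(rng.uniform(spec["min"], spec["max"]), 1)
    return record


rng = random.Random(42)  # fixed seed so that the output is reproducible
records = [synth_record(DATA_DICTIONARY, rng, i) for i in range(1, 101)]
```

Sampling every field independently keeps the generator simple, but it also means the resulting records carry no correlation between variables.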

We used the 1000 Genomes project Phase 3 data to generate genetic data. From the 2504 samples included in the 1000 Genomes project, we randomly selected 100 samples of African ancestries. We then used BCFTools to replace the 100 sample identifiers with the ones in the metadata file. We randomly selected 650K variants in chromosome 1 using BCFTools. Because we use a Nextflow pipeline to generate this genetic data, parameters such as the chromosome and the number of variants can easily be changed via the pipeline configuration file, streamlining the whole process. The 1000 Genomes data is licensed under the Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) license. The dataset can be found here.
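The two operations described here, replacing sample identifiers and randomly subsetting variants, were performed with BCFTools inside a Nextflow pipeline. The plain Python sketch below only illustrates the idea on VCF text lines; the function names, the identifier mapping and the toy records are hypothetical and do not reproduce the pipeline's commands.

```python
import random


def rename_samples(header_line, id_map):
    """Replace sample names in the '#CHROM ...' header line of a VCF.

    The first nine columns are fixed by the VCF format; the sample
    identifiers follow from the tenth column onwards.
    """
    cols = header_line.rstrip("\n").split("\t")
    fixed, samples = cols[:9], cols[9:]
    renamed = [id_map[s] for s in samples]  # KeyError if a sample is unmapped
    return "\t".join(fixed + renamed) + "\n"


def sample_variants(records, k, seed=0):
    """Randomly pick k variant records while preserving their original order."""
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(records)), k))
    return [records[i] for i in chosen]


# Toy data: two original identifiers mapped to fake ones (all hypothetical).
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
new_header = rename_samples(header, {"S1": "fake1", "S2": "fake2"})

variants = [f"1\t{pos}\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\n" for pos in range(100, 110)]
subset = sample_variants(variants, k=4, seed=1)
```

Keeping the sampled records in their original order matters because downstream tools generally expect VCF records sorted by position.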



CINECA synthetic cohort NA Canada CHILD v1

Dataset managers - Vivian Jin, Justin Cook & Fiona Brinkman

This dataset consists of 100 variables for 150 synthetic participants, with subject attributes and phenotypic data derived from the CHILD Cohort Study and the associated CHILDdb database. CHILD is a longitudinal study following over 3400 Canadian children and their parents, reflective of Canadian demographics, in order to better predict, prevent and treat chronic diseases. Variables were chosen from CHILDdb to cover the CINECA minimal metadata model, to cover select COVID-specific and CINECA use cases, and to include particular variables that are reflective of the CHILD Cohort Study. Note that there are over 37 million datapoints in CHILDdb, so this synthetic data is a very small subset of the very diverse, deeply phenotyped CHILD data. The DataSynthesizer tool was used to generate correlated synthetic data, primarily for anthropometric variables, for the fake child data. In addition, this synthetic dataset contains uncorrelated data on the 3 populations included in the CHILD study: mother, father, and child. This results in a total of 4 synthetic datasets in Excel workbooks (each with 150 synthetic subjects): 1 dataset of correlated anthropometric variables concerning children (i.e. ensuring that height, weight, and BMI were reasonable and correlated, not just randomly chosen for a given subject), and 3 datasets of additional uncorrelated variables for children, mothers, and fathers. This phenotypic synthetic dataset is open access and is fully accessible under the Creative Commons Licence (CC-BY).
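The difference between correlated and independently drawn values can be shown with a small Python sketch. It does not use DataSynthesizer and is not the method applied to the CHILD data: the distribution parameters are arbitrary illustrative numbers, and the approach (drawing a height and a target BMI, then deriving weight) is a simplifying assumption.

```python
import random


def synth_child(rng, index):
    """Generate one fake child whose height, weight and BMI are mutually consistent."""
    # Arbitrary illustrative parameters; they are NOT taken from CHILD data.
    height_m = round(rng.gauss(1.10, 0.08), 2)
    target_bmi = rng.gauss(16.0, 1.5)
    # Weight is derived from height, so taller children tend to be heavier.
    weight_kg = round(target_bmi * height_m ** 2, 1)
    # BMI is recomputed from the rounded values so that the three fields agree.
    bmi = round(weight_kg / height_m ** 2, 1)
    return {
        "subject_id": f"fake{index}",
        "height_m": height_m,
        "weight_kg": weight_kg,
        "bmi": bmi,
    }


rng = random.Random(7)  # fixed seed for reproducibility
children = [synth_child(rng, i) for i in range(1, 151)]
```

Drawing height and weight independently would instead produce many subjects with implausible BMI values, which is the problem the correlated dataset was designed to avoid.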

The genetic data was then added, following the method used to produce CINECA synthetic cohort Europe CH SIB, and each fake CHILD subject_id was linked to a variant. The genotype data consists of a single joint-call VCF file with called genotypes for all samples, plus bed, bim, fam, and nosex files generated via PLINK for these samples and genotypes. The 1000 Genomes data is licensed under the Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) license.

The dataset can be found here.

 
Photo by Derek Finch on Unsplash
