Synthetic Cohort Datasets

The CINECA project has produced a set of Synthetic Cohort Datasets based on the phenotypic data from four of our participating cohorts: UK Biobank, CoLaus, H3Africa, and the CHILD Cohort Study. These Synthetic Cohort Datasets contain no identifiable data and cannot be used to make any inference about the ‘parent’ cohort data or results. All Synthetic Cohort Datasets are open access and fully accessible under the Creative Commons licences specified with each dataset.

They were developed by CINECA to increase accessibility to cohort data for standards development, whilst mitigating ethical and legal privacy concerns that arise with cohort data sharing, including pseudonymised data.

Synthetic cohorts aim to model the same characteristics and data fields as real human cohorts, but all data values are computer-generated. The datasets still capture the statistical properties and patterns from the original dataset, with no link whatsoever to any real participant.
The CINECA project is creating tools and interfaces that will result in a federated, cloud-enabled infrastructure for human cohort data, so the project needs cohort data to showcase and demonstrate federated research and clinical applications.
However, cohort participants’ data is sensitive, and its use must comply with the General Data Protection Regulation (GDPR) and participant consent agreements, which can slow down the pace at which these new tools are developed.

Synthetic Cohort Datasets are therefore very useful: because they are considered non-identifiable data, they can be used and shared without privacy concerns. CINECA partners considered it indispensable to develop Synthetic Cohort Datasets that have the same characteristics as a real cohort but contain no identifiable information, giving developers the freedom to use them for public cohort interoperability demonstrators.
The CINECA Synthetic Cohort Datasets are Synthetic Cohort Europe UK1, Synthetic Cohort Africa H3ABioNet, Synthetic Cohort Europe CH SIB, and Synthetic Cohort NA Canada CHILD. The designation of each Synthetic Cohort Dataset indicates the reference source of the phenotypic data, while making clear that the actual values in the dataset are entirely synthesised, having been derived from the parameters of the source dataset. All sample identifiers are prefixed with 'fake' to avoid confusion with real datasets. Each cohort was consulted throughout the process to ensure that the Synthetic Cohort Dataset was appropriate, suitably reflective of the cohort, and free of identifiable information (particularly in free-text data).
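
As an illustration of what deriving values from the parameters of a source dataset can look like, the following minimal Python sketch draws values from summary parameters and prefixes every sample identifier with 'fake'. The variable names, the parameter values, and the choice of independent normal distributions are all assumptions made for this example; it is not the procedure used by TOFU or DataSynthesizer.

```python
import random

# Hypothetical summary parameters (mean, standard deviation) for two
# numeric variables. These numbers are invented for illustration only.
PARAMETERS = {
    "height_cm": (170.0, 10.0),
    "weight_kg": (75.0, 15.0),
}


def make_synthetic_records(n, parameters, seed=0):
    """Draw n records, each variable from its own normal distribution."""
    rng = random.Random(seed)
    records = []
    for i in range(1, n + 1):
        # Identifier prefixed with 'fake' so it cannot be mistaken for real data.
        record = {"sample_id": f"fake{i}"}
        for name, (mean, sd) in parameters.items():
            record[name] = round(rng.gauss(mean, sd), 1)
        records.append(record)
    return records


records = make_synthetic_records(5, PARAMETERS)
for record in records:
    print(record)
```

Because only the summary parameters enter the generator, no individual source record is ever copied into the output.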

All cohorts reviewed and approved the first version to go public and can be consulted on any subsequent versions, which are versioned by date.

The description of each dataset in the table below includes information such as the number of samples, how the variables were selected, the tools used to generate attributed values, and where the dataset can be accessed. More information is available on each Synthetic Cohort Dataset page.


| Synthetic Cohort Dataset | Phenotypic data | Genomic data | Generated by | Publication Status |
| --- | --- | --- | --- | --- |
| Synthetic Cohort Europe UK1 | 2521 samples derived from UK Biobank, relating to cancer, diabetes and cardiac conditions | Genetic data based on 1000 Genomes data | TOFU, a tool developed in-house for generating Synthetic Cohort UK Biobank data | European Genome-phenome Archive (EGA): https://ega-archive.org/datasets/EGAD00001006673 |
| Synthetic Cohort Africa H3ABioNet | 100 samples that have synthetic subject attributes and 47 phenotypic variables based on Human Heredity and Health in Africa (H3Africa) | 1000 Genomes project phase 3 data, randomly selected 2M variants in chr 22 for 100 samples of African ancestries | Nextflow pipeline that uses a modified version of TOFU | Zenodo: https://zenodo.org/record/4955933 |
| Synthetic Cohort Europe CH SIB | 6733 samples using 21 attributes selected from the CoLaus cohort | 1000 Genomes project phase 3 data, selected 100 most-common variants in chr 22 | DataSynthesizer was used for generating both randomly and statistically correlated synthetic data | Zenodo: https://zenodo.org/record/5082689 |
| Synthetic Cohort NA Canada CHILD | 100 select variables for 150 participants, plus COVID and other key variables for CHILD | 1000 Genomes project phase 3 data, selected 100 most-common variants in chr 22 | DataSynthesizer for synthesizing correlated anthropometric variables, other variables uncorrelated | Zenodo: https://zenodo.org/record/5122832 |
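
The table distinguishes variables synthesised as correlated from variables synthesised as uncorrelated. The difference can be sketched in plain Python as follows. This is a toy illustration only: the height and weight variables, the invented "source" data, and the simple linear model are assumptions for the example, and the code does not reproduce how DataSynthesizer works internally.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def toy_source(n, rng):
    """Invented 'source' data: weight depends on height plus noise."""
    heights = [rng.gauss(170, 10) for _ in range(n)]
    weights = [0.9 * h - 80 + rng.gauss(0, 5) for h in heights]
    return heights, weights


def synth_independent(heights, weights, n, rng):
    """Sample each variable from its own marginal, ignoring the other."""
    mh, sh = statistics.fmean(heights), statistics.stdev(heights)
    mw, sw = statistics.fmean(weights), statistics.stdev(weights)
    return ([rng.gauss(mh, sh) for _ in range(n)],
            [rng.gauss(mw, sw) for _ in range(n)])


def synth_correlated(heights, weights, n, rng):
    """Sample height, then weight from a linear relation fitted to the source."""
    mh, mw = statistics.fmean(heights), statistics.fmean(weights)
    slope = (sum((h - mh) * (w - mw) for h, w in zip(heights, weights))
             / sum((h - mh) ** 2 for h in heights))
    intercept = mw - slope * mh
    residual_sd = statistics.stdev(
        [w - (intercept + slope * h) for h, w in zip(heights, weights)])
    new_h = [rng.gauss(mh, statistics.stdev(heights)) for _ in range(n)]
    new_w = [intercept + slope * h + rng.gauss(0, residual_sd) for h in new_h]
    return new_h, new_w


rng = random.Random(42)
source_h, source_w = toy_source(2000, rng)
ind_h, ind_w = synth_independent(source_h, source_w, 2000, rng)
cor_h, cor_w = synth_correlated(source_h, source_w, 2000, rng)

r_source = pearson(source_h, source_w)
r_independent = pearson(ind_h, ind_w)
r_correlated = pearson(cor_h, cor_w)
print(f"source r={r_source:.2f}, independent r={r_independent:.2f}, "
      f"correlated r={r_correlated:.2f}")
```

In the independent mode each variable keeps a realistic distribution on its own, but the relationship between variables is lost; the correlated mode preserves that relationship at the cost of modelling it explicitly.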

Phenotypic Data

The phenotype variables selected for each dataset were chosen to reflect a selection of the existing variables for the respective parent cohort, with particular reference to select use cases and a minimal metadata model that was developed by CINECA to support querying across jurisdictions for suitable dataset discovery.

The reference version of the metadata model was based on a manual review of common fields in cohorts participating in the Maelstrom Catalogue. The model was then converted to Genomics Cohorts Knowledge Ontology (GECKO) format, and further expanded with additional terms to cover the three main CINECA use cases (cancer, cardiovascular disease, and diabetes mellitus), plus COVID-19 search terms.

Genotypic Data

The genotype data was all derived from the 1000 Genomes project Phase 3 release. As the 1000 Genomes data is fully public, it was used to minimise privacy intrusion while maximising the utility of the synthesised datasets by including a range of file and data types. The 1000 Genomes data is licensed under the Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) license.
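
The table above describes two ways of subsetting chromosome 22 variants: picking the most common ones and picking at random. A minimal Python sketch of both follows. It assumes that "most common" means highest allele frequency, and it uses invented variant records with hypothetical field names (`id`, `pos`, `af`); the actual selection scripts used for the datasets are not shown in this document.

```python
import heapq
import random

# Invented variant records for illustration; not real 1000 Genomes entries.
variants = [
    {"id": "var_a", "pos": 100, "af": 0.42},
    {"id": "var_b", "pos": 250, "af": 0.07},
    {"id": "var_c", "pos": 300, "af": 0.61},
    {"id": "var_d", "pos": 410, "af": 0.15},
]


def most_common_variants(records, n):
    """Return the n records with the highest allele frequency (assumed meaning of 'most common')."""
    return heapq.nlargest(n, records, key=lambda v: v["af"])


def random_variants(records, n, seed=0):
    """Return n records chosen uniformly at random without replacement."""
    return random.Random(seed).sample(records, n)


top_two = most_common_variants(variants, 2)
random_two = random_variants(variants, 2, seed=1)
print([v["id"] for v in top_two])
print([v["id"] for v in random_two])
```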

A paper with a detailed description of the development of the Synthetic Cohort Datasets is in preparation.