Synthetic Cohort Datasets
The Synthetic Cohort Datasets were developed by CINECA to make cohort data more accessible for standards development, whilst mitigating the ethical and legal privacy concerns that arise when cohort data, including pseudonymised data, are shared.
Cohort participants’ data are sensitive, and their use must comply with the General Data Protection Regulation (GDPR) and with participant consent agreements, which can slow the pace at which new tools are developed.
Synthetic Cohort Datasets are therefore very useful: because they are considered non-identifiable data, they can be used and shared without the privacy concerns attached to real cohort data. CINECA partners considered it indispensable to develop synthetic datasets that have the same characteristics as a real cohort but contain no identifiable information, giving developers the freedom to use them in public cohort interoperability demonstrators.
All parent cohorts reviewed and approved the first version for public release, and they can be consulted on any subsequent versions, which are versioned by date.
The description of each dataset in the table below includes information such as the number of samples, how the variables were selected, the tools used to generate attribute values, and where the dataset can be accessed. More information is available on each Synthetic Cohort Dataset page.
| Synthetic Cohort Dataset (parent cohort) | Phenotypic data | Genomic data | Generated by | Publication Status |
|---|---|---|---|---|
| UKBiobank | 2521 samples derived from UKBiobank, relating to cancer, diabetes and cardiac disease | Genetic data based on 1000 Genomes data | TOFU, a tool developed in-house for generating synthetic UKBiobank cohort data | European Genome-phenome Archive (EGA): https://ega-archive.org/datasets/EGAD00001006673 |
| Human Heredity and Health in Africa | 100 samples with synthetic subject attributes and 47 phenotypic variables based on the Human Heredity and Health in Africa cohort | 1000 Genomes project phase 3 data; 2M randomly selected variants in chr 22 for 100 samples of African ancestries | Nextflow pipeline that uses a modified version of TOFU | |
| CoLaus | 6733 samples using 21 attributes selected from the CoLaus cohort | 1000 Genomes project phase 3 data; the 100 most common variants in chr 22 | DataSynthesizer, used to generate both random and statistically correlated synthetic data | |
| CHILD | 100 selected variables for 150 participants, plus COVID and other key variables for CHILD | 1000 Genomes project phase 3 data; the 100 most common variants in chr 22 | DataSynthesizer, used to synthesise correlated anthropometric variables; other variables are uncorrelated | |
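The table distinguishes synthetic variables that are generated independently from those that are statistically correlated. The difference can be illustrated with a minimal sketch in plain Python. It does not use DataSynthesizer or TOFU and is not the method used for the CINECA datasets; the toy cohort, the column names (`height_band`, `weight_band`) and the helper functions are all hypothetical.

```python
import random
from collections import defaultdict


def sample_independent(records, columns, n, rng):
    """Sample each column from its own marginal distribution.

    Values are drawn per column, so relationships between columns are lost.
    """
    marginals = {c: [r[c] for r in records] for c in columns}
    return [{c: rng.choice(marginals[c]) for c in columns} for _ in range(n)]


def sample_correlated(records, parent, child, n, rng):
    """Sample `parent` from its marginal, then `child` given the sampled parent.

    This keeps the pairwise dependency between the two columns.
    """
    by_parent = defaultdict(list)
    for r in records:
        by_parent[r[parent]].append(r[child])
    parents = [r[parent] for r in records]
    rows = []
    for _ in range(n):
        p = rng.choice(parents)
        rows.append({parent: p, child: rng.choice(by_parent[p])})
    return rows


def share_matching(rows):
    """Fraction of rows pairing short with light or tall with heavy."""
    match = {("short", "light"), ("tall", "heavy")}
    return sum((r["height_band"], r["weight_band"]) in match for r in rows) / len(rows)


# Hypothetical toy cohort in which weight band depends strongly on height band.
toy_cohort = (
    [{"height_band": "short", "weight_band": "light"}] * 45
    + [{"height_band": "short", "weight_band": "heavy"}] * 5
    + [{"height_band": "tall", "weight_band": "light"}] * 5
    + [{"height_band": "tall", "weight_band": "heavy"}] * 45
)

rng = random.Random(0)
independent = sample_independent(toy_cohort, ["height_band", "weight_band"], 1000, rng)
correlated = sample_correlated(toy_cohort, "height_band", "weight_band", 1000, rng)

print("matching share, independent:", share_matching(independent))
print("matching share, correlated: ", share_matching(correlated))
```

In the toy cohort 90% of rows pair short with light or tall with heavy. The correlated sampler should reproduce a share close to that, whereas the independent sampler should drift towards one half, because it discards the relationship between the two columns.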
Phenotypic Data
The phenotype variables for each dataset were chosen to reflect a selection of the existing variables in the respective parent cohort, with particular reference to selected use cases and to a minimal metadata model that CINECA developed to support querying across jurisdictions for suitable dataset discovery. The reference version of the metadata model was based on a manual review of common fields in cohorts participating in the Maelstrom Catalogue. The model was then converted to Genomics Cohorts Knowledge Ontology (GECKO) format and further expanded with additional terms to cover the three main CINECA use cases (cancer, cardiovascular disease, and diabetes mellitus), plus COVID-19 search terms.
Genotypic Data
The genotype data were all derived from the 1000 Genomes project Phase 3 release. As the 1000 Genomes data are fully public, they were used to minimise privacy intrusion while maximising the utility of the synthesised datasets by including a range of file and data types. The 1000 Genomes data are licensed under the Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) license.
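The table describes two ways of subsetting chr 22 variants from the 1000 Genomes data: taking the most common variants and taking a random selection. The exact procedure is not described here, so the following is only a sketch under stated assumptions: "most common" is read as highest alternate allele frequency in the `AF` INFO field, the VCF records and positions are made up for illustration, and the function names are hypothetical.

```python
import random


def parse_variants(vcf_lines):
    """Yield (variant_label, allele_frequency) for each VCF data line.

    Assumes every record carries an AF key in its INFO column.
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
        info_map = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        # Multi-allelic sites list one AF per ALT allele; use the first for simplicity.
        af = float(info_map["AF"].split(",")[0])
        yield f"{chrom}:{pos}:{ref}>{alt}", af


def most_common_variants(vcf_lines, n):
    """Return the n variants with the highest allele frequency."""
    ranked = sorted(parse_variants(vcf_lines), key=lambda v: v[1], reverse=True)
    return ranked[:n]


def random_variants(vcf_lines, n, seed=0):
    """Return n variants drawn uniformly at random without replacement."""
    variants = list(parse_variants(vcf_lines))
    return random.Random(seed).sample(variants, min(n, len(variants)))


# Hypothetical records; positions and frequencies are illustrative only.
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "22\t100\t.\tA\tG\t100\tPASS\tAC=1;AF=0.01",
    "22\t200\t.\tC\tT\t100\tPASS\tAC=90;AF=0.45",
    "22\t300\t.\tG\tA,C\t100\tPASS\tAC=10,2;AF=0.05,0.01",
    "22\t400\t.\tT\tC\t100\tPASS\tAC=60;AF=0.3",
]

top_two = most_common_variants(toy_vcf, 2)
random_two = random_variants(toy_vcf, 2)
print(top_two)
print(random_two)
```

For files the size of a full chromosome, a dedicated tool or a streaming approach would be more appropriate than sorting every record in memory; the sketch only shows the selection logic.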
A paper with a detailed description of the development of the Synthetic Cohort Datasets is in preparation.