CINECA synthetic cohort Europe CH SIB

Dataset managers: Nona Naderi, Romain Tanzer, Douglas Teodoro, Jenny Copara, Luc Mottin & Patrick Ruch

This dataset consists of 6733 synthetic samples containing both phenotypic and genotypic information. The phenotypic data were derived from CoLaus and PsyCoLaus, two cohorts that include data from over 6000 Caucasian individuals aged 35 to 75 years living in Lausanne, Switzerland. Although CoLaus focuses on cardiovascular disorders and PsyCoLaus on psychiatric disorders, both cohorts collect demographic, socio-economic, lifestyle, and clinical information from enrolled patients.

The synthetic phenotypic data were generated with the DataSynthesizer tool. This tool is specifically designed for producing privacy-preserving datasets and can generate synthetic data from either random or statistically correlated distributions. To minimise the data used, 20 variables out of the 191 available in the original dataset were selected based on their relevance to the CINECA minimal metadata model. The variables included in this dataset encode the following information: age (numeric), gender (categorical numeric, woman 0, man 1), birthplace (categorical string), residence (categorical string), job type (categorical string), family and household structure (categorical numeric, alone 0, couple 1); tobacco (categorical string), alcohol use (categorical string), and physical activity (categorical string); weight (numeric), height (numeric), blood pressure (numeric) and heart rate (numeric); and diagnoses (free text, string) and prescriptions (ATC codes, string). The values of these variables were generated using a random distribution. This synthetic phenotypic dataset is open access and is available under the Creative Commons Attribution licence (CC-BY).
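
As an illustration of the numeric category codes listed above, the following minimal Python sketch decodes them into labels. The column names "gender" and "household" and the example values are hypothetical; the actual column names in the distributed files may differ. Only the code-to-label mappings come from the description above.

```python
# Code-to-label mappings taken from the variable description above.
GENDER_LABELS = {0: "woman", 1: "man"}
HOUSEHOLD_LABELS = {0: "alone", 1: "couple"}


def decode_record(record):
    """Return a copy of `record` with numeric category codes replaced by labels.

    The keys "gender" and "household" are hypothetical column names,
    chosen here for illustration only.
    """
    decoded = dict(record)
    # int() lets the function accept codes read from a text file as strings.
    decoded["gender"] = GENDER_LABELS[int(record["gender"])]
    decoded["household"] = HOUSEHOLD_LABELS[int(record["household"])]
    return decoded


if __name__ == "__main__":
    example = {"age": 52, "gender": 0, "household": 1}  # made-up values
    print(decode_record(example))
```

An unknown code raises a KeyError rather than being silently passed through, which makes unexpected values easy to spot.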

The synthetic genetic data in this dataset re-use 100 samples extracted from the 1000 Genomes (release 20130502) VCF file, restricted to variants whose position is lower than 16070000. The genetic data were then randomly linked to the phenotypic data. The 1000 Genomes data are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) licence.
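
The two steps described above (position filtering and random linking) could be sketched in Python roughly as follows. This is not the procedure actually used to build the dataset, which is not documented here. The function names, the identifiers, and the fixed seed are hypothetical, and the sketch assumes that each phenotypic record is assigned one of the genetic samples at random with re-use, since there are more phenotypic records than genetic samples.

```python
import random

POSITION_CUTOFF = 16070000  # position threshold stated in the description above


def filter_vcf_lines(lines, cutoff=POSITION_CUTOFF):
    """Keep VCF header lines and variant records whose POS is below `cutoff`."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if int(fields[1]) < cutoff:  # POS is the second VCF column
            kept.append(line)
    return kept


def link_randomly(phenotype_ids, genotype_ids, seed=0):
    """Assign each phenotypic record one genetic sample, chosen at random.

    Assumption: genetic samples may be re-used across phenotypic records.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return {pid: rng.choice(genotype_ids) for pid in phenotype_ids}
```

For example, link_randomly(["P1", "P2", "P3"], ["G1", "G2"]) returns a dictionary that maps each of the three hypothetical phenotypic identifiers to either "G1" or "G2".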

The dataset can be found here.