CINECA synthetic cohort NA Canada CHILD v1
Dataset Managers: Vivian Jin, Justin Cook & Fiona Brinkman
This dataset consists of 100 variables for 150 synthetic participants, with subject attributes and phenotypic data derived from the CHILD Cohort Study and the associated CHILDdb database. CHILD is a longitudinal study following over 3400 Canadian children and their parents, reflective of Canadian demographics, in order to better predict, prevent and treat chronic diseases. Variables were chosen from CHILDdb to cover the CINECA minimal metadata model, to support select COVID-specific and CINECA use cases, and to include variables that are characteristic of the CHILD Cohort Study. Note that there are over 37 million datapoints in CHILDdb, so this synthetic data is a very small subset of the very diverse, deeply phenotyped CHILD data.

The Data Synthesizer tool was used to generate correlated synthetic data, primarily for anthropometric variables, for the synthetic children. In addition, this synthetic dataset contains uncorrelated data on the 3 populations included in the CHILD study: mother, father, and child. This results in a total of 4 synthetic datasets in Excel workbooks (each with 150 synthetic subjects): 1 dataset of correlated anthropometric variables concerning children (i.e. ensuring that height, weight, and BMI were reasonable and correlated, not just randomly chosen for a given subject), and 3 datasets of additional uncorrelated variables for children, mothers, and fathers. This phenotypic synthetic dataset is open access and is fully accessible under the Creative Commons Licence (CC-BY).
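As an illustration of what "correlated" means for height, weight, and BMI, the sketch below recomputes BMI from height and weight (BMI = weight in kg divided by height in metres squared) and flags rows where the stored value disagrees. The column names (subject_id, height_cm, weight_kg, bmi), the units, the tolerance, and the demo rows are all hypothetical; the actual workbook headers and units should be checked before adapting it.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight (kg) divided by height (m) squared."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def inconsistent_rows(rows, tolerance=0.5):
    """Return the IDs of rows whose stored BMI differs from the recomputed
    BMI by more than `tolerance` (an arbitrary, illustrative threshold)."""
    flagged = []
    for row in rows:
        expected = bmi(row["weight_kg"], row["height_cm"])
        if abs(expected - row["bmi"]) > tolerance:
            flagged.append(row["subject_id"])
    return flagged


# Hypothetical rows, not taken from the dataset.
demo_rows = [
    {"subject_id": "S001", "height_cm": 110.0, "weight_kg": 19.0, "bmi": 15.7},
    {"subject_id": "S002", "height_cm": 110.0, "weight_kg": 19.0, "bmi": 25.0},
]
demo_flagged = inconsistent_rows(demo_rows)
print(demo_flagged)
```

In the correlated children's workbook, a check of this kind would be expected to flag few or no rows. For variables generated independently of one another, no such agreement should be assumed.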
The genetic data was then added, following the method used to produce CINECA synthetic cohort Europe CH SIB, and each synthetic CHILD subject_id was then linked to a variant. The genotype data consists of a single joint-call VCF file with called genotypes for all samples, plus bed, bim, fam, and nosex files generated via plink for these samples and genotypes. The 1000 Genomes data is licensed under the Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) license.
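A minimal sketch of how the plink sample file might be cross-checked against the phenotype tables follows. It relies only on the standard plink .fam layout (six whitespace-separated columns: family ID, within-family sample ID, father ID, mother ID, sex code, phenotype). The assumption that the phenotype subject_id values appear as the sample ID in the .fam file is ours and is not stated above, and the IDs in the demo are made up.

```python
import io

# Column order of a plink .fam file.
FAM_COLUMNS = ["family_id", "sample_id", "father_id", "mother_id", "sex", "phenotype"]


def read_fam(handle):
    """Parse an open .fam file into a list of dicts, one per sample."""
    samples = []
    for line_number, line in enumerate(handle, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(FAM_COLUMNS):
            raise ValueError(
                f"line {line_number}: expected 6 fields, got {len(fields)}"
            )
        samples.append(dict(zip(FAM_COLUMNS, fields)))
    return samples


def unmatched_subjects(samples, phenotype_ids):
    """Return phenotype IDs that have no matching sample ID in the .fam data."""
    genotype_ids = {sample["sample_id"] for sample in samples}
    return sorted(set(phenotype_ids) - genotype_ids)


# Hypothetical content standing in for the real .fam file.
demo_fam = io.StringIO("F1 S001 0 0 0 -9\nF2 S002 0 0 0 -9\n")
demo_samples = read_fam(demo_fam)
demo_missing = unmatched_subjects(demo_samples, ["S001", "S002", "S003"])
print(len(demo_samples), demo_missing)
```

With the real files, the handle would come from opening the .fam file in text mode, and the number of parsed samples could be compared with the 150 synthetic subjects described above.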
The dataset can be found here.