CINECA synthetic cohort Africa H3ABioNet v1

Dataset managers: Mamana Mbiyavanga & Nicola Mulder

This dataset consists of 100 samples with synthetic subject attributes and phenotypic data based on the Human Heredity and Health in Africa (H3Africa) consortium core phenotype model (H3Africa Core phenotype). The H3Africa initiative consists of 51 African projects, with over 70,000 participants, including population-based genomic studies of common, non-communicable disorders such as heart and renal disease, as well as communicable diseases such as tuberculosis.

We used a modified version of the TOFU tool to generate the metadata for the 100 samples. We constructed a database of fields and values using the H3Africa Core phenotype, a set of recommended questions or variables that H3Africa projects should consider when designing their data collection forms. We selected 47 of the 255 H3Africa Core phenotype variables to provide a good overlap with the CINECA minimal metadata model. Categorical values were chosen randomly from the field choices in the H3Africa Core phenotype data dictionary, and continuous values, such as age and date of birth, were randomly selected from the field ranges. For this initial version of the H3ABioNet synthetic dataset, we used the mean and median values from UK Biobank provided with TOFU to model the distribution of continuous values. More detailed information on the phenotypic data variables (description, data type, example) can be found in this document, which describes the coverage of the CINECA minimal metadata model. Variables that have been harmonised with the CINECA minimal metadata model include age/birthdate, gender, ethnicity/race, height, weight, and tobacco usage. This synthetic phenotypic dataset is open access and is available under the Creative Commons Attribution 4.0 International licence.
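The sampling step described above can be illustrated with a minimal Python sketch. It is not the TOFU tool and does not reproduce the UK Biobank-based distribution of continuous values: it simply draws categorical values from a list of choices and continuous values uniformly from a range. The field names, choices, ranges, and the `SYN0001`-style identifiers are hypothetical and are not taken from the H3Africa Core phenotype data dictionary.

```python
import random

# Hypothetical excerpt of a data dictionary. Field names, choices and ranges
# are illustrative assumptions, not the real H3Africa Core phenotype entries.
DATA_DICTIONARY = {
    "sex": {"type": "categorical", "choices": ["male", "female"]},
    "tobacco_use": {"type": "categorical", "choices": ["never", "former", "current"]},
    "age_years": {"type": "continuous", "range": (18, 90)},
    "height_cm": {"type": "continuous", "range": (140.0, 200.0)},
}


def draw_value(spec, rng):
    """Draw one value: a random choice, or a uniform draw from the field range."""
    if spec["type"] == "categorical":
        return rng.choice(spec["choices"])
    low, high = spec["range"]
    if isinstance(low, int) and isinstance(high, int):
        return rng.randint(low, high)  # inclusive integer range
    return round(rng.uniform(low, high), 1)


def generate_records(dictionary, n_samples, seed=0):
    """Generate n_samples synthetic records with one value per dictionary field."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    records = []
    for i in range(1, n_samples + 1):
        record = {"subject_id": f"SYN{i:04d}"}  # hypothetical identifier scheme
        for field, spec in dictionary.items():
            record[field] = draw_value(spec, rng)
        records.append(record)
    return records


records = generate_records(DATA_DICTIONARY, n_samples=100)
```

Seeding the generator keeps the synthetic cohort reproducible between runs, which is convenient when the same identifiers must later be attached to genetic data.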

We used the 1000 Genomes project phase 3 data to generate the genetic data. From the 2504 samples included in the 1000 Genomes project, we randomly selected 100 samples of African ancestries. We then used BCFtools to replace the 100 sample identifiers with the ones in the metadata file, and to randomly select 650K variants on chromosome 1. Because the genetic data is generated with a Nextflow pipeline, parameters such as the chromosome and the number of variants can easily be changed via the pipeline configuration file, which streamlines the whole process. The 1000 Genomes data is licensed under the Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) licence. The dataset can be found here.
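The selection logic described above (sample subsetting, identifier replacement, and variant subsetting) can be sketched in plain Python. This is an illustration under stated assumptions, not the actual Nextflow pipeline: the toy panel, the `SAMPLE000`-style and `SYN0001`-style identifiers, and the variant positions are hypothetical, and the sizes are scaled down from 100 samples and 650K variants. The "old new" pair layout produced by `build_rename_lines` is intended to match the format that `bcftools reheader --samples` accepts for renaming.

```python
import random


def select_samples(panel, super_pop, n, rng):
    """Randomly pick n sample IDs whose super-population label matches super_pop."""
    candidates = sorted(s for s, sp in panel.items() if sp == super_pop)
    if len(candidates) < n:
        raise ValueError(f"only {len(candidates)} samples available for {super_pop}")
    return rng.sample(candidates, n)


def build_rename_lines(selected, new_ids):
    """Pair each selected sample ID with a metadata ID as 'old new' lines."""
    if len(selected) != len(new_ids):
        raise ValueError("selected samples and new identifiers differ in length")
    return [f"{old} {new}" for old, new in zip(selected, new_ids)]


def select_variant_positions(positions, n, rng):
    """Randomly keep n variant positions, returned in ascending order."""
    return sorted(rng.sample(positions, n))


rng = random.Random(42)  # fixed seed for reproducibility

# Toy stand-in for a sample panel: sample ID -> super-population code.
panel = {f"SAMPLE{i:03d}": ("AFR" if i % 2 == 0 else "EUR") for i in range(20)}

selected = select_samples(panel, "AFR", 5, rng)
new_ids = [f"SYN{i:04d}" for i in range(1, 6)]  # hypothetical metadata IDs
rename_lines = build_rename_lines(selected, new_ids)

# Toy stand-in for the variant positions on one chromosome.
positions = list(range(1000, 2000, 10))
kept_positions = select_variant_positions(positions, 10, rng)
```

In a pipeline, values such as the super-population code, the number of samples, and the number of variants would typically come from the configuration file rather than being hard-coded as they are here.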