CINECA Virtual Platform
Data Harmonisation
To support the discovery and analysis of human cohort genomic and other omics data across jurisdictions, basic data such as cohort participants' demographics, diseases, and medications need to be harmonised. Individual cohorts are constrained by size, ancestral origins, and geographic boundaries, which limits the subgroups, exposures, outcomes, and interactions that can be examined. Combining data across large cohorts to address questions that none of them can answer alone enhances the value of each and leverages the enormous investments already made in them to address pressing questions in global health.
CINECA has addressed the metadata representation needs for cohort aggregate and individual data across studies and over time. It has developed a metadata model, created a workflow for semantic harmonisation, and built a system to generate metadata from unstructured dataset and data item descriptions. It has also collaborated in the creation of a machine-readable consent ontology and in the development of the CINECA Synthetic Datasets.
-
To harmonise cohort metadata, the CINECA project created GECKO, an ontology model that complies with semantic standards and is used for the consistent representation of cohort metadata. Combined with other tools such as OxO (an ontology cross-reference tool) and Zooma (an ontology annotation tool), it forms a harmonisation pipeline that has been used successfully to harmonise the CINECA cohorts.
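The mapping step of such a pipeline can be sketched as a lookup from cohort-specific variable labels to shared terms. This is a minimal illustration under stated assumptions, not the CINECA pipeline: the `EX:` identifiers, the `HARMONISED_TERMS` table, and the `harmonise_labels` function are hypothetical, and a real workflow would obtain candidate annotations from an annotation tool rather than from a hand-written dictionary.

```python
# Hypothetical lookup: normalised label -> (term ID, preferred label).
# The EX: identifiers are placeholders, not real GECKO terms.
HARMONISED_TERMS = {
    "sex": ("EX:0001", "sex"),
    "sex of participant": ("EX:0001", "sex"),
    "age": ("EX:0002", "age"),
    "age at enrolment": ("EX:0002", "age"),
    "bmi": ("EX:0003", "body mass index"),
    "body mass index": ("EX:0003", "body mass index"),
}


def normalise(label: str) -> str:
    """Lower-case a label and collapse separators so variants compare equal."""
    return " ".join(label.lower().replace("_", " ").replace("-", " ").split())


def harmonise_labels(labels):
    """Split cohort labels into mapped terms and labels needing manual review."""
    mapped, unmapped = {}, []
    for label in labels:
        term = HARMONISED_TERMS.get(normalise(label))
        if term is None:
            unmapped.append(label)
        else:
            mapped[label] = term
    return mapped, unmapped


# Two hypothetical cohorts that name the same concepts differently.
cohort_a = ["Sex", "AGE_AT_ENROLMENT", "bmi"]
cohort_b = ["sex_of_participant", "Age", "Body-Mass-Index", "shoe_size"]

mapped_a, unmapped_a = harmonise_labels(cohort_a)
mapped_b, unmapped_b = harmonise_labels(cohort_b)
```

Once both cohorts point at the same term identifiers, their variables can be compared directly. Labels that find no match (here `shoe_size`) are returned separately so that a curator can review them.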
Links:
GECKO repository and Ontology Search site
More detailed information available HERE
-
CINECA has developed a method to extract standardised concepts from the unstructured and partially structured fields present in the cohorts' data. This work resulted in the CINECA Text Mining Aggregate API, which provides ontology terms for free text.
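The idea of turning free text into ontology terms can be illustrated with a simple dictionary matcher. This is only a sketch under stated assumptions: it does not call or reproduce the CINECA Text Mining Aggregate API, and the `VOCABULARY` table, the `EX:` identifiers, and the `extract_terms` function are hypothetical.

```python
import re

# Hypothetical vocabulary: surface phrase -> placeholder term ID.
VOCABULARY = {
    "type 2 diabetes": "EX:1001",
    "diabetes": "EX:1000",
    "hypertension": "EX:1002",
    "asthma": "EX:1003",
}


def extract_terms(text: str):
    """Return (phrase, term ID) pairs found in text, trying longer phrases first."""
    remaining = text.lower()
    found = []
    for phrase in sorted(VOCABULARY, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, remaining):
            found.append((phrase, VOCABULARY[phrase]))
            # Blank out the match so a shorter phrase inside it is not reported too.
            remaining = re.sub(pattern, " ", remaining)
    return found


description = "Self-reported history of Type 2 diabetes and hypertension at baseline"
terms = extract_terms(description)
```

Matching longer phrases first keeps the more specific concept ("type 2 diabetes") from also being reported as its more general parent ("diabetes").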
Links:
CINECA Text Mining Aggregate API
More detailed information available HERE
-
CINECA has been involved in the development of the Data Use Ontology (DUO), a GA4GH-approved technical standard whose codes are used to ease Data Access Committees' review of data access requests.
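How machine-readable codes can ease such a review is sketched below. The three DUO identifiers are real DUO terms, but the matching rule, the purpose labels, and the `request_is_compatible` function are simplified assumptions for illustration and not the official DUO semantics or any Data Access Committee's procedure.

```python
# Real DUO identifiers; everything else in this block is a simplified assumption.
GRU = "DUO:0000042"  # general research use
HMB = "DUO:0000006"  # health or medical or biomedical research
DS = "DUO:0000007"   # disease specific research


def request_is_compatible(dataset_code, purpose, dataset_disease=None, request_disease=None):
    """Return True when a request's purpose fits the dataset's DUO permission.

    purpose is one of the hypothetical labels "general", "biomedical", "disease".
    """
    if dataset_code == GRU:
        # General research use: any research purpose is accepted.
        return purpose in {"general", "biomedical", "disease"}
    if dataset_code == HMB:
        # Biomedical use covers disease research but not general research.
        return purpose in {"biomedical", "disease"}
    if dataset_code == DS:
        # Disease-specific use: the request must target the same disease.
        return (
            purpose == "disease"
            and dataset_disease is not None
            and dataset_disease == request_disease
        )
    # Unknown codes are not auto-approved and would go to manual review.
    return False


ok_general = request_is_compatible(GRU, "general")
ok_cross_disease = request_is_compatible(DS, "disease", "asthma", "diabetes")
```

A check of this kind can pre-sort requests, leaving only the unclear cases for manual review by the committee.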
Links:
GA4GH DUO site and repository
More detailed information available HERE
-
Synthetic datasets model the same characteristics and data fields as real cohort datasets, but their values are computer-generated. They can be freely used for developing tools or providing training, as they behave like real cohort datasets without containing any identifiable information.
The CINECA project has developed four synthetic cohort datasets based on CINECA cohorts.
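The principle of keeping the data fields while generating the values can be shown with a small sketch. It is an illustration under stated assumptions, not the method used to build the CINECA Synthetic Datasets: the field names, value ranges, and the `generate_cohort` function are hypothetical, and each value is drawn independently rather than from distributions fitted to a real cohort.

```python
import random


def make_record(rng, participant_number):
    """Draw one synthetic participant using a hypothetical schema and ranges."""
    return {
        "participant_id": f"SYN-{participant_number:05d}",
        "sex": rng.choice(["female", "male"]),
        "age": rng.randint(18, 90),
        "bmi": round(rng.uniform(16.0, 40.0), 1),
        "smoker": rng.random() < 0.2,
    }


def generate_cohort(size, seed=0):
    """Generate a reproducible list of synthetic participant records."""
    rng = random.Random(seed)  # a fixed seed makes the dataset repeatable
    return [make_record(rng, number) for number in range(1, size + 1)]


cohort = generate_cohort(5, seed=42)
```

Because no record is derived from a real person, a dataset produced this way can be shared openly for tool development and training.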
Links:
Information on CINECA Synthetic Datasets
Synthetic Cohort Europe UK1 repository
Synthetic Cohort Europe CH SIB repository
Synthetic Cohort Africa H3ABioNet repository
Synthetic Cohort NA Canada CHILD repository