CINECA Virtual Platform

Data Harmonisation

To support the discovery and analysis of human cohort genomic and other omic data across jurisdictions, basic data such as cohort participants’ demographics, diseases, and medication need to be harmonised. Individual cohorts are constrained by size, ancestral origins, and geographic boundaries, which limit the subgroups, exposures, outcomes, and interactions that can be examined. Combining data across large cohorts to address questions that none of them can answer alone enhances the value of each and leverages the enormous investments already made in them to address pressing questions in global health.

CINECA has addressed the metadata representation needs for aggregate and individual cohort data across studies and over time. It has developed a metadata model, created a workflow for semantic harmonisation, and built a system to generate metadata from unstructured dataset and data item descriptions. It has also collaborated on the creation of a machine-readable consent ontology and on the development of the CINECA Synthetic Datasets.

  • To harmonise cohort metadata, the CINECA project created GECKO, a model ontology compliant with semantic standards that is used for consistent representation of cohort metadata. Combined with other tools such as OxO (an ontology cross-reference tool) and Zooma (an ontology annotation tool), it forms a harmonisation pipeline that has been used successfully to harmonise the CINECA cohorts.

  • CINECA has developed a method to extract standardised concepts from the unstructured and partially structured fields in the cohorts' data. This work resulted in the CINECA Text Mining Aggregate API, which provides ontology terms from free text.

  • CINECA has been involved in the development of the Data Use Ontology (DUO), a GA4GH-approved technical standard whose codes ease Data Access Committees’ review of data access requests.

  • Synthetic datasets model the same characteristics and data fields as real cohort datasets, but their values are computer generated. They can be used freely for developing tools or delivering training, because they behave like real cohort datasets without carrying any identifiable information.

    The CINECA project has developed four synthetic cohort datasets based on CINECA cohorts.

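To illustrate the idea behind synthetic datasets (real field structure, computer-generated values), here is a minimal Python sketch. It is not the method used to build the CINECA Synthetic Datasets. The field names, value ranges, and the `make_synthetic_cohort` function are hypothetical, and each value is drawn independently, whereas a realistic synthetic cohort would be modelled on the distributions and correlations of a real one.

```python
import random

# Hypothetical schema: field names and value ranges are illustrative only
# and are not taken from any CINECA cohort.
SEXES = ["female", "male"]
SMOKING_STATUSES = ["never", "former", "current"]


def make_synthetic_cohort(n, seed=0):
    """Return n computer-generated participant records sharing one schema."""
    rng = random.Random(seed)  # seeded so the same dataset can be regenerated
    records = []
    for i in range(n):
        records.append({
            "participant_id": f"SYN{i:05d}",        # synthetic ID, no link to a real person
            "sex": rng.choice(SEXES),
            "age": rng.randint(18, 90),             # inclusive bounds
            "bmi": round(rng.gauss(26.0, 4.0), 1),  # assumed mean and spread
            "smoking_status": rng.choice(SMOKING_STATUSES),
        })
    return records


if __name__ == "__main__":
    for row in make_synthetic_cohort(3):
        print(row)
```

Because no value is derived from a real participant, records like these can be shared freely for tool development or training, which is the property the synthetic datasets are meant to provide.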