ADDRESSING CHALLENGE 3: HARMONISED COHORT LEVEL METADATA
Harmonisation of cohort metadata is essential to performing meaningful cross-cohort analyses because it increases the interoperability of diverse sets of data. We will develop a common minimal metadata model by exploring the semantic coding systems of the project cohorts, and design ontology mapping strategies to produce a machine-readable consent ontology.
CINECA aims to accelerate disease research and improve health by facilitating transcontinental human data exchange, empowering researchers by streamlining access to larger combined cohorts. Currently, this is challenging because cohorts capture a diverse range of attributes and represent them heterogeneously. An essential first step is therefore to integrate the common variables used by the project cohorts in their data catalogues to generate harmonised metadata. To achieve this, WP3 built a minimal metadata model, which was formalised as an ontology and subsequently used for semantic harmonisation. This was done in collaboration with other projects, reviewing and engaging with existing resources. Finally, in addition to cohort metadata, WP3 also formalises data access conditions for cohort datasets.
Minimal metadata model
Attributes representing common key concepts, which are often represented heterogeneously across cohorts, were identified and harmonised. This minimal metadata model was further enriched with selected common metadata variables collected by major data catalogues (e.g. Maelstrom), and developed to reflect the use case requirements of the researchers (WP4 and WP5). The WP1 platform uses the harmonised and formalised metadata model for queries across the cohorts, enabling researchers to discover samples, patient data, and cohorts of interest for their research questions; results from those queries in turn help to further refine the metadata model.
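As a rough illustration of what harmonising heterogeneously named attributes can mean in practice, the sketch below renames cohort-specific variables to shared concept names and then asks which cohorts capture a given concept. The cohort names, variable names and concept names are all hypothetical and are not taken from the CINECA model.

```python
# Hypothetical per-cohort mappings from local variable names to
# harmonised concept names (illustrative only, not the CINECA model).
MAPPINGS = {
    "cohort_a": {"sex_at_birth": "sex", "age_yrs": "age", "bmi_kg_m2": "bmi"},
    "cohort_b": {"gender": "sex", "AGE": "age"},
}


def harmonise(cohort, record):
    """Rename the mapped fields of a record; unmapped fields are dropped."""
    mapping = MAPPINGS[cohort]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def cohorts_with(concept):
    """List the cohorts that capture a given harmonised concept."""
    return sorted(c for c, m in MAPPINGS.items() if concept in m.values())


if __name__ == "__main__":
    print(harmonise("cohort_b", {"gender": "F", "AGE": 54, "local_id": 7}))
    print(cohorts_with("bmi"))
```

Keeping the mapping as data rather than code means a new cohort can be added without changing the query logic.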
Semantic harmonisation
CINECA’s model, the Genomic Cohort Knowledge Ontology (GECKO), was formalised using the Web Ontology Language (OWL), a World Wide Web Consortium standard, and adopts OBO Foundry best practices for ontology development. GECKO leverages and contributes to existing resources for maximal interoperability. It is used in the International HundredK+ Cohorts Consortium (IHCC) cohort browser, and has been extended for use by the Davos Alzheimer’s Collaborative consortium. GECKO increases data findability through high-level metadata queries against an interactive, searchable registry that is regularly updated from the cohorts. GECKO is publicly available (CC-BY).
WP3 are working towards a ‘minimum information’ recommendation for federated cohort studies. This is being developed through outreach and alignment with BBMRI, other EUCAN projects, the European Human Exposome Network, IMI, the European Medicines Agency, and IHCC. The next step will be to finalise model v1 and convert it to JSON Schema.
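To give a sense of what a JSON Schema rendering of a minimal cohort record could look like, here is a small sketch. The field names are hypothetical and do not come from the model itself. The checker is hand-written and covers only the `required` and `type` keywords, so it is not a substitute for a full JSON Schema validator.

```python
import json

# Hypothetical minimal cohort record schema; field names are illustrative only.
COHORT_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "CohortRecord",
    "type": "object",
    "required": ["cohort_name", "countries", "participant_count"],
    "properties": {
        "cohort_name": {"type": "string"},
        "countries": {"type": "array"},
        "participant_count": {"type": "integer"},
        "data_types": {"type": "array"},
    },
}

# Python types corresponding to the JSON types used in the schema above.
_JSON_TYPES = {"string": str, "integer": int, "array": list, "object": dict}


def check_record(record, schema=COHORT_SCHEMA):
    """Return a list of problems; handles only 'required' and 'type'."""
    problems = []
    for name in schema["required"]:
        if name not in record:
            problems.append(f"missing required field: {name}")
    for name, rule in schema["properties"].items():
        if name not in record:
            continue
        value = record[name]
        expected = _JSON_TYPES[rule["type"]]
        # bool is a subclass of int in Python, so reject it explicitly;
        # the schema above declares no boolean fields.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{name}: expected {rule['type']}")
    return problems


if __name__ == "__main__":
    print(json.dumps(COHORT_SCHEMA, indent=2))
    print(check_record({"cohort_name": "Example", "countries": ["CH"]}))
```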
Standardisation strategy and engagement
Members of CINECA are actively engaged in global efforts to develop and promote existing and emerging standards. From the outset, the CINECA project chose not to reinvent the wheel but instead to identify gaps in existing resources, and to leverage connections with the corresponding communities to fill those gaps or extend other efforts. In particular, we have been investing in understanding the metadata cataloguing needs of cohort studies across the EU, Canada and Africa. To this end, we have engaged many cohort networks, intending to harmonise metadata definitions and increase the FAIRness of cohort metadata, including:
all consortia in the EUCAN programme (EUCAN-connect, EuCanShare, iReceptor, EuCanCan, Recodid)
members of the European Human Exposome Network
members of BBMRI-ERIC and Maelstrom (responsible for EU and Canadian catalogues of biobanks and cohorts).
In addition to cohort networks, we have engaged the European Joint Programme for Rare Disease (EJP-RD) and Solve-RD projects, specifically concerning genomics data for rare disease, as well as the IMI projects Conception and VAC4EU, consortia that study ‘real world evidence’ by collecting data in health care settings. The result of this community exchange is early harmonisation on ‘minimum data elements’ for describing human datasets in cohorts and beyond.
Our vision for the next phase of CINECA is to continue collaborations with the aforementioned networks on standard metadata elements for cohort definition (see figure below), and to demonstrate metadata exchange between CINECA and these networks, thereby showing the FAIRness of cohort metadata in practice.
Text-mining pipeline
Harmonisation of attributes across different cohorts is not restricted to the cohort data itself; cohorts are also described in scientific publications, and supplementary sections of research papers often contain large amounts of semi-structured data. This supplementary data can contain additional information that both reaffirms and enhances the original data, and it could be very helpful in search and data discovery. Due to its nature and scale, this data is not amenable to manual processing, and instead requires the application of machine learning and Natural Language Processing techniques to extract useful information.
The CINECA text mining group aims to provide common tools and methods to extract additional metadata from unstructured and semi-structured fields in cohort data. WP3 members have written detailed blogs on some of the tools in development. Focusing on the CoLaus/PsyCoLaus cohort data, HES-SO/SIB has developed a pipeline that uses MetaMap and learning-to-rank to assign unambiguous metadata descriptors to free text. SFU has developed LexMapr, a rule-based text mining tool that cleans up and parses short unstructured text to extract biomedical entities and map them to standard ontology terms. EMBL-EBI has developed the Zooma and Curami pipelines to annotate and curate semi-structured data.
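The general idea behind rule-based mapping of free text to ontology terms can be sketched in a few lines: clean the text, then look for known phrases and synonyms in a lexicon. The example below is a toy illustration of that idea only. It does not reproduce LexMapr, MetaMap, Zooma or Curami, and the lexicon, synonyms and `EX:` identifiers are made up for the example.

```python
import re

# Hypothetical lexicon: preferred label -> placeholder term identifier.
LEXICON = {
    "type 2 diabetes": "EX:0001",
    "hypertension": "EX:0002",
    "body mass index": "EX:0003",
}

# Hypothetical synonyms and abbreviations pointing at preferred labels.
SYNONYMS = {
    "t2d": "type 2 diabetes",
    "high blood pressure": "hypertension",
    "bmi": "body mass index",
}


def normalise(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def map_terms(text):
    """Return (preferred label, term id) pairs found in the text."""
    cleaned = normalise(text)
    found = []
    # Try longer phrases first so multi-word entries are checked before short ones.
    for phrase in sorted(list(LEXICON) + list(SYNONYMS), key=len, reverse=True):
        if re.search(rf"\b{re.escape(phrase)}\b", cleaned):
            label = SYNONYMS.get(phrase, phrase)
            pair = (label, LEXICON[label])
            if pair not in found:
                found.append(pair)
    return found


if __name__ == "__main__":
    print(map_terms("Hx of T2D; High blood-pressure!"))
```

Real pipelines add steps this sketch omits, such as spelling correction, stemming, and ranking among competing candidate terms.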
Data use ontology
The Data Use Ontology (DUO) is a hierarchical vocabulary of machine-readable data use terms which has been developed in collaboration with and approved as a standard by the Global Alliance for Genomics and Health (GA4GH). The DUO terms allow the consistent and unambiguous representation of data use conditions in order to discover, gain access to and integrate diverse datasets.
Over 200,000 DUO annotations have been made globally as of February 2021.
WP3 additionally contributed to the development of the GA4GH Machine Readable Consent Guidance standard which provides instructions for researchers to integrate DUO into consent forms. We have collected a number of consent forms (together with the FAIR genomes project) that are being used to evaluate the systems for completeness of coverage.
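To illustrate how machine-readable data use terms can support dataset discovery, the sketch below filters datasets by whether their DUO permission term is compatible with a stated research purpose. The four DUO identifiers are existing DUO terms, but the matching rules, the `purpose` values and the dataset names are simplifying assumptions made for this example and are not the GA4GH matching procedure.

```python
from dataclasses import dataclass
from typing import Optional

# A small subset of DUO permission terms.
DUO_LABELS = {
    "DUO:0000004": "no restriction",
    "DUO:0000042": "general research use",
    "DUO:0000006": "health or medical or biomedical research",
    "DUO:0000007": "disease specific research",
}


@dataclass
class Dataset:
    name: str                      # hypothetical dataset name
    duo_term: str                  # one of the keys in DUO_LABELS
    disease: Optional[str] = None  # only used with DUO:0000007


def permits(dataset, purpose, disease=None):
    """Simplified check; purpose is 'general', 'biomedical' or 'disease'."""
    term = dataset.duo_term
    if term in ("DUO:0000004", "DUO:0000042"):
        return True
    if term == "DUO:0000006":
        return purpose in ("biomedical", "disease")
    if term == "DUO:0000007":
        return (
            purpose == "disease"
            and disease is not None
            and disease == dataset.disease
        )
    return False  # unknown terms are denied by default


def discover(datasets, purpose, disease=None):
    """Names of datasets whose use conditions allow the stated purpose."""
    return [d.name for d in datasets if permits(d, purpose, disease)]


if __name__ == "__main__":
    catalogue = [
        Dataset("open_set", "DUO:0000042"),
        Dataset("biomed_set", "DUO:0000006"),
        Dataset("cardio_set", "DUO:0000007", disease="cardiovascular disease"),
    ]
    print(discover(catalogue, "disease", "cardiovascular disease"))
```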
Institutions Contributing to the WP effort to date
EMBL-EBI, SFU, CSC, HES-SO, UCT, SIB, SickKids/UHN, UMCG, EMC, INSERM, ClinicaGeno, CRG, The Hyve, BBMRI-ERIC