Posts in WP3
Access representation ontology developed for project's cohorts - D3.5

Authors: Melanie Courtot (EMBL-EBI), Jonathan Dursi (SickKids), Nicky Mulder (UCT), Morris Swertz (UMCG)

Access, reuse and integration of biomedical datasets is critical to advance genomics research and realise benefits to human health. However, obtaining human controlled-access data in a timely fashion can be challenging, as neither the access requests nor the data uses conditions are standardised: their manual review and evaluation by a Data Access Committee (DAC) to determine whether access should be granted or not can significantly delay the process, typically by at least 4 to 6 weeks once the dataset of interest has been identified.

To address this, we have contributed to the development of the Data Use Ontology (DUO), which was approved as a Global Alliance for Genomics and Health (GA4GH) standard and has been used in over 200,000 annotations worldwide. DUO is a machine readable structured vocabulary that contains "Permission terms" (which describe data use permissions) and "Modifier terms" (which describe data use requirements, limitations or prohibitions) and it has already been implemented in some CINECA cohort and cohort data sharing resources (e.g. EGA, H3Africa, synthetic datasets); additional cohorts are in the process of reviewing data access policies with a view of applying DUO terms to their datasets.

https://doi.org/10.5281/zenodo.5795449

Read More
Text mining processing pipeline for semi structured data - D3.3

Authors: Jenny Copara (SIB), Nona Naderi (SIB), Alexander Kellmann (UMCG), Gurinder Gosal (SFU), William Hsiao (SFU), Douglas Teodoro (SIB)

Unstructured and semi-structured cohort data contain relevant information about the health condition of a patient, e.g., free text describing disease diagnoses, drugs, medication reasons, which are often not available in structured formats. One of the challenges posed by medical free texts is that there can be several ways of mentioning a concept. Therefore, encoding free text into unambiguous descriptors allows us to leverage the value of the cohort data, in particular, by facilitating its findability and interoperability across cohorts in the project.

Named entity recognition and normalization enable the automatic conversion of free text into standard medical concepts. Given the volume of available data shared in the CINECA project, the WP3 text mining working group has developed named entity normalization techniques to obtain standard concepts from unstructured and semi-structured fields available in the cohorts. In this deliverable, we present the methodology used to develop the different text mining tools created by the dedicated SFU, UMCG, EBI, and HES-SO/SIB groups for specific CINECA cohorts.

https://doi.org/10.5281/zenodo.5795433

Read More
Semantic and harmonisation best practice - D3.2

Authors - Melanie Courtot (EMBL-EBI), Isuru Liyanage (EMBL-EBI)

To support human cohort genomic and other omic data discovery and analysis across jurisdictions, basic data such as cohort participants’ demographic data, diseases, medication etc. (termed “minimal metadata”) needs to be harmonised. Individual cohorts are constrained by size, ancestral origins, and geographic boundaries that limit the subgroups, exposures, outcomes, and interactions which can be examined. Combining data across large cohorts to address questions none of them can answer alone enhances the value of each and leverages the enormous investments already made in them to address pressing questions in global health. By capturing genomic, epidemiological, clinical and environmental data from genetically and environmentally diverse populations, including populations that are traditionally under-represented, we will be able to capture novel factors associated with health and disease that are applicable to both individuals and communities globally.

We provide best practices for cohort metadata harmonisation, using the semantic platform we deployed in the cloud to enable cohort owners to map their data and harmonise against the GECKO (GEnomics Cohorts Knowledge Ontology) we developed. GECKO is derived from the CINECA minimal metadata model of the basic set of attributes that should be recorded with all cohorts and is critical to aid initial querying across jurisdictions for suitable dataset discovery. We describe how this minimal metadata model was formalised using modern semantic standards, making it interoperable with external efforts and machine readable. Furthermore, we present how those practices were successfully used at scale, both within CINECA for data discovery in WP1 and in the synthetic datasets constructed by WP3, and outside of CINECA such as in the International HundredK+ Cohorts Consortium (IHCC) and the Davos Alzheimer’s Collaborative (DAC). Finally, we highlight ongoing work for alignment with other efforts in the community and future opportunities.

https://doi.org/10.5281/zenodo.5055308

Read More
Useful ontologies for harmonizing cohort data

This video describes a community of practice for interoperable ontology building called the OBO Foundry, and highlights a number of well curated and maintained ontologies that are useful for annotating cohort data. The video is aimed at anyone interested in data standardization and/or the ontology approach (i.e. public, end users). No prerequisite knowledge is required, but viewers may also find our previous videos useful. This video is part of the CINECA online training series, where you can learn about concepts and tools relevant to federated analysis of cohort data.

Read More