CINECA Challenge 3: Harmonised Cohort Level Metadata
Cohort metadata is essential for analysis, and some cohorts have thousands of variables, including longitudinal data. Cohorts are also assembled for different purposes, for example population longitudinal studies versus studies of disease progression. Cohort variables are considered both individually and in groups, where a group may represent a diagnosis, e.g. the measured variables related to metabolic syndrome or to risk factors for stroke. Standardisation and interoperability of these data are critical for this project, and applying the FAIR (Findable, Accessible, Interoperable and Reusable) principles brings benefits to cohort owners and the wider community.
Work Package 3 addresses the metadata representation needs for cohort aggregate and individual data across studies and time. To address this complexity, we will develop a cross-cutting matrix with which to organise the cohorts, and will explore their existing data representation, for example the variables recorded, variable values and coding systems used. We will then construct a common minimal metadata model enabling the project’s use cases, for example federated analyses and meta-analyses. This work is informed by successful projects from other domains (e.g. Monarch Initiative, see LoS Haendel). We will explore the semantic coding systems and design harmonisation and ontology mapping strategies in support of Work Packages 1-5. These will leverage CINECA partners’ experience and will be aligned with outputs from international standards activities, e.g. ELIXIR, GA4GH, BBMRI, EOSCpilot and P3G. They will also leverage previous EC, national and international investment in harmonisation tools, e.g. DATS (NIH), the CORBEL (EC) project semantic toolkit such as the Ontology Lookup Service, BioBankUniverse/Connect (EC), MIABIS (EC) and other relevant checklists hosted at fairsharing.org.
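As an illustration of what mapping cohort variables onto a common minimal metadata model could look like in practice, the following Python sketch harmonises the same measurement recorded under different names and units in two cohorts. It is not the project's model: the cohort names, variable names, the `CommonVariable` and `Mapping` classes, the `harmonise` function and the `EX:0000001` identifier are all hypothetical placeholders chosen for this example, and a real mapping would reference terms from an actual ontology.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class CommonVariable:
    """One entry in a hypothetical common minimal metadata model."""
    name: str     # harmonised variable name
    unit: str     # harmonised unit
    term_id: str  # placeholder identifier, NOT a real ontology term


@dataclass(frozen=True)
class Mapping:
    """How one cohort-specific variable maps onto a common variable."""
    target: CommonVariable
    convert: Callable[[float], float]  # converts a value into the common unit


BODY_WEIGHT = CommonVariable(name="body_weight", unit="kg", term_id="EX:0000001")

# Hypothetical cohorts that record the same measurement differently.
MAPPINGS: Dict[Tuple[str, str], Mapping] = {
    ("cohort_a", "weight_kg"): Mapping(BODY_WEIGHT, lambda v: v),
    ("cohort_b", "wt_g"): Mapping(BODY_WEIGHT, lambda v: v / 1000.0),
}


def harmonise(cohort: str, variable: str, value: float) -> dict:
    """Translate one cohort-specific record into the common model."""
    mapping = MAPPINGS.get((cohort, variable))
    if mapping is None:
        # Unmapped variables are surfaced rather than silently dropped.
        raise KeyError(f"no mapping for {cohort}/{variable}")
    return {
        "variable": mapping.target.name,
        "unit": mapping.target.unit,
        "term_id": mapping.target.term_id,
        "value": mapping.convert(value),
        "source": f"{cohort}/{variable}",  # provenance of the original variable
    }


if __name__ == "__main__":
    print(harmonise("cohort_a", "weight_kg", 72.5))
    print(harmonise("cohort_b", "wt_g", 72500.0))
```

Keeping the source variable alongside the harmonised value preserves provenance, which matters when records from several cohorts are later combined in a federated analysis or meta-analysis.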