This report details the development of a demonstration tool for CINECA Task T5.1, which integrates WP1-3 CINECA services to advance federated biobank search. UML diagrams and mockups were used to model the search process and design the user interface, and the system architecture and the strategies for accessing and storing data are described. The report highlights efforts to integrate pre-analytical metadata, which are important for assessing biospecimen quality; a synthetic dataset was prepared to develop the search services for accessing and visualizing such data. The evaluation of the robustness and search effectiveness of the implemented services is also discussed.
https://doi.org/10.5281/zenodo.6783295
Authors: Nicola Mulder (UCT), Mamana Mbiyavanga (UCT), Saskia Hiltemann (EMC), Vera Matser (EMBL-EBI), Marta Lloret Llinares (EMBL-EBI)
In this deliverable document, we report on the activities in Task 6.4 - Training Programme, describe the CINECA training activities that took place in months 25-36 of the project, and provide the Training Plan for the final year.
The CINECA training programme aims to train people within the CINECA consortium as well as external users. Different approaches have been employed, including face-to-face and online courses, hackathons, training videos and staff exchanges. While we waited for CINECA products to be completed, many of the training efforts for the year again focused on internal learning opportunities and knowledge exchanges, but some externally facing events were held to disseminate outputs. All the training and outreach events continued to be heavily impacted by COVID-19, which made it impossible to hold face-to-face workshops and staff exchanges; the staff exchanges were suspended and replaced with virtual meet-ups.
https://doi.org/10.5281/zenodo.5795482
Authors: Melanie Courtot (EMBL-EBI), Jonathan Dursi (SickKids), Nicky Mulder (UCT), Morris Swertz (UMCG)
Access, reuse and integration of biomedical datasets is critical to advance genomics research and realise benefits to human health. However, obtaining human controlled-access data in a timely fashion can be challenging, as neither the access requests nor the data use conditions are standardised: their manual review and evaluation by a Data Access Committee (DAC) to determine whether access should be granted can significantly delay the process, typically by at least 4 to 6 weeks once the dataset of interest has been identified.
To address this, we have contributed to the development of the Data Use Ontology (DUO), which was approved as a Global Alliance for Genomics and Health (GA4GH) standard and has been used in over 200,000 annotations worldwide. DUO is a machine-readable structured vocabulary that contains "Permission terms" (which describe data use permissions) and "Modifier terms" (which describe data use requirements, limitations or prohibitions). It has already been implemented in some CINECA cohorts and cohort data-sharing resources (e.g. EGA, H3Africa, synthetic datasets); additional cohorts are in the process of reviewing data access policies with a view to applying DUO terms to their datasets.
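To illustrate the kind of automated check that machine-readable DUO annotations enable, the sketch below matches a researcher's proposed use against a dataset's permission and modifier terms. The specific term IDs and the matching rule are illustrative only, not the normative DUO evaluation logic used by any DAC.

```python
# Sketch: compatibility check between a dataset's DUO annotations and a
# proposed data use. Term IDs here are examples, not authoritative mappings.
dataset_terms = {
    "permission": "DUO:0000007",   # e.g. disease-specific research
    "modifiers": {"DUO:0000019"},  # e.g. a use requirement attached by the DAC
}

def access_compatible(requested_permission: str, accepted_modifiers: set) -> bool:
    """Grant only if the permission matches and every modifier attached to
    the dataset is accepted by the requester."""
    return (requested_permission == dataset_terms["permission"]
            and dataset_terms["modifiers"] <= accepted_modifiers)

print(access_compatible("DUO:0000007", {"DUO:0000019", "DUO:0000026"}))  # True
print(access_compatible("DUO:0000042", {"DUO:0000019"}))                 # False
```

Because both sides of the comparison are controlled vocabulary terms rather than free-text policy statements, this check can run automatically before any manual DAC review.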
https://doi.org/10.5281/zenodo.5795449
Authors: Jenny Copara (SIB), Nona Naderi (SIB), Alexander Kellmann (UMCG), Gurinder Gosal (SFU), William Hsiao (SFU), Douglas Teodoro (SIB)
Unstructured and semi-structured cohort data contain relevant information about the health condition of a patient, e.g. free text describing disease diagnoses, drugs, and reasons for medication, which is often not available in structured formats. One of the challenges posed by medical free text is that a concept can be mentioned in several different ways. Therefore, encoding free text into unambiguous descriptors allows us to leverage the value of the cohort data, in particular by facilitating its findability and interoperability across cohorts in the project.
Named entity recognition and normalization enable the automatic conversion of free text into standard medical concepts. Given the volume of available data shared in the CINECA project, the WP3 text mining working group has developed named entity normalization techniques to obtain standard concepts from unstructured and semi-structured fields available in the cohorts. In this deliverable, we present the methodology used to develop the different text mining tools created by the dedicated SFU, UMCG, EBI, and HES-SO/SIB groups for specific CINECA cohorts.
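The core idea of entity normalization can be shown with a minimal dictionary-lookup sketch: different surface forms of the same disease are mapped to one standard concept identifier. The lexicon and concept IDs below are invented for illustration; the tools described in the deliverable use trained recognition and normalization models rather than a fixed synonym list.

```python
# Minimal sketch of named entity normalization: map free-text disease
# mentions to a single standard concept ID via a synonym dictionary.
# Synonyms and IDs are illustrative, not a real terminology mapping.
import re

SYNONYMS = {
    "heart attack": "MONDO:EXAMPLE-1",
    "myocardial infarction": "MONDO:EXAMPLE-1",   # same concept, different surface form
    "high blood pressure": "MONDO:EXAMPLE-2",
    "hypertension": "MONDO:EXAMPLE-2",
}

def normalize(text: str) -> list:
    """Return (mention, concept_id) pairs found in the text."""
    lowered = text.lower()
    found = []
    for mention, concept in SYNONYMS.items():
        if re.search(r"\b" + re.escape(mention) + r"\b", lowered):
            found.append((mention, concept))
    return found

print(normalize("Patient admitted after a heart attack; history of hypertension."))
# [('heart attack', 'MONDO:EXAMPLE-1'), ('hypertension', 'MONDO:EXAMPLE-2')]
```

Once mentions are resolved to shared identifiers, records from different cohorts that describe the same condition in different words become directly comparable.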
https://doi.org/10.5281/zenodo.5795433
Authors: Mikael Linden (CSC), Martin Kuba (MU), Jorge Izquierdo Ciges (EBI), Mamana Mbiyavanga (UCT)
GA4GH Passport (a.k.a. GA4GH Researcher ID) is the GA4GH standard for expressing an authenticated researcher’s roles and data access permissions (a.k.a. passport visas). Together with the GA4GH Authentication and Authorisation Infrastructure (AAI) specification, it describes how passport visas are issued and delivered from their authority (such as a Data Access Committee) to the environment where the data access takes place (such as an analysis platform or cloud). The work package has contributed to the Passport and AAI standards which were approved by the GA4GH in October 2019.
This deliverable provides an overview of the GA4GH Passport and AAI standards (version 1.0) in general. It then describes in detail how ELIXIR AAI has implemented the specification as a Passport broker service. Some directions for the next version of the GA4GH Passport and AAI standards, which are still in development in the GA4GH DURI workstream, are outlined. Finally, a hands-on experiment presents how passports could alternatively leverage self-sovereign identity, an emerging identity and access management paradigm in the industry.
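To give a feel for what a passport visa carries, the sketch below shows the general shape of a visa payload and a basic expiry and type check. In the standard a visa is a signed JWT issued by a visa authority and verified cryptographically; the field values here are invented, and signature verification is deliberately omitted.

```python
# Sketch of the payload carried by a GA4GH Passport visa. A real visa is a
# signed JWT; issuer, subject and dataset values below are invented examples.
import time

visa_payload = {
    "iss": "https://broker.example.org/",                 # hypothetical Passport broker
    "sub": "researcher-123@example.org",                  # authenticated researcher
    "exp": int(time.time()) + 3600,                       # visa expiry
    "ga4gh_visa_v1": {
        "type": "ControlledAccessGrants",                 # kind of assertion
        "value": "https://data.example.org/datasets/DS1", # what is granted
        "source": "https://dac.example.org",              # asserting authority
        "by": "dac",                                      # asserted by a Data Access Committee
        "asserted": int(time.time()) - 86400,             # when it was asserted
    },
}

def grants_access(payload: dict, dataset: str) -> bool:
    """Accept only unexpired ControlledAccessGrants visas for the dataset.
    (A real relying party would also verify the JWT signature and issuer.)"""
    visa = payload.get("ga4gh_visa_v1", {})
    return (payload["exp"] > time.time()
            and visa.get("type") == "ControlledAccessGrants"
            and visa.get("value") == dataset)

print(grants_access(visa_payload, "https://data.example.org/datasets/DS1"))  # True
```

The point of the standard is exactly this separation: the DAC asserts the grant once, and any analysis platform or cloud receiving the visa can evaluate it locally.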
https://doi.org/10.5281/zenodo.5795407
Authors: Melanie Courtot (EMBL-EBI), Isuru Liyanage (EMBL-EBI)
To support human cohort genomic and other omic data discovery and analysis across jurisdictions, basic data such as cohort participants' demographic data, diseases, medication, etc. (termed "minimal metadata") need to be harmonised. Individual cohorts are constrained by size, ancestral origins, and geographic boundaries that limit the subgroups, exposures, outcomes, and interactions which can be examined. Combining data across large cohorts to address questions none of them can answer alone enhances the value of each and leverages the enormous investments already made in them to address pressing questions in global health. By capturing genomic, epidemiological, clinical and environmental data from genetically and environmentally diverse populations, including populations that are traditionally under-represented, we will be able to capture novel factors associated with health and disease that are applicable to both individuals and communities globally.
We provide best practices for cohort metadata harmonisation, using the semantic platform we deployed in the cloud to enable cohort owners to map their data and harmonise it against the GECKO (GEnomics Cohorts Knowledge Ontology) that we developed. GECKO is derived from the CINECA minimal metadata model of the basic set of attributes that should be recorded with all cohorts and is critical to aid initial querying across jurisdictions for suitable dataset discovery. We describe how this minimal metadata model was formalised using modern semantic standards, making it interoperable with external efforts and machine readable. Furthermore, we present how those practices were successfully used at scale, both within CINECA for data discovery in WP1 and in the synthetic datasets constructed by WP3, and outside of CINECA such as in the International HundredK+ Cohorts Consortium (IHCC) and the Davos Alzheimer's Collaborative (DAC). Finally, we highlight ongoing work for alignment with other efforts in the community and future opportunities.
https://doi.org/10.5281/zenodo.5055308
The CINECA project aims to develop a common infrastructure to support federated data analysis across national cohorts in Europe, Canada, and Africa. In this report, the progress made over the past four years is discussed, which involves the development of six modular workflows to quantify and normalize molecular traits, pre-process genotype data, and test for associations between molecular traits and genotypes. The approach improves on the previous state of the art by packaging software dependencies into Docker/Singularity containers, using the Nextflow language to orchestrate complex multi-step workflows, and using the HASE(1) framework to reduce the amount of data that needs to be transferred between cohorts. The project provides training materials and open access datasets to encourage adoption and demonstrates how the workflows can be used to perform federated analysis across multiple real cohorts located in Switzerland, Germany, the Netherlands, and Estonia.
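The data-minimising principle behind this federated setup can be sketched in a few lines: each cohort computes an association statistic locally, and only per-cohort summaries (effect size and standard error) leave the site, to be combined centrally. This is an illustration of the general idea using a standard fixed-effect meta-analysis, not the actual HASE or Nextflow implementation; the cohort numbers are invented.

```python
# Sketch of federated association analysis: only (beta, se) summaries are
# shared between sites, then combined by inverse-variance meta-analysis.
import math

def meta_analyse(summaries):
    """Fixed-effect inverse-variance meta-analysis over per-cohort (beta, se)."""
    weights = [1.0 / se ** 2 for _, se in summaries]
    beta = sum(w * b for w, (b, _) in zip(weights, summaries)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return beta, se

# Per-cohort effect estimates for one variant-trait pair (invented numbers)
cohort_summaries = [(0.12, 0.05), (0.10, 0.04), (0.15, 0.06)]
beta, se = meta_analyse(cohort_summaries)
print(round(beta, 4), round(se, 4))  # 0.1168 0.0277
```

Because each summary is a few numbers rather than individual-level genotypes, the amount of data crossing jurisdictional boundaries stays small, which is the same motivation the report gives for using HASE.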
https://doi.org/10.5281/zenodo.7464116
Authors: Vivian Jin, Fiona Brinkman (SFU)
To support human cohort genomic and other "omic" data discovery and analysis across jurisdictions, basic data such as cohort participant age, sex, etc. need to be harmonised. Developing a key "minimal metadata model" of these basic attributes, which should be recorded with all cohorts, is critical to aid initial querying across jurisdictions for suitable dataset discovery. We describe here the creation of a minimal metadata model, the specific methods used to create it, and this model's utility and impact.
A first version of the metadata model was built based on a review of Maelstrom research data standards and a manual survey of cohort data dictionaries, which identified and incorporated overlapping core variables across CINECA cohorts. The model was then converted to Genomics Cohorts Knowledge Ontology (GECKO) format and further expanded with additional terms. The minimal metadata model is being made broadly available to aid any project, including those outside of CINECA, interested in facilitating cross-jurisdictional data discovery and analysis.
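In practice, harmonising against such a minimal model amounts to mapping each cohort's local variable names onto the shared terms before cross-cohort querying. The sketch below shows this renaming step with invented cohort variables and mappings; real mappings come from the curated model and data dictionaries, not a hard-coded table.

```python
# Illustrative sketch: rename cohort-specific variables to the shared
# minimal-metadata terms. Cohort names, variables and mappings are invented.
cohort_a = {"age_at_recruitment": 54, "gender": "F"}
cohort_b = {"AGE": 61, "sex_of_participant": "female"}

# Per-cohort mapping of local variable names to shared model terms
MAPPINGS = {
    "cohort_a": {"age_at_recruitment": "age", "gender": "sex"},
    "cohort_b": {"AGE": "age", "sex_of_participant": "sex"},
}

def harmonise(record: dict, cohort: str) -> dict:
    """Rename a record's local variables to the shared minimal-metadata terms,
    dropping variables that have no mapping."""
    mapping = MAPPINGS[cohort]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

print(harmonise(cohort_a, "cohort_a"))  # {'age': 54, 'sex': 'F'}
print(harmonise(cohort_b, "cohort_b"))  # {'age': 61, 'sex': 'female'}
```

After this step, a single query over the shared terms (e.g. `age`, `sex`) can be evaluated against records originating from any participating cohort.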
https://doi.org/10.5281/zenodo.4575460