In EC-funded projects such as CINECA, our grant agreement with the European Commission commits us to producing agreed Deliverables at specified time points in the project timeline and delivering them to the EC. These Deliverables are outputs such as data, reports, a software milestone or other building block of the action. For Deliverables that are not in writing (e.g. demonstration of software implementation, or a database), the project must submit a short written description identifying the Deliverable.
Here, we list some of the Deliverables that the CINECA project has submitted to the EC to date, with short descriptions and DOIs.
This report details the development of a demonstration tool for CINECA Task T5.1, which integrates WP1-3 CINECA services to advance federated biobank search. UML diagrams and mockups were used to model the search process and create a user interface. The architecture of the system and strategies for accessing and storing data were also described. The report highlights efforts to integrate pre-analytical metadata, which are important for determining the quality of biospecimens, and a synthetic dataset was prepared to develop search services for accessing and visualizing such data. The evaluation of the robustness and search effectiveness of the implemented services is also discussed.
https://doi.org/10.5281/zenodo.6783295
Authors: Nicola Mulder (UCT), Mamana Mbiyavanga (UCT), Saskia Hiltemann (EMC), Vera Matser (EMBL-EBI), Marta Lloret Llinares (EMBL-EBI)
In this deliverable document, we report on the activities in task 6.4 - Training Programme, describe the CINECA training activities that took place in months 25-36 of the project and provide the Training Plan for the final year.
The CINECA training programme aims to train people within the CINECA consortium as well as external users. Different approaches have been employed, including face-to-face and online courses, hackathons, training videos and staff exchanges. While we waited for CINECA products to be completed, many of the training efforts for the year again focused on internal learning opportunities and knowledge exchanges, but some externally facing events were held to disseminate outputs. All the training and outreach events continued to be heavily impacted by COVID-19, which removed our ability to hold face-to-face workshops and staff exchanges. The staff exchanges were suspended and replaced with virtual meet-ups.
https://doi.org/10.5281/zenodo.5795482
Authors: Melanie Courtot (EMBL-EBI), Jonathan Dursi (SickKids), Nicky Mulder (UCT), Morris Swertz (UMCG)
Access, reuse and integration of biomedical datasets are critical to advancing genomics research and realising benefits to human health. However, obtaining human controlled-access data in a timely fashion can be challenging, as neither the access requests nor the data use conditions are standardised. Their manual review and evaluation by a Data Access Committee (DAC), to determine whether access should be granted, can significantly delay the process, typically by at least 4 to 6 weeks once the dataset of interest has been identified.
To address this, we have contributed to the development of the Data Use Ontology (DUO), which was approved as a Global Alliance for Genomics and Health (GA4GH) standard and has been used in over 200,000 annotations worldwide. DUO is a machine-readable structured vocabulary that contains "Permission terms" (which describe data use permissions) and "Modifier terms" (which describe data use requirements, limitations or prohibitions). It has already been implemented in some CINECA cohorts and cohort data sharing resources (e.g. EGA, H3Africa, synthetic datasets); additional cohorts are in the process of reviewing data access policies with a view to applying DUO terms to their datasets.
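As an illustration of how machine-readable permission and modifier terms could support an automated first-pass check, here is a minimal Python sketch. It is not the DUO matching logic itself: the term labels, the `PERMISSION_RANK` ordering and the `is_compatible` function are hypothetical simplifications, and the real ontology hierarchy is richer than a single ranking.

```python
from dataclasses import dataclass, field

# Hypothetical shorthand labels, NOT official DUO identifiers.
# Higher rank means a broader permission.
PERMISSION_RANK = {
    "no_restriction": 3,
    "general_research_use": 2,
    "health_medical_biomedical": 1,
}


@dataclass(frozen=True)
class DataUse:
    """Conditions attached to a dataset: one permission plus modifiers."""
    permission: str
    modifiers: frozenset = field(default_factory=frozenset)


@dataclass(frozen=True)
class AccessRequest:
    """What a requester intends to do and which conditions they meet."""
    purpose: str  # broadest permission the intended research needs
    satisfied: frozenset = field(default_factory=frozenset)


def is_compatible(dataset: DataUse, request: AccessRequest) -> bool:
    """Return True when the request fits within the dataset's conditions."""
    # The dataset's permission must be at least as broad as the purpose.
    if PERMISSION_RANK[dataset.permission] < PERMISSION_RANK[request.purpose]:
        return False
    # Every modifier on the dataset must be satisfied by the requester.
    return dataset.modifiers <= request.satisfied
```

A check like this would only pre-screen requests; the final decision would still rest with the Data Access Committee.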
https://doi.org/10.5281/zenodo.5795449
Authors: Jenny Copara (SIB), Nona Naderi (SIB), Alexander Kellmann (UMCG), Gurinder Gosal (SFU), William Hsiao (SFU), Douglas Teodoro (SIB)
Unstructured and semi-structured cohort data contain relevant information about the health condition of a patient, e.g., free text describing disease diagnoses, drugs, medication reasons, which are often not available in structured formats. One of the challenges posed by medical free texts is that there can be several ways of mentioning a concept. Therefore, encoding free text into unambiguous descriptors allows us to leverage the value of the cohort data, in particular, by facilitating its findability and interoperability across cohorts in the project.
Named entity recognition and normalization enable the automatic conversion of free text into standard medical concepts. Given the volume of available data shared in the CINECA project, the WP3 text mining working group has developed named entity normalization techniques to obtain standard concepts from unstructured and semi-structured fields available in the cohorts. In this deliverable, we present the methodology used to develop the different text mining tools created by the dedicated SFU, UMCG, EBI, and HES-SO/SIB groups for specific CINECA cohorts.
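To make the idea of normalization concrete, the following Python sketch maps different surface forms of the same concept onto one identifier using a simple longest-match dictionary lookup. It is only a stand-in for illustration: the deliverable's actual tools are not reproduced here, and the `SYNONYMS` lexicon, the `CONCEPT:` identifiers and the `normalise` function are all hypothetical.

```python
import re

# Hypothetical lexicon: several surface forms map to one concept identifier.
SYNONYMS = {
    "heart attack": "CONCEPT:0001",
    "myocardial infarction": "CONCEPT:0001",
    "high blood pressure": "CONCEPT:0002",
    "hypertension": "CONCEPT:0002",
}


def normalise(text, lexicon=SYNONYMS, max_len=3):
    """Return (matched phrase, concept id) pairs found in free text.

    Scans left to right and prefers the longest phrase (up to max_len
    tokens) that appears in the lexicon.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                found.append((phrase, lexicon[phrase]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts here; move on one token
    return found
```

With this lexicon, "heart attack" and "myocardial infarction" resolve to the same identifier, which is what makes records from different cohorts comparable.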
https://doi.org/10.5281/zenodo.5795433
Authors: Mikael Linden (CSC), Martin Kuba (MU), Jorge Izquierdo Ciges (EBI), Mamana Mbiyavanga (UCT)
GA4GH Passport (a.k.a. GA4GH Researcher ID) is the GA4GH standard for expressing an authenticated researcher’s roles and data access permissions (a.k.a. passport visas). Together with the GA4GH Authentication and Authorisation Infrastructure (AAI) specification, it describes how passport visas are issued and delivered from their authority (such as a Data Access Committee) to the environment where the data access takes place (such as an analysis platform or cloud). The work package has contributed to the Passport and AAI standards which were approved by the GA4GH in October 2019.
This deliverable provides an overview of the GA4GH Passport and AAI standards (version 1.0) in general. It then describes in detail how ELIXIR AAI has implemented the specification as a Passport broker service. Some directions for the next version of the GA4GH Passport and AAI standards, which are still in development in the GA4GH DURI workstream, are outlined. Finally, a hands-on experiment presents how passports could alternatively leverage self-sovereign identity, an emerging identity and access management paradigm in the industry.
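As a rough sketch of how a service at the data access environment might consume passport visas, the Python function below checks already-decoded visa payloads for a controlled-access grant. It assumes the visa layout of the version 1.0 Passport specification as commonly described (a `ga4gh_visa_v1` object with `type`, `value`, `source`, `by` and `asserted` fields next to the standard JWT `exp` claim). Signature verification and token decoding are deliberately left out, and `grants_access` and the example URLs in the usage note are hypothetical.

```python
import time


def grants_access(visas, dataset, trusted_sources, now=None):
    """Return True if any unexpired visa grants access to `dataset`.

    `visas` is a list of decoded visa payloads (dicts). In a real service
    each visa would arrive as a signed JWT whose signature and issuer must
    be verified first; that step is omitted from this sketch.
    """
    now = time.time() if now is None else now
    for visa in visas:
        claim = visa.get("ga4gh_visa_v1", {})
        if (
            claim.get("type") == "ControlledAccessGrants"
            and claim.get("value") == dataset
            and claim.get("source") in trusted_sources  # e.g. a known DAC
            and visa.get("exp", 0) > now                # not yet expired
        ):
            return True
    return False
```

For example, a visa with `value` set to `https://example.org/datasets/710` and `source` set to a trusted DAC identifier would allow access to that dataset until the time given in `exp`.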
https://doi.org/10.5281/zenodo.5795407
Authors: Melanie Courtot (EMBL-EBI), Isuru Liyanage (EMBL-EBI)
To support human cohort genomic and other omic data discovery and analysis across jurisdictions, basic data such as cohort participants’ demographic data, diseases, medication etc. (termed “minimal metadata”) needs to be harmonised. Individual cohorts are constrained by size, ancestral origins, and geographic boundaries that limit the subgroups, exposures, outcomes, and interactions which can be examined. Combining data across large cohorts to address questions none of them can answer alone enhances the value of each and leverages the enormous investments already made in them to address pressing questions in global health. By capturing genomic, epidemiological, clinical and environmental data from genetically and environmentally diverse populations, including populations that are traditionally under-represented, we will be able to capture novel factors associated with health and disease that are applicable to both individuals and communities globally.
We provide best practices for cohort metadata harmonisation, using the semantic platform we deployed in the cloud to enable cohort owners to map their data and harmonise against the GECKO (GEnomics Cohorts Knowledge Ontology) we developed. GECKO is derived from the CINECA minimal metadata model of the basic set of attributes that should be recorded with all cohorts and is critical to aid initial querying across jurisdictions for suitable dataset discovery. We describe how this minimal metadata model was formalised using modern semantic standards, making it interoperable with external efforts and machine readable. Furthermore, we present how those practices were successfully used at scale, both within CINECA for data discovery in WP1 and in the synthetic datasets constructed by WP3, and outside of CINECA such as in the International HundredK+ Cohorts Consortium (IHCC) and the Davos Alzheimer’s Collaborative (DAC). Finally, we highlight ongoing work for alignment with other efforts in the community and future opportunities.
The CINECA project aims to develop a common infrastructure to support federated data analysis across national cohorts in Europe, Canada, and Africa. In this report, the progress made over the past four years is discussed, which involves the development of six modular workflows to quantify and normalize molecular traits, pre-process genotype data, and test for associations between molecular traits and genotypes. The approach improves on the previous state of the art by packaging software dependencies into Docker/Singularity containers, using the Nextflow language to orchestrate complex multi-step workflows, and using the HASE(1) framework to reduce the amount of data that needs to be transferred between cohorts. The project provides training materials and open access datasets to encourage adoption and demonstrates how the workflows can be used to perform federated analysis across multiple real cohorts located in Switzerland, Germany, the Netherlands, and Estonia.
https://doi.org/10.5281/zenodo.7464116
Authors: Álvaro González (CSC), Shubham Kapoor (CSC), Kirill Tsukanov (EMBL-EBI)
The federated analysis platform defined by this task aims to provide technological solutions for three exemplar use cases: Federated joint cohort genotyping; Polygenic Risk Scores (PRS) workflow across two similar ethnic background sample sets; Federated QTL analysis for molecular phenotypes. In this deliverable, we gathered the technical requirements based on these use case descriptions and wrote a short design document which explains the requirements and lists the different options for a solution.
Three distinct frameworks were considered to address the requirements from the use cases. The chosen framework supports different computing environments, which is a requirement for true federated analysis. The framework also supports extending compatibility with GA4GH standards, such as WES, htsget, and AAI / Passports. Plans to extend this proposed solution beyond these initial sites will be carried out after the initial phase of validation.
Authors: Romain Tanzer (HES-SO), Nona Naderi (HES-SO), Douglas Teodoro (HES-SO), Anais Mottaz (HES-SO), Patrick Ruch (HES-SO), Jonathan Dursi (SickKids), Jordi Rambla de Argila (CRG)
CINECA aims to support federated queries and analyses of distributed cohorts across continents. But human health datasets are extremely diverse; many different types of data are collected for many different kinds of health studies by many different health research communities. As a result, different cohort datasets often use different ontologies to describe similar kinds of entities, or represent concepts, such as genomic variation, differently.
CINECA must span this diversity of data representations in order to achieve its goals of connecting health research cohort data. The work of WP3 partially addresses discoverability of datasets by defining a standard minimal cohort-level data representation which will be common across all cohorts; but that does not address cohort-level data that falls outside of the minimal common data model, nor does it address the representation of patient-level data. WP1’s role is to design and deploy API access to both cohort- and patient-level data, and a fundamental functionality of the infrastructure is to allow the user to find the appropriate dataset regardless of the ontology used locally to map the different cohorts, and regardless of the format and syntax used to describe the variants.
This report describes the work done on query expansion, by implementing and demonstrating a query expansion service API that improves findability and searchability of distributed cohort data. Multiple kinds of query expansions are available for enabling further data integration and interoperability, including horizontal expansion, i.e., across ontological systems, and vertical expansion, i.e., within sublevels of the same ontological resource.
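The two kinds of expansion can be sketched in a few lines of Python. This is a toy illustration rather than the deliverable's service API: the `ONT_A` and `ONT_B` term identifiers, the `CHILDREN` and `EQUIVALENT` tables and both functions are hypothetical.

```python
from collections import deque

# Hypothetical toy data: a small hierarchy in ontology A ...
CHILDREN = {
    "ONT_A:diabetes": ["ONT_A:type_1_diabetes", "ONT_A:type_2_diabetes"],
}
# ... and cross-ontology mappings from ontology A to ontology B.
EQUIVALENT = {
    "ONT_A:diabetes": ["ONT_B:D001"],
    "ONT_A:type_2_diabetes": ["ONT_B:D003"],
}


def expand_vertical(term, children=CHILDREN):
    """Vertical expansion: the term plus all of its descendants."""
    expanded, queue = {term}, deque([term])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in expanded:
                expanded.add(child)
                queue.append(child)
    return expanded


def expand_horizontal(terms, equivalent=EQUIVALENT):
    """Horizontal expansion: add mapped terms from other ontologies."""
    expanded = set(terms)
    for term in terms:
        expanded.update(equivalent.get(term, []))
    return expanded
```

Chaining the two, `expand_horizontal(expand_vertical("ONT_A:diabetes"))`, turns a single query term into a set that also matches cohorts annotated with narrower terms or with the other ontology.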
Authors: Vivian Jin, Fiona Brinkman (SFU)
To support human cohort genomic and other “omic” data discovery and analysis across jurisdictions, basic data such as cohort participant age, sex, etc. needs to be harmonised. Developing a key “minimal metadata model” of these basic attributes which should be recorded with all cohorts is critical to aid initial querying across jurisdictions for suitable dataset discovery. We describe here the creation of a minimal metadata model, the specific methods used to create it, and the model’s utility and impact.
A first version of the metadata model was built based on a review of Maelstrom research data standards and a manual survey of cohort data dictionaries, which identified and incorporated overlapping core variables across CINECA cohorts. The model was then converted to Genomics Cohorts Knowledge Ontology (GECKO) format and further expanded with additional terms. The minimal metadata model is being made broadly available to aid any projects, including those outside of CINECA, that are interested in facilitating cross-jurisdictional data discovery and analysis.
In this deliverable document, we report on the activities in task 6.4 - Training Programme, describe the CINECA training activities in the first 24 months of the project and provide the Training Plan for the next 12-24 months. For training interventions targeted at a broader audience, we have set up a webinar series, providing quarterly online learning interventions. We ran a total of 6 webinars (3 in 2019 and 3 in 2020), with 23 attendees on average, representing on average 68% of those who registered. In addition, a series of short training videos (https://www.cineca-project.eu/short-videos) was created to facilitate the uptake of CINECA outputs. Eight short videos were produced by work packages on different topics. The short videos were submitted to ELIXIR’s training portal to increase engagement and disseminated via CINECA’s various communication channels.
https://doi.org/10.5281/zenodo.6223125
Authors: Éloïse Gennet, Melanie Goisauf, Delphine Pichereau, Emmanuelle Rial-Sebbag
Remaining liberties that GDPR provides to EU Member States, as well as remaining ambiguities on GDPR interpretation, continue to feed debates in the ethical and legal literature. Projects like CINECA, which is seeking to facilitate health data exchanges between cohorts in Europe, Canada and Africa, offer valuable experience and input on essential ethical and legal gaps between countries and cohorts on questions such as the ethical lawful basis for international health data sharing and secondary processing for research purposes.
The focus of this deliverable will be on answering, both from a legal and an ethical point of view, two priority questions: How to choose a legal basis for CINECA’s data processing? And how should CINECA approach broad consent to further data processing? The goal will be to study how the CINECA project could be efficiently conducted (especially data sharing) while being legally compliant with relevant laws and regulations across all member states, and most of all, being compliant with established ethical guidelines and practices across three continents.
This deliverable demonstrates authentication and authorisation interoperability between the ELIXIR and CanDIG infrastructures. Users from one infrastructure can access services from the other. The interoperability covers user identification and authentication as well as the transfer of the authorisation claims following the GA4GH Passport and AAI OpenID Connect protocol (OIDC) profile specifications.
CINECA aims to support the federated queries and analyses of distributed cohorts across continents. A vital component of this work is building a machine readable catalogue of cohorts and sites that support the efforts of Work Package 1 discovery and analysis APIs, which can be programmatically queried so that API calls can be made to relevant sites and results gathered and presented to the researcher.
Deliverable D1.1, Discovery Service Catalogue, supports the work of dependent work packages by implementing and demonstrating an open-source extended implementation of the Service Registry standard of the Global Alliance for Genomics and Health (GA4GH) for WP1’s discovery queries, the GA4GH Beacon queries. The Service Registry standard is now supported by the ELIXIR Beacon Network that CINECA WP1 uses to federate discovery queries across cohorts, and this demonstrator deliverable shows the use of the Service Registry and its open-source implementation.
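To illustrate what "programmatically queried" could look like, the Python sketch below picks the endpoints of one service type out of a registry listing. It assumes each entry follows the general shape of a GA4GH Service Registry record (a `url` plus a `type` object with `group` and `artifact`). The listing would normally be fetched from the registry over HTTP; here it is passed in as plain data, and `select_services` and the sample entries in the usage note are hypothetical.

```python
def select_services(services, artifact, group="org.ga4gh"):
    """Return the URLs of registry entries matching a service type.

    `services` is a list of dicts shaped like registry records, e.g.
    {"id": ..., "url": ..., "type": {"group": ..., "artifact": ...}}.
    Entries without a matching type or without a URL are skipped.
    """
    urls = []
    for service in services:
        service_type = service.get("type", {})
        if (
            service_type.get("group") == group
            and service_type.get("artifact") == artifact
            and "url" in service
        ):
            urls.append(service["url"])
    return urls
```

Calling `select_services(listing, "beacon")` on a listing that mixes Beacon and other services would return only the Beacon endpoints, which a client could then query in turn and merge the results for the researcher.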
The CINECA consortium was formed in response to the EU call ‘Better Health and care, economic growth and sustainable health systems’ (H2020-SC1-BHC-2018-2020) with a proposal for an international collaboration with Canada and Africa for a federated cloud enabled infrastructure making population scale genomic and biomolecular data accessible across international borders. The CINECA consortium will create one of the largest cross-continental implementations of human genetic and phenotypic data federation and interoperability with a focus on common (complex) disease, one of the world’s most significant health burdens.
The CINECA kick-off meeting was held on January 24th-25th 2019 at the Wellcome Genome Campus Conference Centre, Hinxton, UK. The key objective of the meeting was to bring together consortium members to facilitate discussion on the project’s goals and action plan. The report focuses on an overview of the Work Packages as presented to the consortium (focusing on deliverables due in the first reporting period), the cohorts included in the project, and the decisions made by the Executive Board for actions to implement in year 1 of the project.
In this deliverable, we report on the activities for Deliverable 4.1 - Report on trust model for partner sites, and between sites and controlled-access researchers. The Work Package 4 (WP4) goals concern the development of a set of tools that can facilitate federated analyses of new and diverse genetic and genomic datasets, based on specific use cases. The tools selected will be using the common federated infrastructure established in WP1 and WP2, and the datasets will be described with metadata standards identified in WP3.
In our report we considered trust as the extent to which one party is willing to depend on the other party in a given situation with a feeling of relative security, even though negative consequences are possible. This work has contributed towards establishing a description of the trust model and four different levels of data access concerning specific cohorts’ data, identifying use cases for the development of federated analysis workflows and describing existing data access models to inspire subsequent WP4 deliverables related to the implementation of the federated analysis workflow. Active communication and engagement of several WP4 members in other CINECA work packages enabled the inclusion in this document of other aspects of the CINECA cohort data access model. Examples include the WP2 cohort survey, the Data Use Ontology framework that was adopted by WP3 and by WP1, and input from WP5 on the harmonisation and formalisation of clinical use cases.
Authors: Éloïse Gennet & Melanie Goisauf
Keywords: ELSI, data sharing, secondary data, FAIR, data processing, consent, international data sharing
The aim of this deliverable is to give an overview of all the different ethical, legal and societal issues that the CINECA project might be confronted with: public health ethics, personal data protection, ethics of data sharing, protection of consent and vulnerability as well as compliance issues between Canada, Africa and Europe in pursuit of the project goal to enable the exchange of population scale health data across international borders to allow and promote the reuse of data for health research. The rationale for sharing and reusing data in public health research is deeply rooted in the promotion of a fair distribution of research risks and benefits, and it has become an essential and powerful tool for public health research. D7.1 has been elaborated in a bottom-up approach, starting from the practical legal and ethical issues encountered notably through Work Package 9 (EC Ethics Requirements). As a basis for the lawful and ethical guarantees for data sharing and reuse within CINECA, all cohorts and consortia have provided copies of their own ethics approvals (Deliverable 9.4), and they are all independently responsible for ensuring researchers accessing data have their own research ethics approval. This deliverable will serve as a starting point for the future deliverable 7.2 which will be aimed at identifying and discussing the gaps in the different legislative or regulatory frameworks and corresponding literature.
As a consequence, this deliverable will be divided into two main parts, the first one focusing on the collective perspectives of international data sharing in public health research (I), the second one examining the opposite perspective of the protection of individual data subjects when their personal data is used for secondary processing (II). Afterwards, future developments will be briefly mentioned (III) before highlighting some of the difficulties encountered in Work Package 9 (IV) and finally listing the references (V).
The overarching purpose of CINECA is to achieve federated human data interoperability between 10 existing cohorts from Canada, Europe and Africa which represent >1.4 million individuals. This will enable population scale genomic and biomolecular data access across international borders, accelerating research and improving the health of individuals resident across continents. This project will not generate novel data from human data, rather it relies on integrating existing resources which hold or store data to deliver new knowledge and innovation. All data access is determined by the existing data access committees (DACs) and the respective data processes for each dataset. The informed consent and ethics approvals for CINECA cohorts were documented in D9.3 and D9.4, and will be respected by this data management plan, which will be an integral part of the consortium’s Governance Framework. Given the international nature of this project, the data management plan is a component of Work Package 7 - Ethical and legal governance framework for transnational data-sharing. We note that all partners in the project have relevant national and/or international experience in acquiring, storing, analysing and sharing high complexity datasets and in the implementation of the FAIR principles for these resources.
The Data Management Plan (DMP) was developed based on the core requirements for DMPs as described by The Science Europe Practical Guide to the International Alignment of Research Data Management (https://www.scienceeurope.org). The DMP is a living document, expected to be updated during the lifetime of the project.
Deliverable D6.2a
Training Programme - M12
Authors: Nicola Mulder (UCT), Mamana Mbiyavanga (UCT), Saskia Hiltemann (EMC), Vera Matser (EMBL-EBI)
In this deliverable document, we report on the activities in task 6.2a - Training Programme and describe the CINECA training activities in the first 12 months of the project.
In the first phase of the CINECA project, many of the training efforts are focused on internal learning opportunities and knowledge exchanges. There are many interdependencies between the different CINECA work packages, so to this end we have set up a staff exchange program. For training interventions targeted at a broader audience, we have set up a webinar series, providing online learning interventions. Feedback from these webinars is collected and is used to further increase the utility of the webinars going forward. We have disseminated a survey to identify the training needs of CINECA’s stakeholder community. Additionally, we ran our first stakeholder engagement session at the International Hundred Thousand Cohorts Consortium (IHCC) Meeting in Reykjavik, Iceland in April 2019. During this session, we gathered feedback on the challenges of managing cohorts and cohort data harmonisation. More extensive face-to-face workshops and training events are being planned for 2020.
In this deliverable document, we report on the activities in task 6.1 – Stakeholder analysis. With the outreach and dissemination plan, as well as the training plan, presented in this report, we have completed this task. We have used a combination of surveys and face-to-face meetings to verify and extend or identify key stakeholder groups, understand their interest in the project, and identify bottlenecks that can be addressed by outreach and training. Outreach activities will be accomplished by task 6.2 and training by task 6.3.
https://doi.org/10.5281/zenodo.7733981