Training Programme Annual Report 2021 - D6.5

Authors: Nicola Mulder (UCT), Mamana Mbiyavanga (UCT), Saskia Hiltemann (EMC), Vera Matser (EMBL-EBI), Marta Lloret Llinares (EMBL-EBI)

In this deliverable document, we report on the activities in task 6.4 - Training Programme, describe the CINECA training activities that took place in months 25-36 of the project and provide the Training Plan for the final year.

The CINECA training programme aims to train people within the CINECA consortium as well as external users. Different approaches have been employed, including face-to-face and online courses, hackathons, training videos and staff exchanges. While we waited for CINECA products to be completed, many of the training efforts for the year again focused on internal learning opportunities and knowledge exchanges, but some externally facing events were held to disseminate outputs. All the training and outreach events continued to be heavily impacted by COVID-19, which removed our ability to hold face-to-face workshops and staff exchanges. The staff exchanges were suspended and replaced with virtual meet-ups.

https://doi.org/10.5281/zenodo.5795482

Access representation ontology developed for project's cohorts - D3.5

Authors: Melanie Courtot (EMBL-EBI), Jonathan Dursi (SickKids), Nicky Mulder (UCT), Morris Swertz (UMCG)

Access, reuse and integration of biomedical datasets are critical to advancing genomics research and realising benefits to human health. However, obtaining human controlled-access data in a timely fashion can be challenging, as neither access requests nor data use conditions are standardised. Their manual review and evaluation by a Data Access Committee (DAC), to determine whether access should be granted, can significantly delay the process, typically by at least 4 to 6 weeks once the dataset of interest has been identified.

To address this, we have contributed to the development of the Data Use Ontology (DUO), which was approved as a Global Alliance for Genomics and Health (GA4GH) standard and has been used in over 200,000 annotations worldwide. DUO is a machine-readable structured vocabulary that contains "Permission terms" (which describe data use permissions) and "Modifier terms" (which describe data use requirements, limitations or prohibitions). It has already been implemented in some CINECA cohort and cohort data sharing resources (e.g. EGA, H3Africa, synthetic datasets); additional cohorts are reviewing their data access policies with a view to applying DUO terms to their datasets.
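As a rough illustration of how permission and modifier terms can be evaluated by machine, the sketch below checks a hypothetical access request against a dataset's DUO-style annotation. It is a simplified model and not the matching logic of any CINECA resource: the class names, the `check_request` function and the ranking of permission terms are assumptions made for this example. Only the DUO shorthand labels (NRES, GRU, HMB, DS, NCU, IRB) come from the ontology.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

# DUO permission terms by shorthand label, ranked from narrowest to broadest.
# DS = disease-specific research, HMB = health/medical/biomedical research,
# GRU = general research use, NRES = no restriction.
# The numeric ranking is an assumption made for this sketch.
PERMISSION_RANK = {"DS": 0, "HMB": 1, "GRU": 2, "NRES": 3}


@dataclass(frozen=True)
class DatasetPolicy:
    """Hypothetical container for a dataset's DUO annotation."""
    permission: str                          # one permission term
    modifiers: FrozenSet[str] = frozenset()  # modifier terms, e.g. {"NCU", "IRB"}
    disease: Optional[str] = None            # only meaningful when permission is "DS"


@dataclass(frozen=True)
class AccessRequest:
    """Hypothetical description of what a researcher intends to do."""
    purpose: str                             # narrowest permission term covering the use
    disease: Optional[str] = None
    commercial: bool = False
    has_ethics_approval: bool = False


def check_request(policy: DatasetPolicy, request: AccessRequest) -> Tuple[bool, List[str]]:
    """Return (compatible, reasons); reasons lists every failed condition."""
    reasons: List[str] = []

    # Permission term: the requested purpose must not be broader than permitted.
    if PERMISSION_RANK[request.purpose] > PERMISSION_RANK[policy.permission]:
        reasons.append("purpose is broader than the dataset permission")
    if policy.permission == "DS" and request.disease != policy.disease:
        reasons.append("dataset is restricted to a different disease")

    # Modifier terms: extra requirements or prohibitions on top of the permission.
    if "NCU" in policy.modifiers and request.commercial:
        reasons.append("non-commercial use only")
    if "IRB" in policy.modifiers and not request.has_ethics_approval:
        reasons.append("ethics approval required")

    return (not reasons, reasons)


policy = DatasetPolicy("HMB", frozenset({"NCU", "IRB"}))
request = AccessRequest("DS", disease="diabetes", has_ethics_approval=True)
compatible, reasons = check_request(policy, request)
print(compatible, reasons)
```

A check of this kind can only pre-screen requests; under the assumptions above it does not replace review by a Data Access Committee.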

https://doi.org/10.5281/zenodo.5795449

Text mining processing pipeline for semi structured data - D3.3

Authors: Jenny Copara (SIB), Nona Naderi (SIB), Alexander Kellmann (UMCG), Gurinder Gosal (SFU), William Hsiao (SFU), Douglas Teodoro (SIB)

Unstructured and semi-structured cohort data contain relevant information about the health condition of a patient, e.g., free text describing disease diagnoses, drugs and medication reasons, which is often not available in structured formats. One of the challenges posed by medical free text is that a single concept can be mentioned in several ways. Encoding free text into unambiguous descriptors therefore allows us to leverage the value of the cohort data, in particular by facilitating its findability and interoperability across cohorts in the project.

Named entity recognition and normalization enable the automatic conversion of free text into standard medical concepts. Given the volume of available data shared in the CINECA project, the WP3 text mining working group has developed named entity normalization techniques to obtain standard concepts from unstructured and semi-structured fields available in the cohorts. In this deliverable, we present the methodology used to develop the different text mining tools created by the dedicated SFU, UMCG, EBI, and HES-SO/SIB groups for specific CINECA cohorts.
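To make the idea of normalization concrete, the sketch below maps different surface forms of the same mention to a single concept identifier using a plain dictionary lookup. This is a deliberately minimal stand-in and not the methodology of the tools described in the deliverable. The `SYNONYMS` table, the `EX:` identifiers and the `normalize_mention` function are all hypothetical; a real pipeline would map to a standard terminology.

```python
import re
from typing import Optional

# Hypothetical synonym table: several ways of mentioning a concept
# all point to the same (made-up) concept identifier.
SYNONYMS = {
    "heart attack": "EX:0001",
    "myocardial infarction": "EX:0001",
    "high blood pressure": "EX:0002",
    "hypertension": "EX:0002",
}


def normalize_mention(mention: str) -> Optional[str]:
    """Return the concept identifier for a free-text mention, or None if unknown."""
    # Lowercase, replace punctuation with spaces, then collapse whitespace.
    key = re.sub(r"[^a-z0-9 ]+", " ", mention.lower())
    key = " ".join(key.split())
    return SYNONYMS.get(key)


for text in ["Heart-Attack", "  Hypertension. ", "asthma"]:
    print(repr(text), "->", normalize_mention(text))
```

Exact lookup like this misses misspellings, abbreviations and unseen phrasings, which is one reason dedicated named entity recognition and normalization methods are needed for real cohort data.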

https://doi.org/10.5281/zenodo.5795433

CINECA Report on GA4GH Researcher ID - D2.2

Authors: Mikael Linden (CSC), Martin Kuba (MU), Jorge Izquierdo Ciges (EBI), Mamana Mbiyavanga (UCT)

GA4GH Passport (a.k.a. GA4GH Researcher ID) is the GA4GH standard for expressing an authenticated researcher’s roles and data access permissions (a.k.a. passport visas). Together with the GA4GH Authentication and Authorisation Infrastructure (AAI) specification, it describes how passport visas are issued and delivered from their authority (such as a Data Access Committee) to the environment where the data access takes place (such as an analysis platform or cloud). The work package has contributed to the Passport and AAI standards, which were approved by the GA4GH in October 2019.
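The sketch below shows, under simplifying assumptions, how an analysis environment might inspect passport visas to decide whether a researcher holds a grant for a given dataset. It assumes the visas have already been decoded and their signatures verified elsewhere, which this example does not do. The `ga4gh_visa_v1` claim with its `type`, `value`, `source` and `by` fields and the `ControlledAccessGrants` visa type follow the Passport specification; the `grants_access` function, the URLs and the timestamps are hypothetical.

```python
import time
from typing import Dict, Iterable, Optional, Set


def grants_access(
    visas: Iterable[Dict],
    dataset_id: str,
    trusted_sources: Set[str],
    now: Optional[float] = None,
) -> bool:
    """Return True if any unexpired visa from a trusted source grants the dataset.

    `visas` are decoded visa payloads. Signature verification is assumed to
    have happened before this function is called.
    """
    now = time.time() if now is None else now
    for visa in visas:
        claim = visa.get("ga4gh_visa_v1", {})
        if claim.get("type") != "ControlledAccessGrants":
            continue  # other visa types describe roles, affiliations, etc.
        if claim.get("value") != dataset_id:
            continue  # grant is for a different dataset
        if claim.get("source") not in trusted_sources:
            continue  # issuing authority is not one this environment trusts
        if visa.get("exp", 0) <= now:
            continue  # visa has expired
        return True
    return False


# Hypothetical decoded visa, as a Data Access Committee might assert it.
example_visa = {
    "iss": "https://broker.example.org/",
    "sub": "researcher-123",
    "exp": 2000,
    "ga4gh_visa_v1": {
        "type": "ControlledAccessGrants",
        "asserted": 1000,
        "value": "https://data.example.org/datasets/710",
        "source": "https://dac.example.org/",
        "by": "dac",
    },
}

print(grants_access(
    [example_visa],
    "https://data.example.org/datasets/710",
    {"https://dac.example.org/"},
    now=1500,
))
```

Passing `now` explicitly keeps the check deterministic for testing; in a running service the default of the current time would normally be used.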

This deliverable provides a general overview of the GA4GH Passport and AAI standards (version 1.0). It then describes in detail how ELIXIR AAI has implemented the specification as a Passport broker service. Some directions for the next version of the GA4GH Passport and AAI standards, which are still in development in the GA4GH DURI workstream, are outlined. Finally, a hands-on experiment shows how passports could alternatively leverage self-sovereign identity, an emerging identity and access management paradigm in the industry.

https://doi.org/10.5281/zenodo.5795407

The Data Use Ontology to streamline responsible access to human biomedical datasets

The GA4GH Data Use Ontology (DUO) provides unambiguous, machine-readable standard language for consent forms and the data sharing policies they represent. Lawson et al. describe the DUO standard and implementations throughout the data access workflow to expedite data access while maintaining or improving compliant processes.

https://doi.org/10.1016/j.xgen.2021.100028

GA4GH Passport standard for digital identity and access permissions

Voisin et al. report the GA4GH Passport, a new international standard to encode machine-readable data access permissions for individual users. Passports are used as part of a federated data regulatory process to authenticate and authorize data users in managing access to human biomedical datasets and have been successfully implemented in international research programs and data infrastructures.

https://doi.org/10.1016/j.xgen.2021.100030
