CINECA: Sharing patient genomic and biomolecular data across continents

This month we have a blog from Elisa Cirillo, co-lead of WP4 - Federated Joint Cohort Analysis. Elisa is a member of the team at The Hyve, Utrecht. The Hyve has been involved in the CINECA project since kickoff in January 2019. Over the past year WP4 has been developing the Federated Analysis Workflow. In this blog, Elisa focuses on the first deliverable from WP4, the Report on the trust model for partner sites, and between sites and controlled-access researchers.

The CINECA project aims to bring together patient genomic and biomolecular data from multiple institutions and biobanks spanning three continents, creating a virtual cohort of 1.4 million individuals globally. Performing analyses on such a large cohort should help scientists to better explore the effect of a range of mutations in humans, establishing links between genetic variants and disease.  It should also speed up the development of drugs tailored to genetic variants for some diseases such as cancer and cystic fibrosis.

To achieve this goal, WP4 is developing a federated cloud-based network. Some of the partners have already successfully implemented workflows for analysing patient genomic and biomolecular data. The challenge is now to scale up the effort and make the analysis available in different institutions and biobanks participating in the CINECA project. 

Personal health train

One issue you quickly encounter with such an initiative is that patient data often cannot be exported from the hospital or research institution for safety and privacy reasons. The solution that the CINECA project will adopt is the Personal Health Train concept. The idea behind this concept is that the data stays in its original physical location, but scientists from participating institutions can access the relevant data via a workflow that is adopted by the different partners. The advantage of this approach is that data can be shared selectively, without exposing privacy-sensitive information. At the same time, pooling cohorts means that researchers can analyse data from thousands or even a million individuals. This is a huge advantage for research areas such as genomics, where big data analysis is essential if you want to determine whether a mutation is associated with an increased or a decreased risk of certain diseases or types of cancer.
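
As an illustration of the idea described above, here is a minimal sketch in Python. It is not CINECA's implementation: the `Site` class, the `allele_summary` function and the toy genotype values are all hypothetical. It only shows the principle that the analysis travels to the data and only an aggregate travels back.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Site:
    """A hypothetical partner site; its records stay inside this object."""
    name: str
    _genotypes: List[int]  # alternate-allele count (0, 1 or 2) per individual

    def run(self, analysis: Callable[[List[int]], Dict[str, int]]) -> Dict[str, int]:
        # The analysis runs where the data lives; only its summary is returned.
        return analysis(self._genotypes)


def allele_summary(genotypes: List[int]) -> Dict[str, int]:
    """Return counts only, never individual-level records."""
    return {"alt_alleles": sum(genotypes), "total_alleles": 2 * len(genotypes)}


def pooled_allele_frequency(sites: List[Site]) -> float:
    """Combine the per-site summaries into one pooled allele frequency."""
    summaries = [site.run(allele_summary) for site in sites]
    alt = sum(s["alt_alleles"] for s in summaries)
    total = sum(s["total_alleles"] for s in summaries)
    return alt / total


# Toy data, invented for this example.
sites = [
    Site("site_a", [0, 1, 2, 0]),
    Site("site_b", [1, 1, 0, 0, 0, 2]),
]
print(pooled_allele_frequency(sites))
```

In a real deployment the call to `site.run` would be a request to a remote service behind the site's access controls, and each site would decide which summaries it is willing to release.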

The first year of WP4: Deliverable 4.1 Report on trust model

In the first year, WP4 set its goals and organised the work on developing a set of tools to facilitate federated analyses of new and diverse genetic and genomic datasets, based on specific use cases. The selected tools will use the common federated infrastructure established in WP1 - Federated Data Discovery and Querying and WP2 - Interoperable Authentication and Authorisation Infrastructure, and the datasets will be described with metadata standards identified in WP3 - Cohort Level Meta Data Representation.

This work has contributed to three outcomes: a description of the trust model and of four different levels of data access to specific cohorts' data; the identification of use cases for the development of federated analysis workflows; and a description of existing data access models to inspire subsequent WP4 deliverables related to the implementation of the federated analysis workflow. These topics are described extensively in the Deliverable 4.1 document, entitled: Report on trust model for partner sites, and between sites and controlled-access researchers.

[Figure: survey-data-transfer.png]

Active communication and engagement of several WP4 members in other CINECA work packages made it possible to include other aspects of the CINECA cohort data access model in this document. Examples include the WP2 cohort survey, the Data Use Ontology framework adopted by WP3 and WP1, and WP5's input on the harmonisation and formalisation of clinical use cases.

Key learnings from the deliverable report

In this deliverable we considered trust as the extent to which one party is willing to depend on the other party in a given situation with a feeling of relative security, even though negative consequences are possible. This definition applies to several circumstances evaluated in the report, such as cohort data sharing and the models of data access.

In addition, with this deliverable, the project has achieved the following:

  1. Description of different levels of data sharing and mapping of those levels to the cohorts available in CINECA.

  2. Identification of the use cases that will be implemented in (or possibly are dependent on) WP4, including the cohorts’ data that will be used.

  3. Identification of existing models of data access that will be considered by WP4 in the subsequent tasks of the project.

WP4 use cases

A useful outcome of the deliverable report is that the WP4 members have agreed on the use cases to focus on when implementing federated analysis workflows that provide a meaningful scientific output. In addition, identifying the use cases has guided the selection of CINECA cohorts suitable for the WP4 federated analysis, based on the data sharing levels and data access models described in the deliverable document.

The first use case involves running federated joint cohort variant genotyping across cohorts. Joint genotyping can be performed and implemented in various ways, based on data access levels of the participating cohorts. Data requirements could involve raw sequencing data (FASTQ/long read formats), reference-aligned data (BAM/CRAM) and/or individual genotypes (VCF/GVCF). 

The second use case concerns running a federated Expression Quantitative Trait Loci (eQTL) analysis across different cohorts. eQTL profiles are based on genetic variants that explain variation in gene expression levels. An eQTL may regulate different genes depending on the tissue type and disease state. Cohort data required to perform eQTL analysis are genotype (DNA-data) and gene expression data (RNA-data). 
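
To make the eQTL idea concrete, the sketch below tests one variant against one gene by regressing expression on genotype dosage using ordinary least squares. This is a simplified illustration, not the WP4 workflow: the function name `eqtl_effect` and the toy values are invented, and a real analysis would also include covariates, normalisation and multiple-testing correction.

```python
from typing import List, Tuple


def eqtl_effect(dosages: List[int], expression: List[float]) -> Tuple[float, float]:
    """Fit expression = intercept + slope * dosage by least squares.

    dosages: alternate-allele count (0, 1 or 2) per individual.
    expression: expression level of one gene in the same individuals.
    Returns (slope, intercept); the slope is the estimated per-allele effect.
    """
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    cov = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    var_g = sum((g - mean_g) ** 2 for g in dosages)
    slope = cov / var_g
    intercept = mean_e - slope * mean_g
    return slope, intercept


# Toy data, invented for this example: expression rises with allele count.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
slope, intercept = eqtl_effect(dosages, expression)
print(f"per-allele effect: {slope:.2f}, baseline: {intercept:.2f}")
```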

The third use case is running a federated Polygenic Risk Score (PRS) analysis across different cohorts, calculating the likelihood that an individual develops a particular phenotype given their genotype. A PRS weights trait-associated alleles across many loci. The data required from these cohorts are genotypes and phenotypes for the reference data, or a pre-existing PRS, together with genotypes and phenotypes from individuals of similar ethnic background to evaluate the efficacy of the PRS.
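
In its simplest form, a PRS is a weighted sum of allele counts: each trait-associated variant contributes its effect weight multiplied by the number of copies the individual carries. The sketch below shows that calculation only. The variant identifiers and weights are hypothetical, and treating a missing variant as zero copies is an assumption made for this example.

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, int], weights: Dict[str, float]) -> float:
    """Sum of weight * allele count over the variants in the score.

    dosages: effect-allele count (0, 1 or 2) per variant for one individual.
    weights: per-variant effect weights, e.g. taken from a pre-existing PRS.
    Variants missing from `dosages` count as 0 (an assumption of this sketch).
    """
    return sum(w * dosages.get(variant, 0) for variant, w in weights.items())


# Hypothetical variants and weights, invented for this example.
weights = {"variant_1": 0.2, "variant_2": -0.1, "variant_3": 0.05}
individual = {"variant_1": 2, "variant_2": 1, "variant_3": 0}
print(polygenic_risk_score(individual, weights))
```

Because the score is a sum over variants for a single individual, it can be computed at the site that holds the genotypes, with only the weights sent in and the scores or their summaries sent out.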

Future plans

Currently WP4 is focused on executing the three use cases mentioned above, which means establishing the technological infrastructure to run their workflows. The first stage of implementation is to gather the technical requirements and frameworks for a federated analysis platform. Although the data sources needed for the analyses vary, there are similarities in the overall architecture of the computing environments used.

Throughout the project, CINECA aims to be compatible with GA4GH standards, and our current proposal is that the APIs should be compatible with the GA4GH cloud API standards. The GA4GH Cloud Work Stream proposes four API standards that allow one to share tools and workflows (Tool Registry Service, TRS), execute individual jobs on computing platforms using a standard API (Task Execution Service, TES), run full workflows on execution platforms (Workflow Execution Service, WES), and read and write data objects across clouds in an agnostic way (Data Repository Service, DRS). These API standards are inspired by large-scale, distributed compute projects and, in theory, could be developed for different computing and data archive environments.
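
To give a feel for how these standards fit together, the sketch below builds the form fields for a WES run request, where the workflow is referenced by URL (for example, one obtained from a TRS registry) and the input is referenced by a DRS URI. The field names follow the WES 1.x run request, which is submitted as a form to the service's runs endpoint. Every hostname, identifier and parameter name is hypothetical, and nothing is sent over the network.

```python
import json
from typing import Any, Dict


def build_wes_run_request(
    workflow_url: str,
    workflow_type: str,
    workflow_type_version: str,
    params: Dict[str, Any],
) -> Dict[str, str]:
    """Return the form fields for a WES run request.

    WES expects workflow_params as a JSON-encoded string, so the
    parameters are serialised here rather than passed as a dict.
    """
    return {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,
        "workflow_type_version": workflow_type_version,
        "workflow_params": json.dumps(params),
    }


# Hypothetical values, invented for this example.
fields = build_wes_run_request(
    workflow_url="https://trs.example.org/workflows/prs-analysis/main.cwl",
    workflow_type="CWL",
    workflow_type_version="v1.0",
    params={"input_vcf": "drs://drs.example.org/hypothetical-object-id"},
)
print(json.dumps(fields, indent=2))
```

A client would submit these fields to the WES endpoint of the site that holds the data, so the workflow runs there and the input never has to be copied elsewhere.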