This month’s blog was written by Dylan Spalding (EMBL-EBI), Coordinator of the European Genome-phenome Archive and co-WPL of CINECA WP4 - Federated Joint Cohort Analysis. This blog is the first in our new series, presenting an overview of GA4GH standards being developed and implemented by CINECA.

Introduction

The aim of the CINECA project is to deliver a federated infrastructure for data discovery of human genetic and phenotypic data, facilitating transcontinental human data exchange for research and clinical applications. CINECA has assembled a virtual cohort of 1.4 million individuals from population, longitudinal and disease studies such as CanDIG, H3Africa, and the European Genome-phenome Archive (EGA). Key objectives of the CINECA project include developing solutions both to the challenges of delivering transcontinental security requirements for data access, and to the ethical, legal and societal commitments where data cannot move outside a legal jurisdiction.

To enable federated data discovery, data access, and research and analysis, it is crucial to implement the same standards across the federated network, so that the same or similar analyses can be performed across the different cohorts. As CINECA aims to support transcontinental federated analysis across cohorts located in Canada, Africa, and Europe, choosing these standards is critical to the success of the project. The Global Alliance for Genomics and Health (GA4GH) is a standards-setting organisation that aims to facilitate research in genomic and healthcare data. Some of the standards being developed by GA4GH, and their use in CINECA, are described here. Future posts will go into more detail on the development and implementation of some of these standards for different CINECA work packages.

GA4GH

GA4GH comprises driver projects and work streams (Figure 1). The driver projects have a specific set of use cases for one or more standards, and the work streams develop the standards required for the driver projects' use cases. While CINECA is not a GA4GH driver project, members of each work package are either co-leads or members of relevant work streams within GA4GH. For the reasons described above, it is essential to the success of CINECA that appropriate global standards are used across the CINECA project.

There are two main types of work streams within GA4GH: foundational work streams and technical work streams. The foundational work streams - Data Security, and Regulatory and Ethics (shown at the bottom of Figure 1) - cover core issues that relate to all GA4GH work, and as such provide guidance to both driver projects and technical work streams. The technical work streams develop and extend the technical standards that are required by the driver projects. The technical work streams include the Clinical and Phenotypic Data Capture, the Cloud, the Data Use and Researcher Identity, the Discovery, the Genomic Knowledge Standards, and the Large Scale Genomics work streams. The standards developed by the technical work streams must be reviewed by both the foundational work streams and a specially convened product review committee, before finally being approved by the GA4GH steering committee (https://www.ga4gh.org/how-we-work/ga4gh-product-approval/).

Standards

Many of the standards that have been developed by the GA4GH are already in use within CINECA or associated cohorts; below is a brief overview of some of those applicable standards, and how they facilitate CINECA processes.

Beacon

The Beacon standard is a GA4GH standard that supports queries on alleles (gene variants) while aiming to protect the identity of the participants. To do this, it is restricted to specific queries on the existence of an allele within a cohort, delivering a response which depends on the user’s access level. The Beacon standard allows anonymous users to perform queries, so that users need not apply for access to the data a priori. In this case the query will return a binary Yes or No as to the presence of the allele in the cohort. It also supports registered access, where a user has a specific identity and as such can access additional information, such as the allele frequency. Examples of registered users include those with ‘bona-fide’ status within the ELIXIR AAI (see the Passport section below), and this access level can be adapted depending on the requirements of the individual Beacon and the data it contains.

The Beacon standard allows for Beacons to be networked, so that a single query may be federated to multiple Beacons, for example a Beacon on each CINECA cohort. An extended Beacon standard (Beacon V2) is being developed in collaboration with CINECA WP1 - Federated Data Discovery and Querying. Plans for Beacon V2 include enabling queries by type, filtering of matched variants by additional conditions, and specification of access levels.
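
The tiered response and the networking of Beacons described above can be illustrated with a minimal Python sketch. It is not the Beacon API itself: the function names, the in-memory cohorts, the response keys, and the example allele are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# (chromosome, position, reference base, alternate base)
Allele = Tuple[str, int, str, str]


def query_beacon(cohort: Dict[Allele, float], allele: Allele,
                 access_level: str = "anonymous") -> dict:
    """Answer an allele-existence query, adding detail for registered users."""
    frequency: Optional[float] = cohort.get(allele)
    response = {"exists": frequency is not None}
    # Only registered users see more than the binary Yes/No answer.
    if access_level == "registered" and frequency is not None:
        response["alleleFrequency"] = frequency
    return response


def query_network(cohorts: Dict[str, Dict[Allele, float]], allele: Allele,
                  access_level: str = "anonymous") -> Dict[str, dict]:
    """Federate a single query to every Beacon in the network."""
    return {name: query_beacon(alleles, allele, access_level)
            for name, alleles in cohorts.items()}


# Hypothetical cohorts mapping alleles to allele frequencies.
COHORTS = {
    "cohort_a": {("17", 43045712, "G", "A"): 0.0012},
    "cohort_b": {},
}
```

An anonymous call to `query_network` returns only whether each cohort holds the allele, while a registered call to `query_beacon` on `cohort_a` also returns its frequency.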

Phenopackets

Transferring phenotypic data between different resources in a standardised way is a requirement to ensure that both phenotypic queries and data transfer are consistent across a federated network. Phenopackets provide the information model that different levels of clinical and phenotypic data require in order to be exchanged. CINECA WP3 - Cohort Level Meta Data Representation are collaborating with GA4GH on developing this standard.

Passport

The Passport defines a standard way to represent the role and access rights of a particular user. It can be used to indicate if a particular user can access registered access resources, such as the Beacon, using the ‘bona-fide’ researcher attribute as implemented by ELIXIR (https://elixir-europe.org/services/compute/aai/bonafide). It also details which datasets the user has full controlled access to (via Controlled Access Visas), their affiliation(s), and their roles. CINECA WP2 - Interoperable Authentication and Authorisation Infrastructure are closely involved with implementing this standard across the CINECA cohorts, and bringing the CINECA use case to GA4GH as the standard evolves. [See here for a previous blog on this implementation by Michal Procházka.]
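
As a rough illustration of how a service might use a Passport, the Python sketch below checks a list of visas for the two kinds of access mentioned above. It assumes the visas have already been verified and decoded into plain dictionaries (in practice each visa is a signed token that must be validated first, and expiry must be checked). The `type` and `value` field names and the visa type names follow our reading of the GA4GH Passport specification; the dataset identifier and the other values are hypothetical.

```python
from typing import Iterable, Mapping


def is_bona_fide(visas: Iterable[Mapping[str, str]]) -> bool:
    """True if any visa asserts researcher status (registered access)."""
    return any(v.get("type") == "ResearcherStatus" for v in visas)


def has_dataset_access(visas: Iterable[Mapping[str, str]], dataset: str) -> bool:
    """True if a Controlled Access Visa grants access to the given dataset."""
    return any(
        v.get("type") == "ControlledAccessGrants" and v.get("value") == dataset
        for v in visas
    )


# Hypothetical, already-verified visa claims for one researcher.
VISAS = [
    {"type": "ResearcherStatus", "value": "https://example.org/bona-fide"},
    {"type": "ControlledAccessGrants", "value": "https://example.org/datasets/D1"},
]
```

With these visas, the researcher qualifies for registered access and for controlled access to `D1`, but not to any other dataset.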

Authentication and Authorisation Infrastructure

For Passports to work, there must be a standard for the infrastructure that details how to define the user's identity, and how to pass this information between federated cohorts in a consistent way. GA4GH approved the Authentication and Authorisation Infrastructure (AAI) standard for this purpose, and this leverages the existing OpenID Connect (OIDC) standard to ensure maximum compatibility with existing identities. By making the AAI standard OIDC compatible, users may link and use different OIDC identities to access different resources, for example a Google identity to access a particular cohort. As with the Passport standard, CINECA WP2 are helping to implement this standard across the cohorts.

Data Use Ontology

There is a diverse range of conditions which different cohorts apply to restrict use of their data. These can range from geographical restrictions, to restricting research to particular purposes, to requiring results to be published. To standardise the way these different requirements are communicated, the Data Use Ontology (DUO) was developed. This ensures that equivalent data use conditions are expressed in the same way across different cohorts. This is especially important when the cohorts are based in different legal jurisdictions, as they are with CINECA. The DUO also facilitates data discovery. For example, when a researcher is investigating heart disease, the DUO can be used to ensure that only datasets that can be used for research into heart disease are returned as possible cohorts of interest. CINECA WP3 are closely involved with extending the ontology, based on the requirements of the cohorts.
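
The heart disease example can be sketched as a simple filter in Python. This is a deliberately simplified illustration: the dataset records are hypothetical, the labels GRU (general research use), HMB (health/medical/biomedical research) and DS (disease-specific research) are used as DUO shorthand, and diseases are compared as plain strings rather than through a disease ontology as a real implementation would.

```python
from typing import Dict, List

# Hypothetical datasets annotated with a DUO-style data use permission.
DATASETS: List[Dict[str, str]] = [
    {"id": "cohort-A", "duo": "GRU"},
    {"id": "cohort-B", "duo": "DS", "disease": "heart disease"},
    {"id": "cohort-C", "duo": "DS", "disease": "diabetes"},
    {"id": "cohort-D", "duo": "HMB"},
]


def permitted_for_disease(dataset: Dict[str, str], disease: str) -> bool:
    """Decide whether a dataset may be used for research into a disease."""
    duo = dataset["duo"]
    if duo in {"GRU", "HMB"}:
        # General and biomedical research permissions cover disease research.
        return True
    if duo == "DS":
        # Disease-specific datasets are limited to the named disease.
        return dataset.get("disease") == disease
    return False  # Unknown permissions are treated as not permitted.


def discover(datasets: List[Dict[str, str]], disease: str) -> List[str]:
    """Return the ids of datasets usable for research into the disease."""
    return [d["id"] for d in datasets if permitted_for_disease(d, disease)]
```

Here `discover(DATASETS, "heart disease")` excludes `cohort-C`, whose data may only be used for diabetes research.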

Htsget

Htsget is a data streaming specification to allow the secure transfer of genomic data between locations. The standard supports range queries, for example allowing the user to just stream the gene of interest from a whole genome CRAM file, instead of downloading the whole file. This standard is crucial for the work of both CINECA WP4 - Federated Joint Cohort Analysis and WP5 - Healthcare Interoperability and Clinical Applications, as it allows the transfer of data to the appropriate cloud.
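
A range query of this kind might be built as in the Python sketch below. The `/reads/<id>` path and the `format`, `referenceName`, `start` and `end` parameters follow our reading of the htsget specification; the server address, the file identifier and the coordinates are hypothetical. The sketch only constructs the request URL. A real client would then send it with the user's credentials and fetch the data blocks listed in the server's response.

```python
from urllib.parse import urlencode


def htsget_reads_url(base_url: str, read_id: str, reference_name: str,
                     start: int, end: int, fmt: str = "CRAM") -> str:
    """Build an htsget URL requesting only one genomic region of a file."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    query = urlencode({
        "format": fmt,
        "referenceName": reference_name,
        "start": start,  # htsget coordinates are 0-based, half-open
        "end": end,
    })
    return f"{base_url.rstrip('/')}/reads/{read_id}?{query}"


# Hypothetical server, file identifier and region of interest.
URL = htsget_reads_url("https://htsget.example.org", "sample1",
                       "chr17", 43044294, 43125364)
```

Only the reads overlapping the requested region are streamed, rather than the whole genome file.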

Crypt4GH

Genomic data contain sensitive information on individuals, and as such preventing accidental disclosure of these data is extremely important. Crypt4GH is a file format that stores such data in an encrypted state. It supports fast decryption of the data, and also allows random access to the files without having to decrypt the whole file. Such advantages are leveraged by other standards, such as htsget. This standard has applications within individual cohorts where the data are stored, but also for WP4 and WP5 to allow secure data storage local to the compute infrastructure that performs the analyses.
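
To give a feel for why random access is possible, the Python sketch below works out which encrypted segments cover a requested byte range. It assumes, based on our reading of the Crypt4GH specification, that the data are encrypted in independent segments of 65,536 plaintext bytes, each stored with a 12-byte nonce and a 16-byte authentication tag. The function names are hypothetical, and the header length varies per file, so it is passed in as a parameter.

```python
SEGMENT_PLAINTEXT = 65536            # plaintext bytes per segment (assumed)
SEGMENT_OVERHEAD = 12 + 16           # nonce + authentication tag (assumed)
SEGMENT_CIPHERTEXT = SEGMENT_PLAINTEXT + SEGMENT_OVERHEAD


def segments_for_range(start: int, length: int) -> range:
    """Indices of the segments that hold plaintext bytes [start, start+length)."""
    if start < 0 or length <= 0:
        raise ValueError("expected start >= 0 and length > 0")
    first = start // SEGMENT_PLAINTEXT
    last = (start + length - 1) // SEGMENT_PLAINTEXT
    return range(first, last + 1)


def ciphertext_offset(segment_index: int, header_length: int) -> int:
    """Byte offset in the encrypted file where a given segment begins."""
    return header_length + segment_index * SEGMENT_CIPHERTEXT
```

Because each segment can be decrypted on its own, a reader needs to fetch and decrypt only the segments returned by `segments_for_range`, not the whole file.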

File Formats

GA4GH standards include the VCF, BAM, and CRAM file formats among others (https://github.com/samtools/hts-specs). These ensure that genomic information is consistently represented, and help underpin the bioinformatic analysis tools and APIs. For example, using such consistent file formats allows the implementation of htsget on Crypt4GH encrypted CRAM files, allowing secure federated access to remote genetic data. These standards are applicable to the cohorts, but also to WP1, as they allow discovery of the appropriate data; to WP3, as the metadata should help define the types and format of data present in each cohort; and to WP4 and WP5, as their analyses run on these data and so need to know a priori which format the data are in.

Task Execution Service

When data may not leave a certain jurisdiction, as defined by DUO, data analysis requires a way of representing the task or analysis that is interoperable between different locations and compute infrastructure. The Task Execution Service (TES) defines a standardised way of defining such a task. This includes the input files, the containers to execute, the output files, and any required logging or metadata. In CINECA the primary WPs which use this standard are WP4 and WP5.
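
A task description of this kind might look like the Python sketch below, which assembles a task as a dictionary ready to be serialised to JSON. The `name`, `inputs`, `outputs` and `executors` fields follow our reading of the TES schema; the container image, command, URLs and paths are hypothetical, and the sketch does not submit the task to any server.

```python
import json
from typing import Dict, List, Sequence, Tuple


def make_tes_task(name: str, image: str, command: Sequence[str],
                  inputs: List[Tuple[str, str]],
                  outputs: List[Tuple[str, str]]) -> Dict[str, object]:
    """Describe an analysis as a TES-style task: files in, container, files out."""
    return {
        "name": name,
        # Each input is copied from a remote URL to a path inside the container.
        "inputs": [{"url": url, "path": path} for url, path in inputs],
        # Each output is copied from a container path to a remote URL.
        "outputs": [{"url": url, "path": path} for url, path in outputs],
        "executors": [{"image": image, "command": list(command)}],
    }


# Hypothetical variant re-calling task run next to the data.
TASK = make_tes_task(
    name="recall-variants",
    image="example.org/variant-caller:1.0",
    command=["call-variants", "/data/sample.bam", "/data/sample.vcf"],
    inputs=[("s3://cohort-bucket/sample.bam", "/data/sample.bam")],
    outputs=[("s3://results-bucket/sample.vcf", "/data/sample.vcf")],
)
TASK_JSON = json.dumps(TASK, indent=2)
```

Because the task is described independently of the compute infrastructure, the same description can in principle be sent to whichever cloud is permitted to hold the data.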

Example Use Case

An example of how all these standards can work together to facilitate federated analysis can be seen in the use case where a researcher is looking for data relating to a specific gene and associated phenotype. The researcher can query a Beacon, or the more powerful Beacon V2 being developed by WP1 in collaboration with GA4GH and others, to determine if there are any cohorts with data that may be applicable to the proposed research. The query can include the gene of interest, plus the proposed research type using DUO, so that only applicable datasets are returned. Once a list of applicable datasets with relevant information is returned, the researcher can apply for access to these datasets using their federated identity, for example an ELIXIR identity (https://elixir-europe.org/services/compute/aai). Once the application has been successful, the list of datasets the researcher has access to is contained as Controlled Access Visas within their GA4GH Passport. The researcher can then log into any services supported by the relevant cohort to access these data. For example, they could use htsget to stream the relevant gene to a genomic viewer to view the coverage over that gene in the BAM file, and to access the associated variants in the VCF file. Additionally, the researcher may run an analysis, for example re-calling the variants from the BAM file, using a remote cloud and specifying the analysis pipeline to run using TES.

Figure 2: After data has been uploaded to EGA and a user granted access, genomic regions can be streamed securely to remote locations using the htsget standard. This allows visualisation of regions of files, for example using the Integrative Genomics Viewer (IGV), or even on-the-fly analysis of files hosted at EGA, for example in the Embassy Cloud.

Conclusion

Some of these standards are already in use within CINECA. For example, the European Genome-phenome Archive maintains a Beacon, supports access using the AAI and Passport standards (allowing users to link their EGA and ELIXIR identities to access data), and uses the Data Use Ontology to help find relevant datasets, Phenopackets for phenotypic data submission and distribution, and htsget for data distribution (Figure 2). EGA also uses the Crypt4GH standard for file encryption. In the coming months, we will continue our series of GA4GH blogs to provide more detail on the standards that CINECA is working to develop and implement.

Implementation of GA4GH standards in CINECA