Passport is the glue between the researcher, data and computing
This month’s blog was written by Mikael Linden, Senior application specialist at CSC - IT Center for Science, and co-Work Package Lead of CINECA WP2 - Interoperable Authentication and Authorisation Infrastructure. This blog is the third in our GA4GH standards series, presenting an overview of how GA4GH standards are being developed and implemented by CINECA. For the first blog in the series, giving a broad overview of how CINECA is facilitating federated data discovery, access and analysis, please see Dylan Spalding’s blog Implementation of GA4GH standards in CINECA.
Genomic and other human data is sensitive for ELSI (ethical, legal and social implications) reasons, so a researcher’s access to the data needs to be managed. The classic paradigm is called ‘controlled access’: a researcher may receive permission to access a sensitive dataset for secondary purposes by presenting a Data Access Request (DAR) to a Data Access Committee (DAC) representing the original data collector. Following a positive evaluation, the researcher and their home organisation, as the legally responsible entity, sign a Data Access Agreement (DAA) with the DAC, which grants the researcher permission to access the data.
Bring data to compute
In the classic approach, researchers who have been granted access to data receive the credentials (such as a username, password, download link and the necessary data decryption keys) to download the data from the data archive’s download service. It may also be possible to configure the credentials in client software that downloads the data directly to the analysis tool.
The data download is a one-off event, after which the original data collector loses line of sight to the dataset. The data downloader remains bound by the DAA, which typically imposes an obligation to protect the data appropriately, use a secure environment for computing with the data, and delete the data when the research is finished. However, the data collector has no technical controls to enforce these obligations.
Bring compute to data
In the “bring compute to data” paradigm, a researcher who has received the data access permissions from the DAC and signed the DAA does not download the data to their own environment. Instead, the datasets are made available to them in a secure environment where they can bring and run their analysis tool and export only the results. The secure environment is provided by the data archive or a third party (such as a cloud provider) and can, for instance, make use of the GA4GH Cloud standards.
The “bring compute to data” paradigm emphasises the role that the Authentication and Authorisation Infrastructure (AAI) plays in data access. When the researcher accesses the data in the computing environment instead of downloading it, the environment needs to make sure that (a) the researcher is the same person who received the data access permission from the DAC and (b) the permission is still in effect. For instance, the data access permission expires in the computing environment if the research project ends or the researcher leaves their home organisation.
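The two checks can be sketched in a few lines of Python. This is only an illustration: the `Permission` class, the `may_access` function and the identifiers are hypothetical names chosen for this example, not part of any CINECA or GA4GH software.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Permission:
    """A DAC decision as seen by the computing environment (hypothetical shape)."""
    researcher_id: str    # identity the DAC granted access to
    dataset_id: str       # dataset covered by the decision
    expires_at: datetime  # when the permission stops being effective


def may_access(permission, logged_in_id, dataset_id, now):
    """Allow access only if checks (a) and (b) both pass for this dataset."""
    same_person = permission.researcher_id == logged_in_id  # check (a)
    still_valid = now < permission.expires_at               # check (b)
    same_dataset = permission.dataset_id == dataset_id
    return same_person and still_valid and same_dataset


# Example permission; the identifiers and the date are made up.
grant = Permission(
    researcher_id="researcher-123",
    dataset_id="dataset-A",
    expires_at=datetime(2030, 1, 1, tzinfo=timezone.utc),
)
```

The important point is that the check runs every time the researcher accesses the data, so an expired permission stops working without anyone having to recall a downloaded copy.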
The three relations of a researcher can be illustrated as a triangle whose corners are:
The DAC, which represents the project that originally collected the data and makes the decision to grant the researcher access to it.
The cloud, which provides a secure computing environment where the datasets are made available to the researchers who have received permissions from the DAC.
The home organisation, which signs the Data Access Agreement with the DAC, and potentially also an agreement with the cloud provider about the use of cloud capacity for the analysis.
A researcher’s access to the data and cloud is effective only as long as they are employed by (or otherwise represent) their home organisation. If the researcher’s affiliation with the home organisation comes to an end, access should be closed promptly. Optimally, the researcher’s continuing affiliation is checked each time they log in to the cloud environment. The AAI enables this by relying on federated identity management and home organisation logins, as mediated for instance by the eduGAIN interfederation service. The federated login also passes the researcher’s fresh affiliation from their home organisation to the computing environment.
So far, there has been no common standard which the three corners of the triangle can use to communicate. GA4GH Passport introduces the concept of visas, which a DAC issues to a researcher when they receive permission to access a dataset. A visa is a digitally signed token that describes an identified researcher’s permission to access an identified dataset, and the expiration of the permission. Optimally, the visa is coupled to the researcher’s affiliation at the home organisation. When the researcher later accesses the dataset in a cloud, they need to present their visa and demonstrate that their affiliation at the home organisation is still effective (for instance, by logging in with the home organisation credentials).
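To make the idea of a signed visa concrete, here is a minimal Python sketch using only the standard library. It rests on several assumptions: the claim names (`sub`, `exp`, `ga4gh_visa_v1` with `type`, `value`, `source`, `asserted`) are modelled on the layout of a GA4GH visa and should be checked against the specification; the HMAC signature with a shared demo key is a stand-in for the public-key signature a real visa would carry; and `issue_visa`, `verify_visa` and the example URLs are hypothetical.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical demo key. A real visa issuer would sign with a private key
# and the cloud would verify with the issuer's published public key.
DAC_KEY = b"demo-key-not-for-production"


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii")


def _sign(payload: str) -> str:
    return _b64(hmac.new(DAC_KEY, payload.encode("ascii"), hashlib.sha256).digest())


def issue_visa(researcher, dataset, asserted, expires):
    """DAC side: describe the permission and sign it."""
    claims = {
        "sub": researcher,                     # the identified researcher
        "exp": expires,                        # expiry, seconds since epoch
        "ga4gh_visa_v1": {
            "type": "ControlledAccessGrants",
            "value": dataset,                  # the identified dataset
            "source": "https://dac.example.org",
            "asserted": asserted,              # when the DAC made the decision
        },
    }
    payload = _b64(json.dumps(claims, sort_keys=True).encode("utf-8"))
    return payload + "." + _sign(payload)


def verify_visa(token, now):
    """Cloud side: return the claims if the visa is authentic and unexpired."""
    try:
        payload, signature = token.split(".")
    except ValueError:
        return None                            # not a two-part token
    if not hmac.compare_digest(signature, _sign(payload)):
        return None                            # tampered with or wrong issuer
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if now >= claims["exp"]:
        return None                            # permission has expired
    return claims
```

Because the signature covers both the dataset and the expiry time, the cloud can trust a visa it receives from the researcher without contacting the DAC for every request.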
WP2 has implemented Passports in the AAI
CINECA WP2 has implemented GA4GH Passport support in ELIXIR AAI. An ELIXIR user can log in using their home organisation, enabling them to attach their affiliation to their passport. CINECA WP2 has also delivered REMS (Resource Entitlement Management System), an electronic tool that DACs can use to review DARs. Following a positive evaluation by the DAC, the REMS tool issues a GA4GH Visa for the dataset that is stored in EGA, the European Genome-phenome Archive. When the researcher wants to access the datasets in a secure computing environment, ELIXIR AAI pulls together their visas and presents them to the computing environment for access control enforcement.
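The aggregation and enforcement steps described above can be sketched as follows. This is not how ELIXIR AAI or REMS is implemented: `build_passport` and `accessible_datasets` are hypothetical helpers, visas are shown as already verified and decoded dictionaries rather than signed tokens, and the claim and visa type names (`ga4gh_passport_v1`, `ga4gh_visa_v1`, `ControlledAccessGrants`, `AffiliationAndRole`) are taken from the GA4GH Passport specification as assumptions to be checked against it.

```python
def build_passport(subject, visas_by_issuer):
    """AAI side: pull together the subject's visas from all issuers."""
    visas = [
        visa
        for issuer in sorted(visas_by_issuer)
        for visa in visas_by_issuer[issuer]
        if visa["sub"] == subject
    ]
    return {"sub": subject, "ga4gh_passport_v1": visas}


def accessible_datasets(passport, now):
    """Cloud side: datasets the passport holder may use at time `now`."""
    current = [v for v in passport["ga4gh_passport_v1"] if v["exp"] > now]
    # Without a current affiliation visa, no dataset access is granted.
    affiliated = any(
        v["ga4gh_visa_v1"]["type"] == "AffiliationAndRole" for v in current
    )
    if not affiliated:
        return set()
    return {
        v["ga4gh_visa_v1"]["value"]
        for v in current
        if v["ga4gh_visa_v1"]["type"] == "ControlledAccessGrants"
    }


# Example input with made-up issuers, identifiers and timestamps.
visas_by_issuer = {
    "home-organisation": [
        {"sub": "researcher-123", "exp": 2000,
         "ga4gh_visa_v1": {"type": "AffiliationAndRole",
                           "value": "faculty@example.org"}},
    ],
    "rems": [
        {"sub": "researcher-123", "exp": 3000,
         "ga4gh_visa_v1": {"type": "ControlledAccessGrants",
                           "value": "dataset-A"}},
    ],
}
passport = build_passport("researcher-123", visas_by_issuer)
```

In this sketch the dataset visa outlives the affiliation visa, so once the affiliation lapses the researcher loses access even though the DAC permission itself has not yet expired.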
Links:
REMS screencast video on REMS/EGA integration at CINECA AGM 2021
Passport poster at ELIXIR AHM 2020
ELIXIR Webinar on access control enforcement in a cloud in 2018
Acronyms used:
DAA Data Access Agreement
DAC Data Access Committee
DAR Data Access Request
ELSI Ethical, Legal and Social Implications
GA4GH Global Alliance for Genomics and Health
REMS Resource Entitlement Management System