Joint variant genotyping use case
CINECA WP4 aims to implement a technical framework to run different types of federated analysis. It will enable the analysis of input information that is split across several controlled access human data cohorts without exporting the raw data. It does so by bringing the analysis to the data in a secure cloud-based infrastructure.
WP4 recently delivered a simple demonstrator pipeline that performs a federated joint variant genotyping analysis. The goal of this use case is to demonstrate how a simple metric (in this case, allele frequency) can be computed in a federated manner, without ever collecting the individual-level data in a central location.
The use case pipeline has been implemented following a federated genomic analysis approach that was designed and developed in 2020 by the WP4 members.
This approach is meant to be adopted as a technical framework for all CINECA use cases that need to run different types of federated analysis. The framework takes into account GA4GH standards which were summarised in a previous CINECA blogpost.
One of these GA4GH standards is the Task Execution Service (TES) API, a schema and API for describing and executing batch tasks. A task defines a set of input files, a set of containers and commands to run, a set of output files, and some logging and metadata. TES servers accept task documents and execute them asynchronously on available compute resources. A TES server can be built on top of a traditional HPC queuing system, such as Grid Engine or Slurm, or on a cloud-style compute system, such as AWS Batch or Kubernetes. The reference implementation for TES is TESK.
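To make the task description above concrete, here is a minimal sketch of a TES task document assembled in Python. The field names (inputs, outputs, executors, resources) follow the TES schema; the container image, the count_alleles.sh command, and the storage URLs are hypothetical placeholders and are not taken from the CINECA pipelines.

```python
import json


def build_tes_task(name, image, command, inputs, outputs):
    """Assemble a TES task document as a plain dict.

    inputs and outputs are lists of (url, path) pairs: url is where the
    file lives in storage, path is where it appears inside the container.
    """
    return {
        "name": name,
        "inputs": [{"url": url, "path": path} for url, path in inputs],
        "outputs": [{"url": url, "path": path} for url, path in outputs],
        # One executor = one container image plus the command to run in it.
        "executors": [{"image": image, "command": command}],
        # Resource request; the values here are arbitrary for illustration.
        "resources": {"cpu_cores": 1, "ram_gb": 2.0},
    }


task = build_tes_task(
    name="count-alleles",
    image="example.org/genotyping-tools:latest",  # hypothetical image
    command=["count_alleles.sh", "/data/cohort.vcf.gz", "/data/counts.tsv"],  # hypothetical script
    inputs=[("s3://cohort-bucket/cohort.vcf.gz", "/data/cohort.vcf.gz")],  # hypothetical URL
    outputs=[("s3://results-bucket/counts.tsv", "/data/counts.tsv")],  # hypothetical URL
)

# This JSON is what a client would send to a TES server when creating a task.
print(json.dumps(task, indent=2))
```

A TES server that receives such a document stages the inputs, runs the executor, and copies the outputs back to storage, so the same document could be handed to a server backed by Slurm or by Kubernetes.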
The joint variant genotyping demonstrator is the first use case to have successfully used the technical framework for federated analysis. It computes simple metrics, such as allele numbers and allele counts, from two different datasets located in different environments. The per-dataset results are then combined by a second pipeline, and the combined result can be exported for further analysis. This video explains how to run the pipelines, which are written in Nextflow.
The general approach for each use case is to split the analysis pipeline into two parts:
Part A of the pipeline can in principle be run in the environments of the appropriate cohorts and reduces private, individual-level data to intermediate summary-level products.
The results of the analysis from the different cohorts are collected at a central location.
Part B of the pipeline aggregates the summary-level products into the final scientific product, which is made available to the end user.
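The Part A / Part B split can be illustrated with the allele frequency metric used by the demonstrator. The sketch below is written in Python rather than Nextflow and is not the project's code: the function names, the variant identifier format, and the toy genotypes are all assumptions, and variants are assumed to be biallelic. Only the per-variant allele count (AC) and allele number (AN) leave each cohort.

```python
from collections import defaultdict


def part_a_count_alleles(genotypes):
    """Part A (runs inside one cohort): reduce genotypes to (AC, AN) per variant.

    genotypes maps a variant id to a list of per-sample calls, e.g. (0, 1).
    0 is the reference allele, 1 the alternate allele, None a missing call.
    """
    summary = {}
    for variant, calls in genotypes.items():
        alleles = [a for call in calls for a in call if a is not None]
        allele_count = sum(1 for a in alleles if a > 0)  # AC: alternate alleles seen
        allele_number = len(alleles)                     # AN: alleles actually called
        summary[variant] = (allele_count, allele_number)
    return summary


def part_b_combine(summaries):
    """Part B (central): add up the per-cohort counts and derive AF = AC / AN."""
    totals = defaultdict(lambda: [0, 0])
    for summary in summaries:
        for variant, (ac, an) in summary.items():
            totals[variant][0] += ac
            totals[variant][1] += an
    return {
        variant: (ac, an, ac / an if an else float("nan"))
        for variant, (ac, an) in totals.items()
    }


# Toy individual-level data; in the federated setting it never leaves the cohort.
cohort_1 = {"chr1:1000:A:G": [(0, 1), (1, 1), (0, 0)]}
cohort_2 = {"chr1:1000:A:G": [(0, 1), (0, None)]}

combined = part_b_combine([part_a_count_alleles(cohort_1),
                           part_a_count_alleles(cohort_2)])
print(combined)
```

Because AC and AN are plain sums, the combined frequency equals the one that would be obtained from the pooled genotypes, which is what makes this metric straightforward to federate. The sketch leaves out the normalisation of variants between datasets that a real pipeline has to perform before counts can be added.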
Both Part A and Part B of each pipeline must be written as Nextflow workflows.
The Nextflow manager acts as a WES: it interprets the workflow provided and sends the individual tasks to TESK using the GA4GH standard API. Improving Nextflow's GA4GH TES compatibility required a few upstream pull requests to the Nextflow repository (#1589, #1664, #1666, #1696). The code and documentation for this work can be found on the CINECA GitHub page.
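The division of labour described here, where a workflow manager breaks a workflow into tasks and hands each one to a TES server, can be sketched in a few lines of Python. This is not how Nextflow is implemented: FakeTesClient is a hypothetical in-memory stand-in for a TES server, run_workflow is an illustrative name, and tasks are run strictly one after another for simplicity. The state names are those defined by the TES specification.

```python
import time

# TES states after which a task will not change any more.
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


class FakeTesClient:
    """Hypothetical in-memory stand-in for a TES server (not a real client library)."""

    def __init__(self):
        self._tasks = {}

    def create_task(self, task):
        # A real server would queue the task and return its id immediately.
        task_id = f"task-{len(self._tasks) + 1}"
        self._tasks[task_id] = {"task": task, "polls": 0}
        return task_id

    def get_state(self, task_id):
        record = self._tasks[task_id]
        record["polls"] += 1
        # Pretend every task is still running on the first poll.
        return "RUNNING" if record["polls"] < 2 else "COMPLETE"


def run_workflow(client, tasks, poll_interval=0.0):
    """Submit tasks in order, waiting for each to reach a terminal state."""
    states = {}
    for task in tasks:
        task_id = client.create_task(task)
        state = client.get_state(task_id)
        while state not in TERMINAL_STATES:
            time.sleep(poll_interval)
            state = client.get_state(task_id)
        states[task["name"]] = state
        if state != "COMPLETE":
            break  # stop the workflow at the first failed task
    return states


result = run_workflow(FakeTesClient(), [{"name": "count-alleles"}, {"name": "merge-counts"}])
print(result)
```

Because the manager only talks to the server through task creation and state lookups, the same workflow can be pointed at any TES-compatible backend without changing the workflow itself.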
Thanks to Nextflow's versatility, pipelines designed in this way support a variety of scenarios and can run in most existing computing environments. Instructions are provided for running them on:
In summary, this demonstrator shows that the pipeline can run in different environments, access the data via different protocols, and apply the appropriate normalisations to the datasets. The next step for WP4 members is to apply the same common federation framework to the other two use cases, which involve more elaborate pipelines: eQTL analysis and Polygenic Risk Score computation.
Links:
https://github.com/CINECA-project/wp4-federated-joint-cohort-analysis/blob/master/4.3-pipelines/README.md
https://github.com/CINECA-project/wp4-federated-joint-cohort-analysis/tree/master/4.3-pipelines/demonstrators/4.3.1-genotyping
https://www.cineca-project.eu/blog-all/a-common-framework-for-designing-portable-federated-pipelines