Exploring Secure Multi-Party Computation for Clinical Cohort Management: A Feasibility Study
Visit at the EBI, experience and outcomes.
During my stay at the European Bioinformatics Institute (EBI), I had the opportunity to learn from the knowledgeable team and to explore from their suggestions.
Author: Esteban Gaillac
Affiliation: HESGE, Patrick Ruch group.
Secure multi-party computation (SMPC) holds great promise for conducting machine learning tasks on sensitive and distributed datasets while maintaining data privacy. In this blog post, we present a feasibility study conducted on the Medical Information Mart for Intensive Care (MIMIC) dataset to evaluate the application of SMPC in clinical cohort management. We discuss our objectives, methods, and results, and provide a comprehensive analysis of the implications and limitations of using SMPC in this context.
During my stay at the European Bioinformatics Institute (EBI), I had the opportunity to learn from the knowledgeable team and explore their suggestions. As I come from machine learning, this visit provided valuable insights and opened new avenues for exploration. I had the chance to understand much more concerning the EBI and especially the European Genome-Phenome Archive (EGA). What are the issues, goals, who are the people working at the EGA, and what do they do? How is the organization structured? I will also be able to pursue a search on SMPC now that I had the chance to explore at the EBI. The next step would be to try working on a gene and search for genome-wide possibilities.
Objectives
Our study aimed to achieve the following objectives:
Evaluate the feasibility of implementing SMPC on the MIMIC dataset.
Assess the additional costs associated with SMPC, including computation, storage, and its impact on analytics.
Investigate how SMPC can support fair access to data within the context of FEGA.
Determine the potential of SMPC in conducting diagnosis support tasks using genomic data.
Methods
We began by acquiring access to the MIMIC-III dataset, a rich collection of clinical information from critical care units. We explored the application of SMPC to perform computations on this dataset while preserving data privacy. However, we encountered a limitation with genomic data, as SMPC is designed for numerical inputs rather than letter-based sequences. To work around this limitation, we should focus our analysis on a select few genes rather than genome-wide computations.
We will be performing a logistic regression on the MIMIC dataset, more precisely on admission’s length at the hospital regarding their diagnosis. The dataset is embedded, as shown in Figure 1. We then compare the quality of results obtained with and without SMPC while closely monitoring the computational costs involved.
In that case, we need to give each diagnosis an identification, just like we would for sequences in a gene. Give an identification for each sequence and encrypt a gene that way.
Results
Our findings revealed that SMPC implementation on the MIMIC dataset showed promising results, with a threefold increase in computational costs compared to traditional methods, as we can see in Table 1. This increase can be attributed to the additional overhead introduced by the privacy-preserving techniques used in SMPC. Despite this higher computational cost, the privacy protection provided by SMPC remains a valuable advantage in the context of clinical data.
The additional cost being the encryption of data, it only depends on the input size. In our case, it is three times more costly, but the input isn’t very big.
Computation time for a logistic regression | |
Without SMPC | 7 seconds |
With SMPC | 21 seconds |
Discussion
While our feasibility study demonstrated the potential of SMPC in clinical cohort management, it is important to acknowledge its limitations. The inability to directly apply SMPC to genomic data due to its non-numerical nature poses a challenge. However, by narrowing the focus to specific genes, we were able to conduct meaningful analysis and gain insights.
The increased computational costs associated with SMPC raise important considerations, particularly in resource-constrained environments. Further optimizations and advancements in SMPC techniques are required to minimize these costs while maintaining the desired level of data privacy.
Conclusion
In conclusion, our feasibility study on SMPC for clinical cohort management using the MIMIC dataset highlighted the benefits and challenges of employing this privacy-preserving technique. While SMPC proved effective in protecting data privacy, its application to genomic data requires special considerations. We recommend exploring alternative methods or adaptations to address this limitation. Additionally, efforts should be directed towards optimizing the computational efficiency of SMPC to ensure its practicality in large-scale genomic datasets.
Despite these challenges, SMPC showcases its potential to enable FAIR access to data while respecting local jurisdiction requirements. Further research and collaboration in this area are crucial for unlocking the full potential of SMPC in clinical research and cohort management.
References
Medical Information Mart for Intensive Care (MIMIC): https://registry.opendata.aws/mimiciii/
https://mit-lcp.github.io/mimic-schema-spy/tables/admissions.html
Secure Multi-Party Computation: https://arxiv.org/abs/2109.00984