Powering up data discovery and access using the Data Use Ontology
This month’s blog was written by Melanie Courtot, metadata standards coordinator at EMBL-EBI and co-Work Package Lead of CINECA WP3 - Cohort Level Metadata Representation. This blog is the fourth in our Global Alliance for Genomics and Health (GA4GH) standards series, presenting an overview of how GA4GH standards are being developed and implemented by CINECA. In our April post about Passport, Mikael from CINECA WP2 explained the importance of controlled access for protecting sensitive data, federated data access in the cloud, and how Passport enables researchers to authenticate, that is, to prove they are who they say they are.
An important complement to authenticating users is establishing that a researcher is authorised to access the data: given who they are, their institution and their project, they should be allowed to access it.
CHALLENGES IN DATA ACCESS
Datasets available to researchers have varying ethical or legal conditions for secondary data use - derived from informed consent processes or other authorisations (e.g., laws, policies or agreements). For example, some datasets are available only for non-commercial organisations, preventing access by pharmaceutical companies, or consented only for research about specific diseases. Ethical requirements may mandate that ancestry research must not be performed, and legal frameworks may forbid transferring data to another country.
Every institution uses its own language in its informed consent forms to describe the secondary use restrictions and conditions on its datasets. This means that each data access request must be manually evaluated against the data use limitations that specify how the dataset can be used. Consequently, Data Access Committees typically respond to such requests in two to six weeks, considerably slowing down the pace of research.
To address this challenge, we have developed the Data Use Ontology (DUO), a hierarchical vocabulary of terms representing permissions associated with secondary data use. DUO allows datasets to be annotated consistently and unambiguously; each DUO term is developed by the community and includes human-readable metadata such as a definition and an example of usage. The DUO hierarchy was improved in the February 2021 release based on user feedback, and now reflects the functional split between permissions and the additional modifiers that further specify those permissions.
This allows DUO to provide an unambiguous, shared understanding of data use conditions. DUO terms are encoded in the machine readable W3C standard OWL Web Ontology Language, and follow Open Biological and Biomedical Ontologies development principles. A researcher can query the European Genome-phenome Archive (EGA) at EMBL’s European Bioinformatics Institute and the Centre for Genomic Regulation, or any database that has implemented DUO, for discovery of datasets annotated with DUO terms, to only retrieve data that matches their intended use.
DUO can also be implemented for automated matching, allowing authenticated users to gain access to datasets whose conditions are compatible with their research. For example, an industry researcher working on cancer would be matched by a DUO-powered algorithm to every dataset that allows both commercial use and cancer research, and offered the opportunity to fetch those datasets automatically.
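The matching idea can be sketched in a few lines of Python. This is an illustrative toy, not the DUO matching algorithm itself: the class and function names are hypothetical, only a small slice of the hierarchy is modelled, and the term abbreviations (GRU, HMB, DS, NRES, NCU, NPU) should be checked against the released ontology before being relied upon.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

# Toy slice of the permission hierarchy (child -> parent), an assumption for
# illustration: disease specific research (DS) is a kind of health/medical/
# biomedical research (HMB), which is a kind of general research use (GRU).
PARENT = {"DS": "HMB", "HMB": "GRU"}


def ancestors_or_self(term: str) -> List[str]:
    """Return the term followed by all of its ancestors in the toy hierarchy."""
    chain = [term]
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


@dataclass(frozen=True)
class DatasetConsent:
    permission: str                    # e.g. "GRU", "HMB", "DS", "NRES"
    disease: Optional[str] = None      # only meaningful when permission == "DS"
    modifiers: FrozenSet[str] = frozenset()  # e.g. {"NCU"}, {"NPU"}


@dataclass(frozen=True)
class AccessRequest:
    purpose: str                       # the research purpose, as a permission term
    disease: Optional[str] = None
    commercial: bool = False
    non_profit_org: bool = True


def is_compatible(consent: DatasetConsent, request: AccessRequest) -> bool:
    """Check a request against a dataset's permission and two modifiers."""
    # Modifiers: non-commercial use only, not-for-profit organisations only.
    if "NCU" in consent.modifiers and request.commercial:
        return False
    if "NPU" in consent.modifiers and not request.non_profit_org:
        return False
    # No restriction: any purpose is acceptable.
    if consent.permission == "NRES":
        return True
    # The dataset's permission must be the requested purpose or a broader term.
    if consent.permission not in ancestors_or_self(request.purpose):
        return False
    # Disease specific consent additionally requires the same disease.
    if consent.permission == "DS":
        return consent.disease == request.disease
    return True


# An industry researcher working on cancer, as in the example above.
request = AccessRequest("DS", disease="cancer", commercial=True, non_profit_org=False)

datasets = {
    "A": DatasetConsent("GRU"),
    "B": DatasetConsent("DS", disease="cancer", modifiers=frozenset({"NCU"})),
    "C": DatasetConsent("DS", disease="diabetes"),
    "D": DatasetConsent("HMB"),
}

matches = sorted(name for name, c in datasets.items() if is_compatible(c, request))
print(matches)
```

A production matcher would also need to handle the remaining modifiers (geographical restrictions, ethics approval and so on) and compare diseases through a disease ontology rather than by string equality.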
DUO STEP-BY-STEP
At data deposition time, the data depositor provides their datasets annotated with DUO terms. These terms can originate natively from consent forms when the forms follow the Machine readable consent guidance, or can be derived from the consent forms by the depositor.
At data request time, a scientist encodes their research purpose using DUO terms. The Data Access Committee can rely on the DUO matching algorithm or make a manual determination of access permissions.
The Data Access Committee approval, if granted, is shared with the data repository, and the dataset is made available to the requestor.
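The three steps above can be sketched as a small Python model. Everything here is hypothetical and for illustration only: the `Repository` class, the `dac_decision` function and the dataset and user names are not part of DUO or of any repository's API, and the automated rule is a deliberately crude stand-in that ignores the term hierarchy and modifiers.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class Repository:
    """Hypothetical data repository holding DUO annotations and approvals."""
    annotations: Dict[str, Set[str]] = field(default_factory=dict)
    approvals: Set[Tuple[str, str]] = field(default_factory=set)

    def deposit(self, dataset_id: str, duo_terms: Set[str]) -> None:
        # Step 1: the depositor provides the dataset annotated with DUO terms.
        self.annotations[dataset_id] = set(duo_terms)

    def record_approval(self, requester: str, dataset_id: str) -> None:
        # Step 3a: the committee's approval is shared with the repository.
        self.approvals.add((requester, dataset_id))

    def fetch(self, requester: str, dataset_id: str) -> str:
        # Step 3b: the dataset is released only to approved requesters.
        if (requester, dataset_id) not in self.approvals:
            raise PermissionError(f"{requester} is not approved for {dataset_id}")
        return f"{dataset_id} released to {requester}"


def dac_decision(dataset_terms: Set[str], request_terms: Set[str],
                 manual: Optional[bool] = None) -> bool:
    """Step 2: the committee decides manually or relies on automated matching."""
    if manual is not None:
        return manual  # a manual determination always takes precedence
    # Stand-in for the matching algorithm: every term in the request must
    # appear among the dataset's annotations (assumption, not DUO's real rule).
    return request_terms <= dataset_terms


repo = Repository()
repo.deposit("cohort-1", {"HMB"})

request_terms = {"HMB"}  # the scientist encodes their research purpose
if dac_decision(repo.annotations["cohort-1"], request_terms):
    repo.record_approval("alice", "cohort-1")

released = repo.fetch("alice", "cohort-1")
print(released)
```

Keeping the decision function separate from the repository mirrors the division of responsibility described above: the committee decides, and the repository only enforces the recorded outcome.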
DUO HAS BEEN IMPLEMENTED WORLDWIDE
CINECA WP3 has implemented GA4GH DUO to annotate cohort data from H3Africa. Based on their feedback, several improvements have been made to the ontology, for example creating new terms to differentiate between non-commercial entities accessing the data, and entities accessing the data for non-commercial purposes.
As of April 2021, DUO has been used in over 200,000 annotations worldwide, and its community of users keeps growing. In CINECA, the CHILD cohort study is in the process of reviewing its consent forms to annotate its datasets using DUO terms. Further contributions to the DUO standard are encouraged on the DUO issue tracker.
DUO is distributed under CC-BY, and the latest released version of DUO is always available at http://purl.obolibrary.org/obo/duo.owl. DUO can be browsed online using the EMBL-EBI Ontology Lookup Service. Documentation is available from the DUO GitHub repository.
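For readers who want to inspect the released file programmatically, the sketch below lists term identifiers and labels using only the Python standard library. It assumes the file served at the address above is in the RDF/XML serialisation of OWL, with classes declared as top-level owl:Class elements carrying an rdfs:label; the function names are hypothetical, and a dedicated RDF library would be more robust for anything beyond a quick look.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import Dict

DUO_PURL = "http://purl.obolibrary.org/obo/duo.owl"

# Standard namespace URIs for OWL, RDF and RDFS.
NS = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}
RDF_ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"


def parse_labels(owl_xml: bytes) -> Dict[str, str]:
    """Map each named class IRI to its rdfs:label in an RDF/XML document."""
    root = ET.fromstring(owl_xml)
    labels = {}
    for cls in root.findall("owl:Class", NS):
        iri = cls.get(RDF_ABOUT)          # anonymous classes have no IRI
        label = cls.find("rdfs:label", NS)
        if iri and label is not None and label.text:
            labels[iri] = label.text.strip()
    return labels


def fetch_duo_labels(url: str = DUO_PURL) -> Dict[str, str]:
    """Download the ontology and return its class labels (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return parse_labels(response.read())


# Offline demonstration on a hand-written fragment in the assumed format.
SAMPLE = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/DUO_0000042">
    <rdfs:label>general research use</rdfs:label>
  </owl:Class>
  <owl:Class rdf:about="http://example.org/unlabelled"/>
</rdf:RDF>
"""

sample_labels = parse_labels(SAMPLE)
print(sample_labels)
```

Calling `fetch_duo_labels()` with no arguments would run the same parser against the live file.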