Assigning standard descriptors to free text
This post is part of a series on a text-mining pipeline being developed by CINECA in Work Package 3. The first instalment, "Uncovering metadata from semi-structured cohort data", explained the Zooma and Curami pipelines; the second, "LexMapr - A rule-based text-mining tool for ontology term mapping and classification", introduced LexMapr. In this third instalment we describe the normalisation pipeline developed at SIB/HES-SO.
Normalisation pipeline
Our text mining team has been working on a normalisation pipeline for biomedical free text, focused primarily on the free text fields identified in the CoLaus/PsyCoLaus cohort. The pipeline is summarised in Figure 1. It combines MetaMap, a machine learning and rule-based tool developed by the US National Library of Medicine, with a learning to rank model to extract information from free text and map it to concepts of the UMLS Metathesaurus, a meta-ontology that covers and integrates NCI Thesaurus, ICD-10 and HPO concepts, among others. The first step in the pipeline is to query MetaMap with a free text passage; MetaMap returns a ranked list of candidate concepts for the input text. To improve precision, the next step re-orders this candidate list with a learning to rank model. Finally, the first k candidate concepts in the re-ordered list are taken as the normalised concepts for the input text.
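The overall flow can be sketched in a few lines of Python. Note that `query_metamap`, `rerank` and the `Candidate` structure below are illustrative placeholders we introduce for the sketch, not the actual interfaces of the pipeline.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    cui: str              # UMLS Concept Unique Identifier, e.g. "C1956346"
    preferred_name: str   # preferred name of the concept in the UMLS
    metamap_score: float  # relevance score reported by MetaMap

def query_metamap(text: str) -> list[Candidate]:
    """Placeholder: run MetaMap on the passage and parse its ranked
    candidate concepts (step 1)."""
    raise NotImplementedError

def rerank(text: str, candidates: list[Candidate]) -> list[Candidate]:
    """Placeholder: re-order MetaMap's candidates with a trained
    learning to rank model (step 2)."""
    raise NotImplementedError

def normalise(text: str, k: int = 1) -> list[Candidate]:
    """Map a free text passage to its top-k UMLS concepts (step 3)."""
    candidates = query_metamap(text)
    reranked = rerank(text, candidates)
    return reranked[:k]

# e.g. normalise("coronary artery disease", k=1) would ideally return a
# single Candidate whose CUI is C1956346 (see Table 1 below).
```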
Several learning to rank algorithms based on neural networks were explored, including RankMSE, RankNet, LambdaRank, ListNet and ListMLE. In our experiments, RankMSE achieved the best performance on the learning to rank task. We are also investigating MetaMap parameters that could generate a better set of candidates for the learning to rank step.
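As an illustration, a pointwise RankMSE-style objective can be written in a few lines of PyTorch. The feature set, network size and labelling scheme below are assumptions made for this sketch, not the configuration used in our experiments.

```python
import torch
import torch.nn as nn

class CandidateScorer(nn.Module):
    """Scores one (mention, candidate concept) feature vector per candidate."""
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        # x: (n_candidates, n_features) -> one relevance score per candidate
        return self.net(x).squeeze(-1)

scorer = CandidateScorer(n_features=8)
optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-3)
mse = nn.MSELoss()

def rankmse_step(features: torch.Tensor, relevance: torch.Tensor) -> float:
    """One RankMSE-style training step: regress predicted scores onto gold
    relevance labels (e.g. 1.0 for the correct CUI, 0.0 for the rest)."""
    optimizer.zero_grad()
    loss = mse(scorer(features), relevance)
    loss.backward()
    optimizer.step()
    return loss.item()

# At inference time the MetaMap candidates are simply re-ordered by score:
#   order = torch.argsort(scorer(features), descending=True)
#   reranked = [candidates[i] for i in order]
```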
Dataset
To assess the pipeline, we use the Medical Concept Normalisation (MCN) corpus (1), in which mentions are manually annotated with UMLS concepts. The corpus contains 100 discharge summaries with 3,792 unique annotated concepts, covering medical problems, treatments and tests. See Table 1 for an example of an MCN annotation.
| Sentence in data | The patient is a 60-year-old male with a past medical history notable for coronary artery disease and CABG x2 in 2001. | |
|---|---|---|
| Annotation | Biomedical phrase | coronary artery disease |
| | UMLS CUI | C1956346 |

Table 1: Annotation example in MCN
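For illustration, each gold annotation can be thought of as a (document, span, CUI) triple. The field names and the note identifier below are our own shorthand for this sketch, not the official MCN file format.

```python
from dataclasses import dataclass

@dataclass
class MCNAnnotation:
    note_id: str   # discharge summary identifier
    start: int     # character offset where the mention begins
    end: int       # character offset where the mention ends (exclusive)
    mention: str   # the annotated biomedical phrase
    cui: str       # gold UMLS Concept Unique Identifier

sentence = ("The patient is a 60-year-old male with a past medical history "
            "notable for coronary artery disease and CABG x2 in 2001.")
mention = "coronary artery disease"
start = sentence.index(mention)

example = MCNAnnotation(
    note_id="discharge_001",   # hypothetical identifier
    start=start,
    end=start + len(mention),
    mention=mention,
    cui="C1956346",            # the CUI from Table 1
)
```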
Results and next steps
For the normalisation task, we achieved an accuracy of 74.21% using MetaMap alone and 76.12% when the learning to rank step is included. Our next step will be to apply this pipeline to the free text fields of CoLaus/PsyCoLaus. Mapping free text to standard UMLS concepts will improve the 'Findability' of the cohort data and thereby its FAIRness.
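For reference, the accuracy above corresponds to checking whether the gold CUI is the top-ranked candidate. A minimal sketch of such an accuracy@k computation follows; the helper and all CUIs other than C1956346 are made up for the example.

```python
def accuracy_at_k(gold_cuis, predicted_rankings, k=1):
    """Fraction of mentions whose gold CUI appears among the first k
    candidates of the (re-)ranked list; k=1 gives plain accuracy."""
    hits = sum(
        gold in ranking[:k]
        for gold, ranking in zip(gold_cuis, predicted_rankings)
    )
    return hits / len(gold_cuis)

# Two mentions, only the first one normalised correctly at rank 1:
print(accuracy_at_k(
    gold_cuis=["C1956346", "C0011849"],
    predicted_rankings=[["C1956346", "C0010054"], ["C0011860", "C0011849"]],
))  # 0.5
```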
Putting the Pipeline Together
Together with the other specialised text normalisers developed in the project, such as LexMapr, Zooma and Curami, our approach will be exposed as web services that cohort data owners can use to normalise free text attributes in their data. The SIB/HES-SO text normalisation pipeline will specialise in extracting diagnosis concepts from text data.
Access
The code of the normalisation pipeline developed by SIB/HES-SO will become available in a public repository.
References
(1) Luo YF, Sun W, Rumshisky A. MCN: A comprehensive corpus for medical concept normalization. Journal of Biomedical Informatics. 2019 Feb 22:103132.