LexMapr - A rule-based text-mining tool for ontology term mapping and classification
This blog post is the second in a series on a text-mining pipeline being developed by CINECA. The previous instalment is "Uncovering metadata from semi-structured cohort data".
The Common Infrastructure for National Cohorts in Europe, Canada, and Africa (CINECA) has a vision of a federated, cloud-enabled infrastructure that makes population-scale genomic and biomolecular data accessible across international borders, accelerating research and improving the health of individuals across continents. CINECA’s Work Package 3 (WP3) addresses the metadata representation needs for cohort aggregate and individual data across studies and over time. The CINECA text-mining group in WP3 has been collaborating to provide tools and methods to harmonize data from structured, semi-structured and narrative content fields in cohorts’ data. The group has begun developing a text-mining pipeline that combines the different approaches of the contributing partners, initially focussing on linked cohorts. The integrated pipeline draws on the MetaMap rule-based framework (HES-SO/SIB), the LexMapr rule-based text-mining tool (SFU) and the Zooma entity annotation tool (EMBL-EBI).
Background to the development of LexMapr
At SFU, we have developed LexMapr, a rule-based text-mining tool that cleans up and parses short, unstructured text to extract biomedical entities and map them to standard ontology terms. To complement the other text-mining tools in WP3, LexMapr supports harmonization by focussing on narrative content, i.e. the field values of cohort data. It combines basic lexicographic transformation with Natural Language Processing (NLP), synonymy, ontology and other resource lexicons to produce a tokenized equivalent description suitable for keyword and ontology-driven search of database contents. The LexMapr pipeline addresses many challenges in the processing of short biomedical phrases. The tool was initially developed to fulfill the biosample metadata harmonization objectives of public health surveillance networks such as the US FDA’s GenomeTrakr system and the US National Antimicrobial Resistance Monitoring System (NARMS). These networks aim to harmonize short phrases describing food pathogen source data using standard vocabularies, for reporting transmission dynamics in public health foodborne pathogen surveillance and investigation.
Cleaning and ontology term mapping
The initial focus of LexMapr development has been on providing a text-mining tool that cleans up short free-text biosample metadata containing inconsistent punctuation, abbreviations and typos, and maps the identified entities to standard terms from ontologies. Because biosample phrases occupy a narrowly focused semantic domain, and short phrases pose very specific challenges, we have employed a rule-based approach that draws upon wide-ranging lexical resources. LexMapr implements different rules for the pre-processing, normalization, entity recognition and ontology term mapping tasks, and makes use of customized domain-specific lexicons for abbreviation and acronym normalization, non-English usage, and spelling correction.
LexMapr pre-processes the input biosample descriptions through a series of steps for data cleaning, punctuation and case treatment, singularization and spelling correction. This pre-processing phase improves output by providing cleaned phrases to the subsequent entity recognition and term mapping steps. The normalization phase then transforms the entities to their normalized forms before term mapping is performed: LexMapr normalizes abbreviations, acronyms and non-English usage in biosample descriptions by successively applying rules to the pre-processed phrases. In the term mapping phase, LexMapr applies several rules to the pre-processed and normalized samples to detect relevant entities and map them to ontology terms. Two key food biosample domain ontologies, FoodOn and GenEpiO, which cover clinical, epidemiological and food semantics, have been selected as the target ontologies for standardizing biosamples.
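The pre-processing and normalization phases described above can be sketched in a few lines of Python. This is a minimal illustration, not LexMapr's actual code: the function names, the three tiny lexicons and the naive singularization rule are all assumptions made for the example, and the real resource lexicons are much larger.

```python
import re

# Hypothetical miniature lexicons (assumptions for illustration only).
ABBREVIATIONS = {"frz": "frozen"}
NON_ENGLISH = {"methi": "fenugreek"}
SPELLING = {"mackeral": "mackerel"}


def singularize(token: str) -> str:
    """Naive singularization: drop a trailing 's' from longer tokens."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(phrase: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, singularize, fix spelling."""
    cleaned = re.sub(r"[^\w\s]", " ", phrase.lower())
    tokens = [singularize(t) for t in cleaned.split()]
    return [SPELLING.get(t, t) for t in tokens]


def normalize(tokens: list[str]) -> list[str]:
    """Expand abbreviations, then substitute non-English terms."""
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return [NON_ENGLISH.get(t, t) for t in tokens]


if __name__ == "__main__":
    for sample in ["Fish-meal", "Pecans", "Mackeral", "frz. lobster tails"]:
        print(sample, "->", normalize(preprocess(sample)))
```

A trailing-"s" rule is far too crude for real data (it would mangle a word such as "sapiens"), which is one reason a curated, lexicon-backed approach is needed in practice.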
Different rules have been implemented to deal with irregular case usage, long names, naming variations and word ordering in input phrases and ontology term labels, and with suffix addition to the input text. When a biosample description does not mention an entity directly, LexMapr can, where possible, map it indirectly to a standard ontology term by making use of synonyms. For synonym substitution, LexMapr primarily uses the exact synonyms of standard terms available in the selected ontologies. If none is found there, LexMapr looks for additional specimen-domain synonyms in SynLex, a customized lookup table that houses synonyms not yet available in the ontologies; these are candidates for curation and inclusion in the corresponding ontologies. Where multiple mappings are obtained, the mapped set of terms is further refined by ontology-driven pruning, which uses the hierarchical structure of the ontologies to retain the more specific terms. Figure 1 shows the high-level architecture of LexMapr and its different enabling components.
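To make the synonym fallback and the pruning step concrete, here is a small Python sketch. It is an assumption-laden illustration rather than LexMapr's implementation: the dictionaries stand in for the ontologies and for SynLex, the assignment of "stool" to the ontology synonyms and "bird" to SynLex is arbitrary, and the parent/child hierarchy is invented for the example.

```python
# Hypothetical miniature resources; term IDs reuse examples from this post.
ONTOLOGY_LABELS = {
    "feces": "uberon_0001988",
    "avian animal": "foodon_00002616",
}
ONTOLOGY_EXACT_SYNONYMS = {"stool": "feces"}   # assumed to come from the ontology
SYNLEX = {"bird": "avian animal"}              # assumed to come from SynLex

# Invented child -> parent links, used only to demonstrate pruning.
PARENTS = {"fish (frozen)": "fish", "fish": "animal food product"}


def map_term(phrase):
    """Try a direct label match, then ontology synonyms, then SynLex."""
    for lookup in (None, ONTOLOGY_EXACT_SYNONYMS, SYNLEX):
        label = phrase if lookup is None else lookup.get(phrase)
        if label in ONTOLOGY_LABELS:
            return label, ONTOLOGY_LABELS[label]
    return None


def ancestors(term):
    """Collect every term above `term` in the hierarchy."""
    found = set()
    while term in PARENTS:
        term = PARENTS[term]
        if term in found:  # guard against accidental cycles
            break
        found.add(term)
    return found


def prune(terms):
    """Drop any mapped term that is an ancestor of another mapped term."""
    general = set()
    for term in terms:
        general |= ancestors(term)
    return [term for term in terms if term not in general]


if __name__ == "__main__":
    print(map_term("stool"))
    print(map_term("bird"))
    print(prune(["fish", "fish (frozen)"]))
```

Trying the ontology synonyms before SynLex mirrors the order described above, so curated ontology content always takes precedence over the local lookup table.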
Table 1 shows a snapshot of term mapping and classification results exemplifying the usage of different rules and treatments.
Specimen description | Matched ontology terms with IDs | Rule applied | IFSAC+ classification |
---|---|---|---|
Fish-meal | fish meal:foodon_03301620 | Punctuation treatment | ['fish'] |
quail, frozen | quail:foodon_03411346, frozen:pato_0001985 | Punctuation treatment | ['avian'] |
Soil | soil:envo_00001998 | Change of case | ['environmental'] |
Garlic Powder | garlic powder:foodon_03301844 | Change of case | ['root/underground (bulbs)'] |
stool | feces:uberon_0001988 | Synonym substitution | ['clinical/research'] |
bird | avian animal:foodon_00002616 | Synonym substitution | ['companion animal'] |
Pecans | pecan (whole, raw):foodon_03315232 | Singularization | ['nuts'] |
sesame seeds | sesame seed:foodon_03310306 | Singularization | ['seeds'] |
Mackeral | mackerel:foodon_03411043 | Spelling correction treatment | ['fish'] |
homo spaiens; Stool | homo sapiens:ncbitaxon_9606, feces:uberon_0001988 | Spelling correction treatment | ['clinical/research', 'human'] |
spice mix | spice mixture:foodon_03304292 | Abbreviation-Acronym treatment | ['herbs'] |
frz frog legs | frog leg (frozen):foodon_03305167 | Abbreviation-Acronym treatment | ['other aquatic animals'] |
poultry | poultry meat food product:foodon_00001131 | Suffix addition (meat food product) to the input | ['poultry'] |
sediment, stream | stream sediment:envo_00002127 | Permutation of tokens in input text | ['environmental'] |
sesame seeds, hulled | sesame seed (hulled):foodon_03304876 | Permutation of tokens in bracketed resource term | ['seeds'] |
methi | fenugreek food product:foodon_00001837 | Non-English substitution treatment | ['herbs'] |
tulsi powder | basil food product:foodon_00003044, food (powdered):foodon_00002976 | Non-English substitution treatment | ['herbs'] |
frz. Fish | fish (frozen):foodon_03301083 | Multiple rules: Abbreviation-Acronym treatment, Permutation of tokens in bracketed resource term | ['fish'] |
frz cooked shrimp | shrimp (cooked, frozen):foodon_03308827 | Multiple rules: Abbreviation-Acronym treatment, Permutation of tokens in bracketed resource term | ['crustaceans'] |
frz. lobster tails | lobster tail (frozen):foodon_03305435 | Multiple rules: Permutation of tokens in bracketed resource term, Inflection (Plural) treatment, Abbreviation-Acronym treatment | ['crustaceans'] |
Table 1. A snapshot of term mapping and classification results obtained by LexMapr based on the application of different rules/treatments.
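The two "permutation of tokens" rules in Table 1 can be illustrated with an order-insensitive lookup. The sketch below is an assumption about one way to achieve this, not a description of LexMapr's internals: instead of enumerating permutations, it compares sorted token tuples, which gives the same result for matching purposes. The function names and the three-entry resource table are hypothetical, and the input tokens are assumed to be already pre-processed and normalized.

```python
import re

# Hypothetical resource table built from three rows of Table 1.
RESOURCE_TERMS = {
    "stream sediment": "envo_00002127",
    "sesame seed (hulled)": "foodon_03304876",
    "shrimp (cooked, frozen)": "foodon_03308827",
}


def token_key(label):
    """Order-insensitive key: sorted word tokens, ignoring brackets and commas."""
    return tuple(sorted(re.findall(r"\w+", label.lower())))


# Index every resource label by its order-insensitive key.
INDEX = {
    token_key(label): (label, term_id)
    for label, term_id in RESOURCE_TERMS.items()
}


def match_permutation(tokens):
    """Return (label, id) if some ordering of the tokens equals a resource label."""
    return INDEX.get(tuple(sorted(tokens)))


if __name__ == "__main__":
    print(match_permutation(["sediment", "stream"]))
    print(match_permutation(["sesame", "seed", "hulled"]))
    print(match_permutation(["frozen", "cooked", "shrimp"]))
```

Sorting avoids the factorial cost of generating every permutation, at the price of treating word order as entirely irrelevant, which may over-match on longer phrases.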
Ontology-driven classification
Once LexMapr has linked free text to standard ontology terms, the resulting mappings provide a platform for many potential ontology-driven applications. The tool includes functionality to classify input phrases according to institution-specific classification schemes. This functionality was first used for the ontology-driven classification of biosample metadata based on the Interagency Food Safety Analytics Collaboration (IFSAC) scheme, an epidemiology-focused food classification scheme for categorizing foods implicated in outbreaks, provided by GenomeTrakr and NARMS.
For the classification task, LexMapr uses predefined nodes of ontologies as buckets (containers) to characterise specific third-party classes (Figure 2). The LexMapr pipeline classification component provides functionality for biosamples to be linked to these buckets (ontology IDs) and hence to be categorized according to third-party classes (IFSAC initially).
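The bucket idea can be sketched as a walk up the ontology hierarchy from each mapped term, collecting the class of every bucket node encountered. The code below is a hedged illustration only: the node names, the parent links and the bucket table are invented placeholders (real buckets would be ontology IDs), and the traversal is an assumption about how such a lookup could work.

```python
# Invented child -> parents links; placeholders for real ontology structure.
PARENTS = {
    "mackerel": ["fish_bucket_node"],
    "quail": ["avian_bucket_node"],
    "fish_bucket_node": ["food_root"],
    "avian_bucket_node": ["food_root"],
}

# Hypothetical bucket table: ontology node -> third-party class.
BUCKETS = {
    "fish_bucket_node": "fish",
    "avian_bucket_node": "avian",
}


def classify(term):
    """Return the classes of all bucket nodes at or above `term`."""
    classes, stack, seen = set(), [term], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in BUCKETS:
            classes.add(BUCKETS[node])
        stack.extend(PARENTS.get(node, []))
    return sorted(classes)


if __name__ == "__main__":
    print(classify("mackerel"))
    print(classify("quail"))
```

Because a term may have several parents, one phrase can fall into more than one bucket, which matches the multi-valued classifications shown in Table 1.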
To support the specific requirements of third-party classification schemes, LexMapr uses multiple classification rules to further refine the preliminary classification results. For example, a post-refinement rule classifies an input phrase such as "macaroni and cheese" as “Multi-ingredient” (an IFSAC class) when it contains more than one food ingredient combined together. When applied to real-world GenomeTrakr and NARMS biosample metadata, LexMapr exposed and reported the incompleteness of the existing third-party scheme in describing a variety of biosamples. Subsequent deliberations with GenomeTrakr and NARMS have led to the development of an enhanced classification scheme, “IFSAC+”, which has greater coverage of the biosamples. LexMapr has enabled the reorganization of classes and the introduction of many new classes in the scheme, and has generated many new candidate terms for curation and inclusion in the ontologies. For example, LexMapr has helped ontologies such as FoodOn to add new terms, and the use of their synonyms has enabled the capture of terms previously missed in GenomeTrakr food descriptions. Work is in progress to equip LexMapr with a mechanism for performing ontology-driven classification configured to any institution-specific classification scheme provided by the user.
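A post-refinement rule of the kind described above might look like the following sketch. Everything here is assumed for illustration: the ingredient lexicon, the function name, the lowercase class label and the preliminary classes used in the example are not taken from LexMapr or from the IFSAC scheme itself.

```python
# Hypothetical lexicon of terms that count as food ingredients.
FOOD_INGREDIENTS = {"macaroni", "cheese"}


def refine(matched_terms, preliminary_classes):
    """Override the preliminary classes when several ingredients are combined."""
    ingredients = {term for term in matched_terms if term in FOOD_INGREDIENTS}
    if len(ingredients) > 1:
        return ["multi-ingredient"]
    return preliminary_classes


if __name__ == "__main__":
    # Preliminary classes below are placeholders, not real LexMapr output.
    print(refine(["macaroni", "cheese"], ["grains", "dairy"]))
    print(refine(["cheese"], ["dairy"]))
```

Running refinement after bucket classification keeps the scheme-specific logic separate from the generic ontology traversal, so a different scheme can supply its own rules.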
Next steps
Although LexMapr was initially developed to serve the biosample domain, our general approach to cleaning and harmonizing data can be used to address different cohort domains in the CINECA project by adding selected domain-specific ontologies and rules. The tool has recently been configured to allow customized selection of ontologies and lexical resources. LexMapr is being adapted to provide a framework for the automated clean-up of common errors and inconsistencies in the field descriptions that describe different features in cohort domains, such as disease, physical environment and laboratory measures. We will also be enhancing the tool to accept input documents of additional types, e.g. Case Report Forms, cohort text descriptions from cohort-specific databases, and quality reports on samples.
Access
The LexMapr source code is publicly available at https://github.com/Public-Health-Bioinformatics/LexMapr. LexMapr is available both as a locally installable command-line tool and via a Django-based website providing a simple graphical interface (http://watson.bccdc.med.ubc.ca:8000/lexmapr/), whose usability and functionality are being enhanced. We would like to acknowledge our funding bodies: USDA (Agreement Number: 58-8040-8-014-F), CIHR (Award reference: PJT-159456) and Genome British Columbia / Genome Canada (286GET).