On October 15, 2015, Todor Primov, a healthcare expert with Ontotext, presented Mining Electronic Health Records for Insights: Beyond Ontology Based Text Mining. This webinar highlighted some of the challenges in text mining clinical patient data and the solutions that Ontotext provides to overcome them.
The presentation also addressed many of the issues raised in our earlier blog post Overcoming the Next Hurdle in the Digital Healthcare Revolution: EHR Semantic Interoperability.
During the webinar, Todor covered some of the challenges in applying NLP to clinical patient data and the solutions that Ontotext provides to overcome them.
Some really interesting questions were raised by the audience:
Q: Pre-coordinated vs. post-coordinated vocabularies. Why are pre-coordinated vocabularies still used? Are there any advantages of pre-coordinated compared to post-coordinated vocabularies?
A: Many pre-coordinated ontologies are used primarily for medical coding, such as ICD-9-CM, ICD-10-CM and ICPC. In many use cases a particular medical observation must be identified and referred to unambiguously. For that purpose a fully qualified concept is needed, and pre-coordinated ontologies are a good reference source. By contrast, with post-coordinated ontologies we can model complex medical findings using relations between the “seed concept” and additional qualifiers or other classes of instances.
However, defining post-coordination patterns means that a finding is referenced not to a single concept but to a relation between concepts. Some ontologies, like SNOMED CT, benefit from both approaches. Which approach to apply is always a trade-off, and it is usually determined by the particular use case.
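The difference between the two styles can be illustrated with a small data-structure sketch. This is only an illustration: the `ex:` identifiers, the class names and the `post` helper are hypothetical and do not come from ICD, ICPC or SNOMED CT.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class PreCoordinated:
    """A finding referenced by one fully qualified concept."""
    code: str
    label: str


@dataclass(frozen=True)
class PostCoordinated:
    """A finding referenced by a seed concept plus (relation, value) qualifiers."""
    seed: str
    qualifiers: FrozenSet[Tuple[str, str]]


def post(seed: str, **qualifiers: str) -> PostCoordinated:
    # A frozenset makes the expression independent of qualifier order.
    return PostCoordinated(seed, frozenset(qualifiers.items()))


# Hypothetical identifiers, used only to show the two shapes.
pre = PreCoordinated("ex:FractureOfLeftFemur", "Fracture of left femur")
a = post("ex:Fracture", findingSite="ex:Femur", laterality="ex:Left")
b = post("ex:Fracture", laterality="ex:Left", findingSite="ex:Femur")
```

In this sketch `a` and `b` compare equal because the qualifiers are stored as a set. Linking `pre` to the equivalent post-coordinated expression would still need an explicit mapping, which is part of the trade-off described above.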
Q: How can we stop the explosion of possible mappings when using flexible gazetteers? How many mappings are acceptable before they lose meaning for practitioners or domain experts?
A: To enrich our dictionaries, we use a predefined sequence of routines. Each routine performs a specific task, and the routines run in a fixed order: ignore rules first, then rewrite rules, then synonym/term inversion enrichment. The output of one routine serves as the input for the next step in the workflow. Each routine contains multiple rules, and each rule is applied just once, so the routines in the workflow are not applied iteratively and there is no risk of “explosion”. Even so, applying each set of rules just once produces a significant increase in the number of literals compared to the initial set. It is always good practice to check the newly generated terms against a large corpus of domain-specific documents (such as medical journal articles or anonymized EHRs) to confirm that medical professionals actually use them. The generated dictionary is used by both the standard and the so-called flexible gazetteers. The flexible gazetteers can identify a term from the dictionary even if its tokens are separated by an additional token in the real text.
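The workflow described in this answer can be sketched roughly as follows. This is a minimal illustration, not Ontotext's implementation: the three rules, the function names and the `max_gap` parameter are all assumptions chosen to show a single-pass pipeline, a corpus check and gap-tolerant matching.

```python
import re


def ignore_rules(terms):
    # Assumed rule: drop very short entries that would match too much text.
    return {t for t in terms if len(t) > 2}


def rewrite_rules(terms):
    # Assumed rule: add a variant without a trailing ", unspecified".
    suffix = ", unspecified"
    out = set(terms)
    out.update(t[: -len(suffix)] for t in terms if t.endswith(suffix))
    return out


def inversion_rules(terms):
    # Assumed rule: "head, modifier" is also written "modifier head".
    out = set(terms)
    for t in terms:
        parts = [p.strip() for p in t.split(",")]
        if len(parts) == 2 and all(parts):
            out.add(f"{parts[1]} {parts[0]}")
    return out


# Fixed order, each routine run exactly once: no iteration, no explosion.
PIPELINE = (ignore_rules, rewrite_rules, inversion_rules)


def enrich(terms):
    result = set(terms)
    for routine in PIPELINE:
        result = routine(result)
    return result


def validate_against_corpus(terms, corpus):
    # Keep only the terms that literally occur in the domain corpus.
    text = corpus.lower()
    return {t for t in terms if t.lower() in text}


def flexible_match(term, text, max_gap=1):
    """True if the term's tokens occur in order, with at most max_gap
    extra tokens between consecutive term tokens."""
    term_tokens = re.findall(r"\w+", term.lower())
    text_tokens = re.findall(r"\w+", text.lower())
    if not term_tokens:
        return False
    for start, token in enumerate(text_tokens):
        if token != term_tokens[0]:
            continue
        pos, matched = start, True
        for wanted in term_tokens[1:]:
            window = text_tokens[pos + 1 : pos + 2 + max_gap]
            if wanted not in window:
                matched = False
                break
            pos += 1 + window.index(wanted)
        if matched:
            return True
    return False
```

For example, `enrich({"fracture, femur", "asthma, unspecified"})` adds "asthma", "femur fracture" and "unspecified asthma" to the two original entries, and `flexible_match("femur fracture", "a femur shaft fracture")` succeeds because only one extra token separates the two term tokens.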
Q: Are you able to normalize all of the qualifiers to concepts from an ontology?
A: When we use post-coordination patterns to identify and fully specify a concept in the text, we use qualifiers that are already defined by an ontology. However, there are many cases in which we identify a qualifier in the noun phrase but cannot normalize it to a valid concept from an ontology. This requires modeling the extracted data in RDF in a way that also allows storing the text/tokens that could not be grounded to an ontology concept. It also requires new approaches for exploring the data extracted from text.
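One way to picture such a model is to emit a different kind of statement depending on whether a qualifier could be grounded. The sketch below uses plain Python tuples as stand-ins for RDF triples; the `ex:` predicates, the function names and the lookup table are hypothetical and are not Ontotext's schema.

```python
def finding_triples(finding_id, seed_iri, qualifiers, ontology_index):
    """Build (subject, predicate, object) tuples for one extracted finding.

    qualifiers     -- surface strings found in the noun phrase
    ontology_index -- assumed lookup from lower-cased label to concept IRI
    """
    triples = [(finding_id, "ex:seedConcept", seed_iri)]
    for surface in qualifiers:
        iri = ontology_index.get(surface.lower())
        if iri is not None:
            # Grounded qualifier: point at the ontology concept.
            triples.append((finding_id, "ex:hasQualifier", iri))
        else:
            # Ungrounded qualifier: keep the original text as a literal.
            triples.append((finding_id, "ex:ungroundedQualifierText", f'"{surface}"'))
    return triples


def ungrounded_texts(triples):
    # Exploration helper: list the text that still needs a concept.
    return [o.strip('"') for _, p, o in triples if p == "ex:ungroundedQualifierText"]
```

Keeping the ungrounded text in a separate predicate means it is not lost and can be queried later, for example to find frequent qualifiers that are candidates for addition to the dictionary.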
Q: How do you model relations between extracted entities?
A: If the extraction rules are defined to extract different concept classes and the relations between them, we model the semantics of each relation using special predicates. This is the case when we extract drug dosage information, where we identify a drug concept, a disease concept and the relation stating that the disease is an indication for the drug; in this example we model the relation as drug “hasIndication” disease. Other, more trivial relations in the knowledge base are modelled using the SKOS schema (related, closeMatch or exactMatch), depending on the type of relation and the mechanism used to define the mapping.
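As a rough illustration, the two kinds of relation could be serialized as N-Triples-style lines. The SKOS namespace and property names are the standard ones; the `http://example.org/` namespace, the resource names and the table that chooses a SKOS property from a mapping mechanism are assumptions made for this sketch.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
EX = "http://example.org/"  # hypothetical namespace for this sketch


def nt(subject, predicate, obj):
    # One N-Triples-style statement with three IRIs.
    return f"<{subject}> <{predicate}> <{obj}> ."


def indication(drug, disease):
    # Domain-specific relation: drug hasIndication disease.
    return nt(EX + drug, EX + "hasIndication", EX + disease)


# Assumed policy: the mechanism that produced a mapping decides its strength.
MAPPING_PREDICATE = {
    "same_code": "exactMatch",
    "label_similarity": "closeMatch",
}


def mapping(source, target, mechanism):
    # Fall back to the weakest relation when the mechanism is unknown.
    predicate = MAPPING_PREDICATE.get(mechanism, "related")
    return nt(EX + source, SKOS + predicate, EX + target)
```

Calling `indication("Metformin", "Type2Diabetes")` yields a statement with the custom predicate, while `mapping("ConceptA", "ConceptB", "label_similarity")` yields a `skos:closeMatch` statement between two placeholder concepts.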
The slides from this presentation are available on SlideShare and a recording of the presentation is available on demand by clicking below.