In the not too distant past, analysts were all searching for a “360 degree view of their data”. Most of the time this phrase referred to integrated RDBMS data, analytics interfaces and customers. But with the onslaught of unstructured data, the phrase takes on a slightly different but related meaning – The Full Semantic Circle.
This post summarizes how the integration between text mining and triplestores provides closed-loop semantics. As you will read, a discussion like this can’t avoid mentioning a few important characteristics of triplestores.
One thing you’ll learn when researching RDF and graph databases is that documents are not born in RDF format – they are transformed into RDF format. This structured form of data (sometimes called subject–predicate–object, RDF statements or semantic triples) is produced by some form of transformation process and then fed into triplestores and graph databases.
Off-the-shelf tools that make RDF out of structured and semi-structured data are available. But the resulting output, as one can imagine, is very basic. It lacks richness. There are a few ways to produce more robust RDF, and they are worth discussing. Standard RDF provides some basic classifications of terms and relationships. Tom is a person, for example. People work for Organizations. Organizations exist in specific Locations. But this can be taken even further.
An RDF graph helps you understand the meaning of concepts and put them in the right context by exploring their links. For example, the string “Boston” will match multiple entities like Boston (US) and Boston (UK). Both concepts have one and the same label, but what makes them different is the context in which they are mentioned. The first is part of the US and the other is part of the UK. RDF helps us synthesize complex knowledge models and differentiate “things from strings” by modeling relationships in a semantic network.
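To make the “Boston” example concrete, here is a minimal sketch in plain Python. It assumes a toy in-memory set of triples rather than a real triplestore, and every identifier in it (the `ex:` names, `entities_with_label`, `resolve`) is hypothetical and chosen only for illustration.

```python
# Toy triple store: each statement is a (subject, predicate, object) tuple.
# All identifiers are illustrative placeholders, not real URIs.
TRIPLES = {
    ("ex:Boston_US", "rdfs:label", "Boston"),
    ("ex:Boston_US", "ex:partOf", "ex:United_States"),
    ("ex:Boston_UK", "rdfs:label", "Boston"),
    ("ex:Boston_UK", "ex:partOf", "ex:United_Kingdom"),
}


def entities_with_label(label):
    """Return every entity whose label equals the given string."""
    return sorted(s for s, p, o in TRIPLES if p == "rdfs:label" and o == label)


def resolve(label, context_entity):
    """Keep only the candidates that are linked to the entity seen in context."""
    return [
        s
        for s in entities_with_label(label)
        if (s, "ex:partOf", context_entity) in TRIPLES
    ]


# The string alone is ambiguous: it matches two different "things".
candidates = entities_with_label("Boston")

# A link to a country mentioned in the same context picks one of them.
resolved = resolve("Boston", "ex:United_Kingdom")

print(candidates)
print(resolved)
```

The label lookup treats “Boston” as a string and returns both candidates. Following the `ex:partOf` link is what turns the string into a single thing.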
Text mining pipelines enrich RDF by incorporating packaged extractors for specific domains such as biomedicine, the humanities or news. The extractors create RDF and automatically insert the statements into the triplestore. Domain-specific thesauri, taxonomies and ontologies, in either proprietary or open source form, can be leveraged to further complement your text mining pipeline. The combination of these two approaches produces deep relationship extraction and a rich set of RDF statements. With this knowledge base, search and discovery applications take on new potential. Native RDF triplestores, a type of graph database, are a natural place to store the results of text mining. This process is also referred to as ‘semantic annotation’.
Leveraging open standard domain-specific taxonomies and ontologies can ensure that this enriched content fits a specific audience and is linked to standard open data repositories across the globe, making your data highly intelligent and actionable.
One of the challenges in arriving at this end state is that graph databases and text mining processes are usually not tightly coupled. Organizations either run an RDF converter with some simple data modelling or rely on a basic text mining process that stores its output in a sub-optimal format. While most graph databases simply provide a repository for this information, GraphDB™ is tightly integrated with the text mining pipeline through its Concept Extraction Service (CES) API. This powerful coupling means that as RDF statements are created they can be classified and inserted into GraphDB™, leading to a constant stream of new information.
How does this work?
In general, the REST Client API calls a GATE-based annotation pipeline, which sends back enriched data in RDF form. Organizations typically customize these pipelines, which can consist of any set of GATE-developed text mining algorithms for scoring, machine learning, disambiguation or any of the other wide range of text mining techniques.
It is important to note that these text mining pipelines create RDF in a linear fashion and feed GraphDB™. Once the RDF is enriched in this fashion and stored in the database, these annotations can then be modified, edited or removed. This is particularly useful when integrating with Linked Open Data (LOD) sources. Updates are applied to the database automatically when the source information changes.
For example, let’s say your text mining pipeline is referencing Freebase as its Linked Open Data source for organization names. If an organization name changes or a new subsidiary is announced in Freebase, this information will be updated as reference-able metadata in GraphDB™.
In addition, this tightly-coupled integration includes a suite of enterprise-grade APIs, the core of which is the Concept Extraction API. This API consists of a Coordinator and Entity Update Feed. Here’s what they do:
Other APIs include Document Classification, Disambiguation, Machine Learning, Sentiment Analysis and Relation Extraction. Together, this complete set of technology allows for tight integration and accurate processing of text while efficiently storing the resulting RDF statements in GraphDB™.
As mentioned, the value of this tightly-coupled integration is in the rich metadata and relationships which can now be derived from the underlying RDF database. It’s this metadata that powers high performance search and discovery or website applications – results are complete, accurate and instantaneous.
Relationships are important and are represented as new and dynamic properties (predicates) within GraphDB™. GraphDB™ can also take a statement and apply its inferencing capabilities to materialize all the relationships that can be inferred from it. Let’s start with a known fact or statement:
Barack Obama was elected president of the US
A classical text mining engine can extract the relation that Barack Obama is President of the US. Still, this fact is only true for a limited time. In four years, it may be different. RDF helps you model the fact as follows:
<Barack Obama (person)> <is_president> <USA (country)> <Document ID>
<Document ID> dc:date <document date>
Once this provenance is preserved, you can ask via SPARQL who the president of the US was back in 2002. This is known as “multiple versions of the truth”. A new statement can also be materialized, resulting in additional intelligence and faster queries.
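The provenance idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how GraphDB™ or SPARQL work internally: statements are stored as four-part tuples whose last element names the source document, and a separate mapping plays the role of the dc:date statement. The document identifiers and dates are made up for the example, and `stated_presidents` is a hypothetical helper.

```python
from datetime import date

# Quads: (subject, predicate, object, source document).
# Document identifiers are placeholders invented for this sketch.
QUADS = [
    ("ex:George_W_Bush", "ex:is_president", "ex:USA", "ex:doc1"),
    ("ex:Barack_Obama", "ex:is_president", "ex:USA", "ex:doc2"),
]

# Provenance: the date attached to each source document (the dc:date statement).
# The dates are illustrative assumptions.
DOC_DATES = {
    "ex:doc1": date(2002, 3, 14),
    "ex:doc2": date(2009, 1, 21),
}


def stated_presidents(country, year):
    """Return who is named president of `country` by documents dated in `year`."""
    return sorted(
        {
            s
            for s, p, o, doc in QUADS
            if p == "ex:is_president"
            and o == country
            and DOC_DATES[doc].year == year
        }
    )


print(stated_presidents("ex:USA", 2002))
print(stated_presidents("ex:USA", 2009))
```

Both statements stay in the store at the same time. The document date decides which one answers a question about a given year, which is what keeps several versions of the truth queryable side by side.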
Let’s look at another example. Let’s say the text states “Semprana was rejected in June of 2014 for treatment of migraines by the FDA”.
The text mining pipeline would identify “Semprana” as a prescription drug and insert definitions or knowledge from other sources. It would do the same for the other concepts in the sentence such as “migraine” and “FDA” and identify the date of June 2014. This is where the power of inference kicks in. GraphDB™ now takes this statement and is able to report all migraine drugs rejected by the FDA.
Although the details are beyond the scope of this post, it’s worth mentioning that one of the unique attributes of GraphDB™ is its ability to update the RDF repository together with all the inferred relationships without a substantial performance hit when an inferred statement is retracted.
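As a rough illustration of how materialized inference supports such a report, the sketch below applies a single RDFS-style rule (an instance of a subclass is also an instance of the superclass) to a handful of facts. It is a toy forward-chaining loop in plain Python, not GraphDB™'s reasoner, and every identifier (`ex:treats`, `ex:rejectedBy`, `materialize`, `rejected_drugs`) is an assumption made for the example.

```python
# Facts a text mining pipeline might produce from the sentence, plus one
# ontology statement. All names are hypothetical placeholders.
FACTS = {
    ("ex:Semprana", "rdf:type", "ex:PrescriptionDrug"),
    ("ex:PrescriptionDrug", "rdfs:subClassOf", "ex:Drug"),
    ("ex:Semprana", "ex:treats", "ex:Migraine"),
    ("ex:Semprana", "ex:rejectedBy", "ex:FDA"),
    ("ex:Semprana", "ex:rejectionDate", "2014-06"),
}


def materialize(facts):
    """Forward-chain the subclass rule until no new statement appears."""
    inferred = set(facts)
    while True:
        new = set()
        for s, p, o in inferred:
            if p != "rdf:type":
                continue
            for s2, p2, o2 in inferred:
                # If s is an instance of o, and o is a subclass of o2,
                # then s is also an instance of o2.
                if p2 == "rdfs:subClassOf" and s2 == o:
                    new.add((s, "rdf:type", o2))
        new -= inferred
        if not new:
            return inferred
        inferred |= new


def rejected_drugs(facts, condition, agency):
    """List every ex:Drug that treats `condition` and was rejected by `agency`."""
    return sorted(
        s
        for s, p, o in facts
        if p == "rdf:type"
        and o == "ex:Drug"
        and (s, "ex:treats", condition) in facts
        and (s, "ex:rejectedBy", agency) in facts
    )


closure = materialize(FACTS)
print(rejected_drugs(FACTS, "ex:Migraine", "ex:FDA"))
print(rejected_drugs(closure, "ex:Migraine", "ex:FDA"))
```

The query asks for drugs in general, while the extracted fact only says that Semprana is a prescription drug. Against the raw facts the report is empty; against the materialized closure it finds Semprana, because the inferred type statement is already stored.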
When developing text mining pipelines, each solution may utilize a different set of tools. Disambiguation, for example, can take place solely within a text mining pipeline through machine learning and “training” a pipeline on a specific domain. “Orange” within a health and wellness context would refer to the fruit, while the same term in a geographical context could refer to the county in Southern California. How does this process work?
Disambiguation also takes place when several repositories of linked open data are brought together. This enriched data provides context which is used by the pipeline to gain clarity with respect to entities. GraphDB™ leverages the OWL “sameAs” property, which declares that two different URIs denote the same resource or term. GraphDB™ optimizes these potentially long lists by using a single master node in its indices to represent a class of “sameAs”-equivalent URIs. It is leveraged to rank the most relevant of the multiple terms and bring forth the single most applicable result.
The text mining approach ensures that you annotate the correct term, while the graph database approach ensures your search results return the most relevant result from that domain, returning “things not strings.”
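One common way to represent a class of equivalent identifiers with a single representative is a union-find (disjoint-set) structure. The sketch below uses that technique to illustrate the master-node idea; it is an assumption-laden illustration, not GraphDB™'s actual index implementation. The class name `SameAsIndex`, its methods, and the URIs are all hypothetical.

```python
class SameAsIndex:
    """Union-find over URIs: each equivalence class has one master URI."""

    def __init__(self):
        self._parent = {}

    def find(self, uri):
        """Return the master URI of the class that `uri` belongs to."""
        self._parent.setdefault(uri, uri)
        root = uri
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the path straight at the master.
        while self._parent[uri] != root:
            self._parent[uri], uri = root, self._parent[uri]
        return root

    def same_as(self, a, b):
        """Record that `a` and `b` denote the same resource."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Choose the lexicographically smaller URI as master (arbitrary
            # but deterministic; a real system may use another policy).
            master, other = sorted((ra, rb))
            self._parent[other] = master

    def members(self, uri):
        """Return every known URI equivalent to `uri`, including itself."""
        root = self.find(uri)
        return sorted(u for u in list(self._parent) if self.find(u) == root)


index = SameAsIndex()
# Three sources using different identifiers for the same county (made-up URIs).
index.same_as("a:Orange_County", "b:Orange_County_CA")
index.same_as("b:Orange_County_CA", "c:OC_California")

print(index.find("c:OC_California"))
print(index.members("b:Orange_County_CA"))
print(index.find("a:Orange_fruit"))
```

Statements only need to be indexed against the master URI, and a lookup on any equivalent identifier is first mapped to that master. The fruit keeps its own class because nothing declares it equivalent to the county.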
In terms of taxonomy or ontology support, these are used as rules for structured annotations and can be stored directly within GraphDB™ to facilitate dynamic queries on relationships and inferred facts.
Tightly-coupled integration of text mining and graph databases makes the end-to-end process of structuring unstructured data, enriching domain-specific content and feeding a dynamic repository of facts and statements much easier to operationalize. This blend of technology also happens to be unique in the marketplace. The result is a true semantic repository on which dynamic curation, authoring and reporting can be executed on an enterprise scale. Ontotext offers all of this technology within our semantic platform.