Graph databases, also known as triplestores, have a very powerful capability – they can store hundreds of billions of semantic facts (triples) from any subject imaginable. The number of free semantic facts on the market today from sources such as DBpedia, GeoNames and others is high and continues to grow every day. Some estimates have this total between 150 and 200 billion right now. As a result, Linked Open Data (LOD) can be a good source of information with which to load your graph databases.
But Linked Open Data is just one source of data. When does it become really powerful? When you create your own semantic triples from your own data and use them in conjunction with LOD to enrich your database. This process, commonly referred to as text mining or natural language processing, extracts the salient facts from free-flowing text and stores the results in a graph database. Users then analyze the enriched data, visualize it, aggregate it and report on it. In a recent project Ontotext undertook on behalf of the EDM Council, which publishes FIBO (the Financial Industry Business Ontology), FIBO ontologies were enhanced with Linked Open Data, allowing users to query company names and stock prices at the same time to show the lowest trading prices for all public stocks in North America in the last 50 years. To do this, Ontotext integrated data sources using the Ontotext Semantic Platform.
The market for this type of technology is fragmented. Some vendors only sell the graph database and leave it up to you to determine how to do the text mining. Other vendors only sell text mining and leave it up to you to figure out where to store the results. Ontotext supports both along with a semantic platform and pre-built solutions for life sciences, media & publishing, compliance & document management and recruitment.
What does a typical text mining process look like? Here’s a simple explanation in five steps. Text mining purists can surely add to this discussion and we encourage you to. At the most basic level, here’s what happens…
The text inside unstructured data (documents, blogs, news articles, research reports) is read by the text mining engine. Sentences are split into words, which are then analyzed. There are technical names for the detailed processes that take place in this step, like “tokenization” and “natural language processing”, but basically it’s the process of extracting the text from the document and understanding the words. For example, words that are similar (walk, walks, walked) may be used throughout the document. In this first phase, these are identified and grouped when necessary. Most importantly, the extraction phase identifies entities that are critical to the overall set of unstructured data.
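To make the extraction step concrete, here is a minimal sketch in Python of splitting text into words and grouping similar word forms. The function names and the tiny suffix list are assumptions made for illustration; a production text mining engine would use a proper tokenizer and stemmer or lemmatizer rather than this toy rule.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (letters only)."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Strip a few common English suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three letters so short words such as "is" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_word_forms(text):
    """Group every token under its stem, e.g. walk/walks/walked under 'walk'."""
    groups = defaultdict(set)
    for token in tokenize(text):
        groups[naive_stem(token)].add(token)
    return dict(groups)


groups = group_word_forms("She walks daily. He walked home. They walk together.")
print(groups["walk"])
```

The three forms of "walk" end up in one group, which is the kind of grouping described above. The suffix rule is deliberately crude and will mis-stem many real words.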
As we mentioned, there are free sets of triples available that describe places, people, music and more. These sources are really important since they allow us to learn more about the words just extracted. We can use this Linked Open Data to identify entities extracted from the text and other characteristics about those entities. This “semantic enrichment process” allows us to take names like “Bruce Springsteen”, link that name to the MusicBrainz Linked Open Dataset and derive a wealth of additional information about Bruce including songs he has written, albums, biographical information and more.
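As a rough illustration of enrichment, the sketch below links extracted names to a record in an open dataset. Everything here is an assumption for the sake of the example: the `OPEN_DATA` dictionary stands in for a real Linked Open Data source, the URI is made up, and a real pipeline would query the dataset itself rather than a hard-coded table.

```python
# Hypothetical stand-in for a Linked Open Data source. A real pipeline would
# look names up in a published dataset instead of this dictionary.
OPEN_DATA = {
    "bruce springsteen": {
        "uri": "http://example.org/artist/bruce-springsteen",  # made-up URI
        "type": "MusicArtist",
        "albums": ["Born to Run", "Born in the U.S.A."],
    },
}


def enrich(mention):
    """Return the open-data record for an extracted name, or None if unknown."""
    return OPEN_DATA.get(mention.strip().lower())


def enrich_all(mentions):
    """Map each mention to its record, skipping mentions with no match."""
    enriched = {}
    for mention in mentions:
        record = enrich(mention)
        if record is not None:
            enriched[mention] = record
    return enriched


enriched = enrich_all(["Bruce Springsteen", "Unknown Person"])
print(enriched["Bruce Springsteen"]["albums"])
```

Matching on the lowercased name is the simplest possible linking rule. It is only meant to show where the additional information comes from once a name has been linked.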
The machine learning phase includes two important steps: classification and disambiguation. During classification, extracted entities are assigned “types”. Is Jim a person? Is Bank of America a company? Is Dallas a person or a city? By analyzing the text in context with the sentence structure and the enriched data, these questions can be resolved using statistical and logical techniques.
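The idea of using context to assign a type can be sketched with a toy scorer. The cue words below are hypothetical and hand-picked for this example; real classifiers are trained statistically on large annotated corpora and also use sentence structure and the enriched data.

```python
import re

# Hypothetical context cues: words near a mention that hint at its type.
CUES = {
    "Person": {"said", "mr", "ms", "born", "met"},
    "City": {"in", "near", "downtown", "visited", "mayor"},
}


def classify(mention, sentence):
    """Pick the type whose cue words overlap most with the surrounding words."""
    context = sentence.lower().replace(mention.lower(), " ")
    words = set(re.findall(r"[a-z]+", context))
    scores = {label: len(words & cues) for label, cues in CUES.items()}
    best = max(scores, key=scores.get)
    # With no supporting cue at all, refuse to guess.
    return best if scores[best] > 0 else "Unknown"


print(classify("Dallas", "She lives in downtown Dallas."))
print(classify("Dallas", "Dallas said he was born in May."))
```

The same name gets a different type depending on the words around it, which is the point of classifying in context.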
In many cases, text mining results in entities that appear to be similar. For example, my name could be listed as Tony Agresta in one document, Anthony Agresta in another and Anthony Joseph Agresta in a third. Because text mining can analyze these entities in context and uses other data about the entities, it can determine that they refer to the same person. This process is known as disambiguation or “identity resolution” and is very powerful. Financial services companies need to understand who they are doing business with. Government agencies want to resolve identities that may exist across many documents. Researchers need to know when two entities really refer to the same person.
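Here is a minimal sketch of identity resolution using the names from the example above. It assumes a small, hypothetical nickname table and compares names only; as described above, a real system would also weigh context and other data about each entity before merging them.

```python
# Hypothetical nickname table. Real systems use much larger name resources
# plus contextual evidence such as employer, address or co-mentions.
NICKNAMES = {"tony": "anthony", "bill": "william", "liz": "elizabeth"}


def name_key(full_name):
    """Reduce a name to (canonical first name, last name), ignoring middle names."""
    parts = full_name.lower().split()
    first = NICKNAMES.get(parts[0], parts[0])
    return (first, parts[-1])


def resolve(mentions):
    """Cluster mentions that share the same canonical name key."""
    clusters = {}
    for mention in mentions:
        clusters.setdefault(name_key(mention), []).append(mention)
    return clusters


clusters = resolve(
    ["Tony Agresta", "Anthony Agresta", "Anthony Joseph Agresta", "Sally Jones"]
)
for key, names in clusters.items():
    print(key, names)
```

All three spellings of the same person land in one cluster, while "Sally Jones" stays separate. Matching on names alone would wrongly merge two different people who share a name, which is why context matters in practice.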
One additional point about this step: ontologies or thesauri used to classify the extracted entities can be extremely useful as organizations build out their knowledge bases. These powerful classification systems are used to formally represent knowledge in a domain. They provide a common vocabulary to denote the types, properties and interrelationships of concepts within that domain. It’s this common vocabulary that leads to massive gains in productivity for organizations needing to speak a common language. Search and discovery applications not only search for the entities extracted from the text but also have the flexibility of searching on types of entities like people, places, events, organizations and more.
Let’s recap. We extracted text. We linked it to open data. We classified the entities and we identified when the same entity is referred to differently, even when this happens across different documents. Now the magic begins. We process the text and identify RELATIONSHIPS between the subjects and the objects. Here is a series of examples: Sally worked at Banking Corp. Gary lives in Tampa. Tampa is a city in Florida. Gary worked with Sally. Sally plays golf. Some of this text is explicitly described in the document and some of it has been added through enrichment.
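The example sentences above can be written down as subject, predicate, object triples. The sketch below does this in plain Python; the predicate names (`workedAt`, `livesIn` and so on) are invented for illustration and do not come from any particular ontology.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Predicate names are hypothetical labels chosen for this example.
facts = [
    Triple("Sally", "workedAt", "Banking Corp"),
    Triple("Gary", "livesIn", "Tampa"),
    Triple("Tampa", "cityIn", "Florida"),
    Triple("Gary", "workedWith", "Sally"),
    Triple("Sally", "plays", "golf"),
]


def about(entity, triples):
    """Return every fact in which the entity appears as subject or object."""
    return [t for t in triples if entity in (t.subject, t.object)]


for fact in about("Sally", facts):
    print(fact.subject, fact.predicate, fact.object)
```

Because "Sally" appears as a subject in some facts and as an object in another, following her through the list already traces connections between people, places and organizations.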
These semantic facts (also known as RDF statements) contain relationships within the fact and also across the set of facts. Because these relationships exist, the data stored can be represented in the form of a graph, hence the name “graph database”. What’s equally important is the idea of inference, where new facts can be created from existing facts. When new facts are materialized and stored inside the graph database, queries run faster and users get more complete, accurate results. Here’s a simple example of inference using two pre-existing facts: Fido is a dog, and a dog is a mammal. From these facts we can infer that Fido is a mammal. When a query takes place asking for names given to mammals, the result will include “Fido.”
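The Fido example can be sketched as a small forward-chaining loop that materializes new facts until nothing more can be derived. The predicate names `isA` and `subClassOf` and the single rule are assumptions for this illustration; a graph database applies a much richer, configurable rule set.

```python
def materialize(facts):
    """Apply one rule until no new facts appear:
    (x isA y) and (y subClassOf z)  =>  (x isA z)
    """
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            if p != "isA":
                continue
            for s2, p2, o2 in list(inferred):
                if p2 == "subClassOf" and s2 == o:
                    new_fact = (s, "isA", o2)
                    if new_fact not in inferred:
                        inferred.add(new_fact)
                        changed = True
    return inferred


facts = {
    ("Fido", "isA", "dog"),
    ("dog", "subClassOf", "mammal"),
}

closure = materialize(facts)

# Query: which names are given to mammals?
mammals = sorted(s for s, p, o in closure if p == "isA" and o == "mammal")
print(mammals)
```

The fact that Fido is a mammal was never stated, yet the query finds it because the inferred fact was stored alongside the original two.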
So where does all of this intelligence go? The semantic facts created, AND THE REFERENCE TO THE ORIGINAL DOCUMENT, are stored inside the graph database. At Ontotext, our database is called GraphDB™. The ontology (or taxonomy or thesaurus) used to classify the entities is also stored inside the graph database. This provides powerful query capabilities. For example, if a query is looking for “all the people living in the Southwest US that work for a financial services company”, the ontology can inform the query that Arizona is a state in the Southwest and that Jim is a person. The ontology also classifies organizations into industries. Semantic facts such as “Jim lives in Arizona” and “Jim works at Banking Corp” are queried along with ontological relationships such as “Arizona is a state in the Southwest” and “Banking Corp is a financial services company”, yielding accurate, complete responses to the query.
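To show how extracted facts and ontological relationships combine at query time, here is a small sketch in plain Python. In a triplestore a query like this would typically be written in SPARQL; the in-memory set, the predicate names and the extra "Ann" facts below are all assumptions added for illustration.

```python
triples = {
    # Facts extracted from text, using the examples above.
    ("Jim", "type", "Person"),
    ("Jim", "livesIn", "Arizona"),
    ("Jim", "worksFor", "Banking Corp"),
    # Ontological relationships.
    ("Arizona", "stateIn", "Southwest US"),
    ("Banking Corp", "industry", "Financial Services"),
    # Hypothetical extra facts so the query has something to filter out.
    ("Ann", "type", "Person"),
    ("Ann", "livesIn", "Ohio"),
    ("Ann", "worksFor", "Banking Corp"),
    ("Ohio", "stateIn", "Midwest US"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is a stored triple."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def people_in_region_and_industry(region, industry):
    """People who live in a state of the region and work for a company in the industry."""
    people = {s for s, p, o in triples if p == "type" and o == "Person"}
    result = []
    for person in sorted(people):
        in_region = any(
            region in objects(state, "stateIn")
            for state in objects(person, "livesIn")
        )
        in_industry = any(
            industry in objects(org, "industry")
            for org in objects(person, "worksFor")
        )
        if in_region and in_industry:
            result.append(person)
    return result


print(people_in_region_and_industry("Southwest US", "Financial Services"))
```

Neither "Southwest US" nor "Financial Services" appears in any fact about Jim directly. The query only succeeds because the ontology links Arizona to the region and Banking Corp to the industry.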
When you think of text mining in these simple terms, it’s not too difficult to understand the basic processing steps to go from free-flowing text to meaning. This is a fantastic way to understand what’s inside the massive amounts of unstructured data you have today. The ability to process unstructured data, transforming it into structured intelligence with results stored inside a graph database along with a classification system, will provide a big advantage for you over the competition. Today, text mining is becoming more and more common as organizations make the critical decision to reveal the hidden meaning behind their unstructured data. Connecting this process to a graph database like GraphDB™ provides you with the flexibility to query, aggregate, report and visualize this data in support of improved decision making.