Text Mining & Graph Databases – Two Technologies that Work Well Together

Text mining and graph databases work well together because natural language processing can extract meaning from free-flowing text, and the results can then be stored in a graph database for use in search, discovery and analysis.

RDF graph databases, also known as triplestores, have a very powerful capability – they can store hundreds of billions of semantic facts (triples) on any subject imaginable. The number of free semantic facts available today from sources such as DBpedia, GeoNames and others is high and continues to grow every day. Some estimates put this total between 150 and 200 billion right now. As a result, Linked Open Data (LOD) can be a good source of information with which to load your graph database.

But Linked Open Data is just one source of data. When does it become really powerful? When you create your own semantic triples from your own data and use them in conjunction with LOD to enrich your database. This process, commonly referred to as text mining or natural language processing, extracts the salient facts from free-flowing text and stores the results in a graph database. Users then analyze the enriched data, visualize it, aggregate it and report on it. In a recent project that Ontotext undertook on behalf of the EDM Council, which publishes FIBO (the Financial Industry Business Ontology), FIBO ontologies were enhanced with Linked Open Data, allowing users to query company names and stock prices at the same time, for example to show the lowest trading prices for all public stocks in North America in the last 50 years. To do this, Ontotext integrated data sources using the Ontotext Semantic Platform.

The market for this type of technology is fragmented. Some vendors sell only the graph database and leave it up to you to work out how to do the text mining. Other vendors sell only text mining and leave it up to you to figure out where to store the results. Ontotext supports both, along with a semantic platform and pre-built solutions for life sciences, media & publishing, compliance & document management, and recruitment.

What does a typical text mining process look like? Here’s a simple explanation in five steps. Text mining purists can surely add to this discussion, and we encourage you to do so. At the most basic level, here’s what happens…

Before you start reading the steps, download GraphDB Free. It will help you visualize them on an actual graph database.

Step 1 – Text Extraction

The text inside unstructured data (documents, blogs, news articles, research reports) is read by the text mining engine. Sentences are split into words, which are then analyzed. There are technical names for the detailed processes that take place in this step, like “tokenization” and “natural language processing”, but basically it’s the process of extracting the text from the document and understanding the words. For example, different forms of the same word (walk, walks, walked) may be used throughout the document. In this first phase, these are identified and grouped when necessary. Most importantly, the extraction phase identifies the entities that are critical to the overall set of unstructured data.
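To make the idea of tokenizing and grouping word forms concrete, here is a minimal sketch in Python. It is only an illustration: the suffix-stripping rule is a toy stand-in for a real stemmer or lemmatizer, and the function names (`tokenize`, `naive_stem`, `group_variants`) are made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (letters only)."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Strip a few common English suffixes. A toy rule, not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_variants(text):
    """Map each stem to the set of surface forms found in the text."""
    groups = defaultdict(set)
    for token in tokenize(text):
        groups[naive_stem(token)].add(token)
    return dict(groups)


groups = group_variants("She walks daily. He walked home. They walk together.")
print(sorted(groups["walk"]))
```

A production engine would use language-aware tokenization and lemmatization, but the grouping idea is the same.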

Step 2 – Linking to Data Dictionaries

As we mentioned, there are free sets of triples available that describe places, people, music and more. These sources are really important since they allow us to learn more about the words just extracted. We can use this Linked Open Data to identify the entities extracted from the text and other characteristics of those entities. This “semantic enrichment process” allows us to take a name like “Bruce Springsteen”, link that name to the MusicBrainz Linked Open Dataset and derive a wealth of additional information about Bruce, including songs he has written, albums, biographical information and more.
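A dictionary lookup of this kind can be sketched in a few lines of Python. Everything in the small `GAZETTEER` below is a placeholder: the `example.org` identifiers and type names are invented for illustration and are not real MusicBrainz or GeoNames records.

```python
import re

# Toy stand-in for a Linked Open Data set. URIs and types are placeholders.
GAZETTEER = {
    "bruce springsteen": {
        "uri": "http://example.org/artist/bruce-springsteen",
        "type": "MusicArtist",
    },
    "tampa": {
        "uri": "http://example.org/place/tampa",
        "type": "City",
    },
}


def link_entities(text, gazetteer):
    """Return (surface form, record) pairs for each dictionary label in text."""
    links = []
    for label, record in gazetteer.items():
        # Whole-word, case-insensitive match of the dictionary label.
        pattern = r"\b" + re.escape(label) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            links.append((match.group(0), record))
    return links


for surface, record in link_entities("Bruce Springsteen played in Tampa.", GAZETTEER):
    print(surface, "->", record["uri"])
```

Once a mention is tied to an identifier, any other facts published about that identifier can be pulled in as enrichment.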

Step 3 – Machine Learning

The machine learning phase includes two important tasks – classification and disambiguation. In classification, “types” are assigned to the extracted entities. Is Jim a person? Is Bank of America a company? Is Dallas a person or a city? By analyzing the text in context, together with the sentence structure and the enriched data, these questions can be resolved using statistical and logical techniques.
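As a rough illustration of classifying an entity from its context, the sketch below counts cue words around a mention. This is a deliberately simple rule-based stand-in, not the statistical models a real engine would use, and the `CONTEXT_CUES` lists are invented for the example.

```python
import re

# Hypothetical cue words per entity type, chosen only for this example.
CONTEXT_CUES = {
    "Person": {"mr", "mrs", "said", "he", "she", "hired"},
    "City": {"in", "near", "downtown", "city", "flew"},
    "Company": {"inc", "corp", "shares", "acquired", "company"},
}


def classify(entity, sentence):
    """Pick the type whose cue words overlap most with the surrounding words."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    # Use only the context, not the words of the entity name itself.
    words -= set(re.findall(r"[a-z]+", entity.lower()))
    scores = {etype: len(words & cues) for etype, cues in CONTEXT_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Unknown"


print(classify("Dallas", "Dallas said he would retire."))
print(classify("Dallas", "She flew to Dallas, a city in Texas."))
```

The same name gets a different type depending on the sentence around it, which is the point of analyzing entities in context.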

In many cases text mining results in entities that appear to be similar. For example, my name could be listed as Tony Agresta in one document, Anthony Agresta in another and Anthony Joseph Agresta in a third. Because text mining can analyze these entities in context and use other data about them, it can determine that they refer to the same person. This process is known as disambiguation or “identity resolution” and is very powerful. Financial services companies need to understand who they are doing business with. Government agencies want to resolve identities that may exist across many documents. Researchers need to know when two entities really refer to the same person.
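Here is a minimal sketch of the name-matching part of identity resolution. It assumes a small, made-up nickname table and compares only first and last names. A real system would also weigh context and other data about each entity, as described above.

```python
from collections import defaultdict

# Hypothetical nickname table for this example only.
NICKNAMES = {"tony": "anthony", "bill": "william"}


def name_key(full_name):
    """Reduce a name to (canonical first name, last name), ignoring middle names."""
    parts = full_name.lower().split()
    first = NICKNAMES.get(parts[0], parts[0])
    return (first, parts[-1])


def resolve(mentions):
    """Group name mentions that share the same normalized key."""
    clusters = defaultdict(list)
    for mention in mentions:
        clusters[name_key(mention)].append(mention)
    return dict(clusters)


clusters = resolve(["Tony Agresta", "Anthony Agresta", "Anthony Joseph Agresta"])
print(clusters)
```

All three mentions land in one cluster, so downstream facts about any of them can be attached to a single entity.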

One additional point about this step – using ontologies or thesauri to classify the extracted entities can be extremely useful as organizations build out their knowledge bases. These powerful classification systems are used to formally represent knowledge in a domain. They provide a common vocabulary to denote the types, properties and interrelationships of concepts within that domain. It’s this common vocabulary that leads to gains in productivity for organizations that need to speak a common language. Search and discovery applications can not only search for the entities extracted from the text but also search on types of entities such as people, places, events, organizations and more.

Step 4 – Rules Processing

Let’s recap. We extracted text. We linked it to open data. We classified the entities and we identified when the same entity is referred to differently, even when this happens across different documents. Now the magic begins. We process the text and identify RELATIONSHIPS between the subjects and the objects. Here is a series of examples: Sally worked at Banking Corp. Gary lives in Tampa. Tampa is a city in Florida. Gary worked with Sally. Sally plays golf. Some of these facts are stated explicitly in the document and some have been added through enrichment.
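The example sentences above map naturally onto subject, predicate, object triples. The sketch below stores them as plain Python tuples. The predicate names (`workedAt`, `livesIn` and so on) are invented labels for this example, not terms from any published vocabulary.

```python
# Each fact is a (subject, predicate, object) tuple.
TRIPLES = [
    ("Sally", "workedAt", "Banking Corp"),
    ("Gary", "livesIn", "Tampa"),
    ("Tampa", "isCityIn", "Florida"),
    ("Gary", "workedWith", "Sally"),
    ("Sally", "plays", "golf"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the given pattern. None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


print(match(TRIPLES, s="Sally"))   # everything said about Sally
print(match(TRIPLES, o="Sally"))   # everything that points at Sally
```

Because "Sally" appears as both a subject and an object, the facts connect to each other, which is what makes the data a graph.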

These semantic facts (also known as RDF statements) contain relationships within each fact and also across the set of facts. Because these relationships exist, the stored data can be represented in the form of a graph, hence the name “graph database”. What’s equally important is the idea of inference, where new facts can be created from existing facts. When new facts are materialized and stored inside the graph database, queries run faster and users get more complete, accurate results. Here’s a simple example of inference using two pre-existing facts: Fido is a dog. A dog is a mammal. From these facts we can infer that Fido is a mammal. When a query asks for the names given to mammals, the result will include “Fido.”
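The Fido example can be reproduced with a tiny forward-chaining loop. This is only a sketch of the idea of materialization with a single hand-written rule. The predicate names `isA` and `subClassOf` are assumptions for this example, and a real triplestore applies a much richer, configurable rule set.

```python
def materialize(facts):
    """Repeatedly apply one rule until no new facts appear:
    (x, isA or subClassOf, C) and (C, subClassOf, D) => (x, same predicate, D).
    """
    inferred = set(facts)
    while True:
        new = set()
        for s, p, o in inferred:
            if p not in ("isA", "subClassOf"):
                continue
            for s2, p2, o2 in inferred:
                if p2 == "subClassOf" and s2 == o:
                    new.add((s, p, o2))
        if new <= inferred:  # nothing new was derived, so we are done
            return inferred
        inferred |= new


facts = materialize({
    ("Fido", "isA", "Dog"),
    ("Dog", "subClassOf", "Mammal"),
})

# Query: which names are given to mammals?
mammals = sorted(s for s, p, o in facts if p == "isA" and o == "Mammal")
print(mammals)
```

Because the inferred fact is stored alongside the original ones, the query is a plain lookup and needs no reasoning at query time.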

Step 5 – Semantic Indexing

So where does all of this intelligence go? The semantic facts created AND THE REFERENCE TO THE ORIGINAL DOCUMENT are stored inside the graph database. At Ontotext, our database is called GraphDB™. The ontology (or taxonomy or thesaurus) used to classify the entities is also stored inside the graph database. This provides powerful query capabilities. For example, if a query is looking for “all the people living in the Southwest US that work for a financial services company”, the ontology can inform the query that Arizona is a state in the Southwest and that Jim is a person. The ontology also classifies organizations into industries. Semantic facts such as “Jim lives in Arizona” and “Jim works at Banking Corp” are queried along with ontological relationships such as “Arizona is a state in the Southwest” and “Banking Corp is a financial services company”, yielding accurate, complete responses to the query.
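The Southwest query can be sketched over an in-memory set of triples to show how extracted facts and ontology facts are used together. The predicate names and the extra person "Ann" are invented for this example. In practice the query would be written in a query language such as SPARQL and run against the graph database.

```python
FACTS = {
    # Facts extracted from documents
    ("Jim", "type", "Person"),
    ("Jim", "livesIn", "Arizona"),
    ("Jim", "worksAt", "Banking Corp"),
    ("Ann", "type", "Person"),
    ("Ann", "livesIn", "Ohio"),
    ("Ann", "worksAt", "Banking Corp"),
    # Ontology facts
    ("Arizona", "inRegion", "Southwest"),
    ("Banking Corp", "inIndustry", "Financial Services"),
}


def objects(facts, subject, predicate):
    """All objects o such that (subject, predicate, o) is a stored fact."""
    return {o for s, p, o in facts if s == subject and p == predicate}


def people_in_region_and_industry(facts, region, industry):
    """People who live in a place within `region` and work for an organization in `industry`."""
    people = sorted(s for s, p, o in facts if p == "type" and o == "Person")
    result = []
    for person in people:
        lives_there = any(
            region in objects(facts, place, "inRegion")
            for place in objects(facts, person, "livesIn")
        )
        works_there = any(
            industry in objects(facts, org, "inIndustry")
            for org in objects(facts, person, "worksAt")
        )
        if lives_there and works_there:
            result.append(person)
    return result


print(people_in_region_and_industry(FACTS, "Southwest", "Financial Services"))
```

Neither "Southwest" nor "Financial Services" appears in the facts about Jim directly. The ontology facts bridge the gap.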

When you think of text mining in these simple terms, it’s not too difficult to understand the basic processing steps that take you from free-flowing text to meaning. This is a fantastic way to understand what’s inside the massive amounts of unstructured data you have today. The ability to process unstructured data, transforming it into structured intelligence with the results stored inside a graph database along with a classification system, will give you a big advantage over the competition. Today, text mining is becoming more and more common as organizations make the critical decision to reveal the hidden meaning behind their unstructured data. Connecting this process to a graph database like GraphDB™ gives you the flexibility to query, aggregate, report on and visualize this data in support of improved decision making.


Milena Yankova

Director Global Marketing at Ontotext
A bright lady with a PhD in Computer Science, Milena's path started in the role of a developer, passed through project and quickly led her to product management. For her a constant source of miracles is how technology supports and alters our behaviour, engagement and social connections.
  • Belgrade Waterfront

    Nice. How does your platform fare for semantic (not keyword based) analysis of sentiment across web and social media channels? Is there a solution you can provide for something like this?

  • Milena Yankova

    hi, thanks for asking. Out-of-the-box sentiment analysis doesn’t come with the platform as it requires specific training for each domain.

    But I’d encourage you to try the Twitter semantic analysis pipeline available as SaaS on S4.

    http://docs.s4.ontotext.com/display/S4docs/Twitter+IE

    (nice nickname, by the way)
