Triplestores are Proven as Operational Graph Databases

Claiming that RDF triplestores are typically used for offline analytics suggests unfamiliarity with their most popular use cases. Triplestores are often used in very dynamic operational database setups, such as metadata-based content management at some of the world’s largest media companies and publishers, like BBC, FT, Wiley, Elsevier, Oxford University Press and DK.

As new approaches to data management are gaining popularity, we start seeing more texts that compare the different NoSQL and particularly graph database engines. A recent example is “Graph Databases for Beginners: Other Graph Data Technologies”.

While such comparisons do a great job helping developers understand “how stuff works”, sometimes they tend to be imprecise when authors comment on engines beyond their core area of expertise. The post referred to above makes statements about triple stores, also called semantic graph databases, like these:

“However, triple stores are not “native” graph databases because they don’t support index-free adjacency, nor are their storage engines optimized for storing property graphs.”

and

“… the most common use case for triple stores is offline analytics rather than for online transactions”.

My 20+ years of accumulated experience urge me to comment.

To provide a bit of background, let’s start with:

    • Triplestores are graph database engines that, unlike engines based on Property Graphs, implement a set of comprehensive, vendor-independent standards: RDF (the data model), RDFS and OWL (schema languages) and SPARQL (query language).
    • Triplestores work with globally unique identifiers – together with a few other features, this makes them very suitable for integration of data – be it the thousands of Linked Open Data datasets or proprietary data.

Index-free Adjacency in Triplestores

Now we’ll dive deeper into the theory of data representation and indexing, so if you want to understand how these are actually implemented, don’t skip this section.

Indeed, traversal from one node of the graph to another is not the most typical operation for triplestores, so many triplestores do not provide efficient support for it out of the box. Still, the leading triplestores can be configured so that such operations are efficiently supported.

To understand how triplestores work I will provide a quick intro to the most typical designs.

Most triplestores have some sort of dictionaries, which assign each node in the graph an integer number as an internal identifier. Technically, they map the actual entity identifiers (URIs such as “http://company.com/data/person.101”) and literals (such as “Frank Lampard” and “2015-02-28T23:39:07Z”^^xsd:dateTime) to an integer number unique for the database instance.
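The dictionary idea can be illustrated with a minimal Python sketch. It is not the implementation of any particular triplestore: the class name `TermDictionary`, its methods and the predicate URI are assumptions made for illustration, and a real engine would persist the mapping on disk rather than keep it in memory.

```python
class TermDictionary:
    """Maps RDF terms (URIs and literals) to compact integer IDs and back."""

    def __init__(self):
        self._id_by_term = {}   # term -> integer ID
        self._term_by_id = []   # integer ID -> term (the list position is the ID)

    def encode(self, term):
        """Return the ID of a term, assigning the next free integer if it is new."""
        term_id = self._id_by_term.get(term)
        if term_id is None:
            term_id = len(self._term_by_id)
            self._id_by_term[term] = term_id
            self._term_by_id.append(term)
        return term_id

    def decode(self, term_id):
        """Return the original term for a previously assigned ID."""
        return self._term_by_id[term_id]


dictionary = TermDictionary()
triple = (
    "http://company.com/data/person.101",
    "http://company.com/data/name",  # hypothetical predicate URI
    '"Frank Lampard"',
)
# The triple is stored as three small integers instead of three strings.
encoded = tuple(dictionary.encode(term) for term in triple)
```

Encoding the same term again returns the same ID, so every index can refer to a node by one fixed-size integer.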

Triples are then kept in indices built over these integer IDs. The most popular such index is PSO (Predicate Subject Object), where triples are ordered first by their predicate (the type of the relationship), then by subject (the start node) and finally by object (the end node). Each element of the triple is represented in the index by its internal integer ID for efficiency.

[Figure: PSO index]

The PSO index efficiently handles queries where the predicate and the subject are known, e.g.

SELECT ?team WHERE { :Frank_Lampard :plays_for ?team}

and even when only the predicate is known:

SELECT ?who ?team WHERE { ?who :plays_for ?team}
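To see why both patterns are cheap, here is a rough Python sketch that models the PSO index as a sorted list of integer tuples and answers each query with a prefix scan. The integer IDs and the `scan_prefix` helper are hypothetical, and a sorted in-memory list merely stands in for whatever on-disk structure a real engine uses.

```python
from bisect import bisect_left

# Hypothetical integer IDs, as a term dictionary might assign them.
PLAYS_FOR, BORN_IN = 1, 2
LAMPARD, TERRY, CHELSEA, LONDON = 10, 11, 20, 30

# The PSO index: (predicate, subject, object) tuples kept in sorted order.
pso = sorted([
    (PLAYS_FOR, LAMPARD, CHELSEA),
    (PLAYS_FOR, TERRY, CHELSEA),
    (BORN_IN, LAMPARD, LONDON),
])


def scan_prefix(index, prefix):
    """Return all entries of a sorted index that start with `prefix`."""
    # Binary search finds the first candidate; matches are then contiguous.
    start = bisect_left(index, prefix)
    matches = []
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break
        matches.append(entry)
    return matches


# Predicate and subject known: { :Frank_Lampard :plays_for ?team }
teams = [o for _, _, o in scan_prefix(pso, (PLAYS_FOR, LAMPARD))]

# Only the predicate known: { ?who :plays_for ?team }
pairs = [(s, o) for _, s, o in scan_prefix(pso, (PLAYS_FOR,))]
```

In both cases the matching triples sit next to each other in the index, so the lookup is one search followed by a sequential read.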

Usually, triplestores maintain several such indices to be able to deal efficiently with different triple patterns. The concrete indices to be used are easy to configure according to the typical loads and the performance requirements for the database instance.

Note that triplestores do not “store” triples for the sake of storing them. Indexing triples in PSO and other similar indices is also the way to store them. Each triple is stored in each of the indices, which is not a problem, because its internal representation by integer IDs is sufficiently compact.
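How an engine might pick among several indices for a given triple pattern can be sketched as follows. This is an illustrative heuristic only, not the query planner of any specific product: the function name `choose_index` is an assumption, as is the set of index names passed in.

```python
def choose_index(bound, available):
    """Pick the index whose ordering starts with the most bound positions.

    `bound` is a set drawn from {"S", "P", "O"} naming the positions of the
    triple pattern that are known. `available` lists index names such as
    "PSO", where each letter names one component of the sort key in order.
    """
    def usable_prefix(name):
        # Count how many leading key components the pattern can fix.
        length = 0
        for position in name:
            if position not in bound:
                break
            length += 1
        return length

    # On a tie, max() keeps the first index listed in `available`.
    return max(available, key=usable_prefix)


indices = ["PSO", "POS", "SOP"]  # assumed configuration
by_subject_and_predicate = choose_index({"S", "P"}, indices)
by_subject_only = choose_index({"S"}, indices)
```

The longer the usable prefix, the narrower the range of the index that has to be read, which is why the choice of configured indices matters for the expected query load.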

To efficiently support graph traversal from one node to another, a triplestore needs to be configured to enable its subject-object-predicate (SOP) index. With this index switched on, a triplestore becomes a de facto “index-free adjacency” engine. One can consider the SOP index to be “the storage” of the graph database.

There are multiple deployments of triplestores which are tuned this way and do support efficient graph traversal.
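As a rough illustration of traversal over such an index, the Python sketch below keeps (subject, object, predicate) tuples in a sorted list, so all outgoing edges of a node are adjacent and can be read in one pass. The IDs and the helper names `neighbors` and `reachable` are hypothetical, and the sorted list is a simplification of a real storage engine.

```python
from bisect import bisect_left
from collections import deque

# Hypothetical integer IDs for nodes and predicates.
LAMPARD, CHELSEA, LONDON, UK = 10, 20, 30, 40
PLAYS_FOR, BASED_IN, CAPITAL_OF = 1, 3, 4

# The SOP index: (subject, object, predicate) tuples in sorted order.
sop = sorted([
    (LAMPARD, CHELSEA, PLAYS_FOR),
    (CHELSEA, LONDON, BASED_IN),
    (LONDON, UK, CAPITAL_OF),
])


def neighbors(index, node):
    """Return (target, predicate) pairs for all edges leaving `node`."""
    start = bisect_left(index, (node,))
    result = []
    for s, o, p in index[start:]:
        if s != node:
            break
        result.append((o, p))
    return result


def reachable(index, start_node, max_hops):
    """Breadth-first traversal: nodes within `max_hops` edges of `start_node`."""
    seen = {start_node}
    queue = deque([(start_node, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for target, _ in neighbors(index, node):
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return seen - {start_node}


within_two_hops = reachable(sop, LAMPARD, 2)
```

Each hop costs one lookup by subject, regardless of which predicates connect the nodes.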

[Figure: SOP index]

Triplestores as Dynamic Operational Databases

Triplestores are often used for dynamic operational databases. Plenty of such applications can be found in publishing and media, where triplestores are used for dynamic management of content, based on rich metadata descriptions.

This usage pattern is known as “dynamic semantic publishing” and it is embodied in the LDBC Semantic Publishing Benchmark (SPB). In SPB, news, images and other “creative works” are described with metadata: namely, Dublin Core-like attributes and links to the entities and concepts that are most relevant to them. Entities and concepts are described as reference data: a huge knowledge graph derived from Linked Open Data datasets such as DBpedia, GeoNames and others.

Both the metadata and the reference data are stored in a triplestore that is accessed by two types of clients (agents):

    • Aggregation agents retrieve information on specific subjects. For instance, in the Sport section of the BBC website, each topic web page (e.g. the one for Chelsea) is dynamically generated by several SPARQL queries to the underlying triplestore;
    • Editorial agents are constantly making changes to the database, either inserting metadata for newly arriving content or updating the reference data (e.g. the number of goals scored by Frank Lampard this season).

[Figure: Chelsea topic page]

In dynamic semantic publishing scenarios, triplestores typically handle hundreds of read queries per second, while in parallel processing tens of update transactions per second for knowledge graphs that contain hundreds of millions of edges.

These are real statistics from mission-critical deployments like the triplestore behind BBC Sport – a very dynamic operational database backing the website 24×7 since 2012.

How To Evaluate A Graph Database

The Linked Data Benchmark Council (LDBC) is an industry consortium governing and developing TPC-like benchmarks for graph databases and triplestores. Its members include leading vendors in both fields, e.g. IBM, Oracle, Neo Technology, OpenLink Software, Ontotext and others. On LDBC’s website one can find benchmarks and benchmark results alongside blog posts on related subjects and event announcements.

The Dynamic Semantic Publishing (DSP) application pattern was invented and first implemented by BBC’s team for their website for the FIFA World Cup 2010. A great blog post describing this project was published by Jem Rayfield. One can read about the “dynamic semantic publishing” use case and the Semantic Publishing Benchmark in a blog post that I wrote earlier this year.

On the technology side, in DSP the graph database engine needs to interplay closely with the text-mining technology used for automated metadata generation, particularly when Linked Open Data (LOD) is used for text analytics and tagging purposes. This is something that triplestores have proven to do very well – not a surprise given that LOD comes in RDF.

In April this year Philip Howard from Bloor Research completed a report on the graph database market. One can download the full version of the “Graph and RDF databases 2015” report. Philip also refers to Property Graphs as “operational graph databases” and considers triplestores incapable of “index-free adjacency”, reflecting historical trends and attitudes.

The summary about RDF databases there is as follows:

“Often semantically focused… for use in operational environments but have inferencing capabilities. Require indexes even in transactional environments. Often ACID compliance”.

Back in January, Robin Bloor published “The Graph Database and the RDF Database”, which provides a number of good insights about the differences and commonalities between RDF databases and Property Graph databases:

“Where the RDF databases really score is when you want to do set processing (a la SQL) at the same time that you want to do graph processing. Consider a query such as “Who are the biggest influencers on Twitter over the past six months?”

Both the RDF and Graph database would handle such a query and return the same results quickly. But if you ask the very different question, “Which influencers have had the same pattern of influence on Twitter over the last six months?” you are asking both for graph processing and set processing at the same time to get to the answer, and the RDF databases do both well. Not only that, but this is an area of analytics, which was virtually untapped until recently, because there was no software that could easily do it.”

Final words

The popularity of graph databases is growing, based on a good track record of projects where these engines lived up to expectations.

That’s true for all types of graph database technology: property graphs, RDF and other graph analytics. People are starting to pay more attention to the differences between the graph database standards in order to choose the one most appropriate for their application.

In this post I provided some insights on how triplestores work and how they can support graph traversal efficiently, even though “index-free adjacency” is not central to their design. I also presented the “dynamic semantic publishing” pattern – a typical use case where triplestores are used as dynamic operational databases.

I also provided references to recent posts and reports that touch on the RDF vs. Property Graphs subject. The latter excel in graph analytics, although triplestores can do this too.

I will summarize the advantages of RDF-based graph engines as follows:

To stay tuned with the development of all sorts of graph database technology, consider joining us at the 7th LDBC Technical User Community meeting on the 9th of November, at the IBM Watson center near NYC. We hold such meetings twice a year, with an agenda that covers recent developments of benchmarks, use case presentations and sometimes relevant academic speakers.

Meanwhile, please let me know what you think in the comments below.

Atanas Kiryakov

CEO at Ontotext
Atanas is a leading expert in semantic databases, author of multiple signature industry publications, including chapters from the widely acclaimed Handbook of Semantic Web Technologies.