Exciting as the things GraphDB allows you to do are (exploring heterogeneous datasets, building relationships between facts, uncovering meaning inside unstructured data and inferring new knowledge, to mention just a few), they all start with the, to put it mildly, not so inspiring task of cleaning your data and transforming it into RDF.
In practice, before the leaps of data-driven insights and actions come the heaps of inconsistent, unfiltered and heterogeneous data that need to be cleaned up. For the data worker, having to deal with such messy data is not unlike the fifth labor of Hercules, in which the hero gets the dirty job of cleaning the Augean Stables.
Saving Time and Effort with GraphDB’s OntoRefine
With plenty of tools available for cleaning and converting data, the question of leveraging legacy data is not so much how to get these data transformed into interoperable pieces that are easy to query and integrate (read RDF, the so-called backbone of the Semantic Web), but rather how to do this with maximum productivity and minimum wasted effort.
And this is where OntoRefine comes into play.
OntoRefine is a new addition to GraphDB that allows you to do many ETL (extract, transform and load) tasks over tabular data through an intuitive user interface. Based on OpenRefine (formerly called Google Refine), the open source tool for working with messy data, and embedded in GraphDB, OntoRefine makes the process of filtering and editing inconsistent data easy and frictionless.
To get back to the Augean Stables parallel, think of OntoRefine as the witty little tool of the brave data hero tasked with the dirty job of data cleanup and transformation.
Before OntoRefine, to turn tabular data into interlinked graph data, the data had to be loaded into one tool, cleaned manually, exported and then imported into another tool to be transformed into RDF. Finally, after yet another round of exporting and importing, the RDF dataset had to be loaded into GraphDB. With OntoRefine, all of these steps can happen within GraphDB. Cleaning up and transforming a non-RDF dataset thus becomes a fast and easy process, leaving more time for the things that really matter: running queries to discover interesting relationships within data, integrating data and, in short, enjoying the full power of working with data as a graph.
Key to what OntoRefine does is the heavy lifting: removing inconsistencies, filtering the data, converting it into RDF and then importing the dataset into the repository, all in one place. OntoRefine can be used for converting tabular data into RDF and importing it into a GraphDB repository, using simple SPARQL queries and a virtual endpoint. The supported formats include various line-based files, TSV, CSV, *SV, XLS, XLSX, JSON, XML, RDF as XML, and Google Sheets.
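To make the idea of turning a table into RDF a little more concrete, here is a minimal sketch in plain Python. It is not how OntoRefine works internally, and it does not use GraphDB or SPARQL at all; it only illustrates the general pattern of mapping each row to a subject and each non-empty cell to a triple. The `http://example.org/` namespace, the function names and the sample data are all made up for illustration.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace, chosen only for this illustration.
BASE = "http://example.org/"


def escape_literal(value):
    """Escape a string so it can be used as an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(csv_text, id_column):
    """Turn each CSV row into one N-Triples line per non-empty cell."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The row identifier becomes the subject IRI.
        subject = "<{}resource/{}>".format(BASE, quote(row[id_column].strip(), safe=""))
        for column, cell in row.items():
            # Skip the id column, surplus cells (key None) and missing cells.
            if column is None or column == id_column or cell is None:
                continue
            cell = cell.strip()  # basic clean-up: trim stray whitespace
            if not cell:
                continue  # empty cells produce no triple
            predicate = "<{}property/{}>".format(BASE, quote(column.strip(), safe=""))
            triples.append('{} {} "{}" .'.format(subject, predicate, escape_literal(cell)))
    return triples


sample = "id,name,city\n1, Alice ,Sofia\n2,Bob,\n"
for line in rows_to_ntriples(sample, "id"):
    print(line)
```

Even this toy version has to trim whitespace and skip empty cells before it can emit clean triples, which is the kind of chore that a tool with a user interface takes off your hands.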
From the vantage point of understanding the power of working with data as a graph, OntoRefine is a tiny yet important step toward thinking outside the table.