On February 26, 2015, Marin Dimitrov, CTO of Ontotext, presented Text Mining and Knowledge Graphs in the Cloud: The Self-Service Semantic Suite (S4). This hour-long webinar, now available on demand, discussed the capabilities of S4 and how organizations can benefit from it. Audience participation after the presentation was so great that we were unable to answer all of the questions, so we sat down with Marin to get the answers.
Q: Our worldwide audience showed a great deal of interest in the S4 News Analytics. Are there plans to provide support for languages in addition to English?
MD: Yes, multilingual support is on our roadmap. Ontotext has already provided media & publishing solutions for customers working with languages such as German, Dutch and Italian, so multilingual analytics covering other European languages will be available on the S4 platform in 2015.
Q: Can structured data, e.g. from an SQL DB dump, be added to S4?
MD: There is an existing W3C standard, RDB2RDF, which deals with ways to transform relational data into RDF. The standard defines two ways of RDF-izing relational data: a simpler direct mapping approach, and a more complex and expressive mapping language, R2RML. Several open source and commercial tools support the RDB2RDF standard and can be used right now to RDF-ize relational data. In the future the S4 platform may be extended to include such data import capabilities, but at present an SQL dump has to be pre-processed with an RDB2RDF tool before it can be imported into the RDF database in the Cloud.
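As a rough illustration of the direct mapping idea, the sketch below turns each row of a relational table into RDF triples, following the general IRI pattern of the W3C direct mapping (row IRI from table name and primary key, one predicate per column, one class per table). The base IRI, the table and the helper name direct_map are assumptions made for this example, and every value is emitted as a plain string literal, which is a simplification of what the standard prescribes.

```python
import sqlite3
from urllib.parse import quote

BASE = "http://example.com/db/"  # hypothetical base IRI
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def escape_literal(value):
    """Escape a value for use inside an N-Triples string literal."""
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return text.replace("\n", "\\n").replace("\r", "\\r")


def direct_map(conn, table, pk):
    """Yield N-Triples lines for every row of `table`, keyed by column `pk`."""
    cur = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cur.description]
    for row in cur:
        record = dict(zip(columns, row))
        subject = f"<{BASE}{quote(table)}/{quote(pk)}={quote(str(record[pk]))}>"
        # Each row is typed with a class named after its table.
        yield f"{subject} {RDF_TYPE} <{BASE}{quote(table)}> ."
        for col, value in record.items():
            if value is None:
                continue  # NULL values produce no triple
            predicate = f"<{BASE}{quote(table)}#{quote(col)}>"
            yield f'{subject} {predicate} "{escape_literal(value)}" .'


# Tiny in-memory database standing in for an SQL dump.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (ID INTEGER PRIMARY KEY, fname TEXT, city TEXT)")
conn.execute("INSERT INTO People VALUES (7, 'Bob', NULL)")
triples = list(direct_map(conn, "People", "ID"))
print("\n".join(triples))
```

The resulting N-Triples lines could then be loaded into any RDF database. A real RDB2RDF tool would additionally handle typed literals, foreign keys and tables without primary keys.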
Q: Why does S4 utilize JSON instead of JSON-LD? Are there any plans to use JSON-LD in the future?
MD: JSON-LD is indeed on our short-term roadmap and it will be supported by the S4 platform in Q2’2015. We will also be introducing a 3rd output format, which is very suitable for documents with rich formatting, where the original formatting needs to be preserved. While JSON-LD indeed provides a better way to describe the RDF enriched output of the S4 analytics services, it is still a relatively new serialisation format (from Jan’2014) and many developers still prefer to use plain old JSON. S4 will support a variety of text analytics output formats, so that developers can choose the one best suited to their expertise and needs.
Q: Does S4 convert the content to RDF?
MD: Yes. As explained in the previous answer, JSON-LD – which is a valid RDF serialisation format – will be one of the supported text analytics output formats soon. At the same time transforming the current plain JSON output into RDF (JSON-LD, Turtle, etc.) is fairly trivial.
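To give a feel for how small such a transformation can be, here is a minimal sketch that converts annotation JSON into Turtle. The JSON layout (an "entities" list with "text", "type" and "uri" fields), the ex: namespace and the function name to_turtle are all assumptions made for illustration; the actual S4 output format may differ, so consult the S4 documentation for the real field names.

```python
import json

# Hypothetical annotation output; the real S4 JSON layout may differ.
SAMPLE = """{"entities": [
  {"text": "London", "type": "Location", "uri": "http://dbpedia.org/resource/London"}
]}"""

EX = "http://example.com/schema#"  # hypothetical vocabulary namespace


def to_turtle(raw_json):
    """Convert the assumed annotation JSON into Turtle text."""
    doc = json.loads(raw_json)
    lines = [
        f"@prefix ex: <{EX}> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for entity in doc["entities"]:
        # JSON string escaping is reused here as a simplification for Turtle literals.
        label = json.dumps(entity["text"], ensure_ascii=False)
        lines.append(f"<{entity['uri']}> a ex:{entity['type']} ;")
        lines.append(f"    rdfs:label {label} .")
    return "\n".join(lines)


print(to_turtle(SAMPLE))
```

The sketch assumes that each "type" value is a valid Turtle local name and that each "uri" is a well-formed IRI; production code would validate both.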
Q: What is the S4 Browser plugin? We saw a Firefox plugin API in the demo, can we use it?
MD: The Firefox plugin for S4 makes it easy for developers to test S4 text analytics services right from the web browser. It is open source and available via the Mozilla marketplace. An S4 plugin for the Chrome browser will be available soon as well. The API keys are specific to the user (in this case, the keys are for my S4 account), not to the plugin itself, and you can generate as many personal API keys as needed when you register an S4 account at http://s4.ontotext.com/
Q: Do you have any recommended applications for those who are not fluent in SPARQL?
MD: SPARQLgraph is a powerful open source visual query builder for SPARQL, tailored for querying biomedical databases in particular. Another interesting open source tool is Quepy, a framework for translating simple natural language questions into SPARQL queries against DBpedia. A few other IDEs make SPARQL query creation easier (syntax highlighting, auto-completion, etc.), but some fluency in SPARQL is still required. Other tools such as GraphRover and Information Workbench provide faceted search and navigation over RDF data.
Q: What are the options for modifying the RDF inference rules in your solutions?
MD: The rule-based inference of GraphDB™ works according to the defined entailment rules. GraphDB rules are expressed in a simple rule language, which allows rule sets to be defined. At present the following pre-defined rule sets are bundled with GraphDB by default: empty (no reasoning), rdfs, owl-horst, owl-max (RDFS + a subset of OWL Lite), and owl2-ql and owl2-rl for the OWL 2 QL and RL profiles respectively. In addition, custom rule sets can be defined for custom inference profiles, tailored to specific use cases.
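To illustrate what rule-based entailment means in general, the following sketch applies two standard RDFS rules (rdfs9 and rdfs11 from the RDF Semantics specification) by naive forward chaining until no new triples appear. It is plain Python written for this example, not GraphDB's rule language or engine; the ex: names and the function rdfs_closure are hypothetical.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def rdfs_closure(triples):
    """Apply two RDFS entailment rules until a fixed point is reached."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                # Only pairs of the form (s p o), (o rdfs:subClassOf o2) fire a rule.
                if p2 != SUBCLASS or o != s2:
                    continue
                if p == SUBCLASS:
                    new.add((s, SUBCLASS, o2))  # rdfs11: subClassOf is transitive
                elif p == TYPE:
                    new.add((s, TYPE, o2))  # rdfs9: instances belong to superclasses
        if new <= inferred:
            return inferred  # nothing new was derived
        inferred |= new


facts = {
    ("ex:Rex", TYPE, "ex:Dog"),
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
}
closure = rdfs_closure(facts)
print(("ex:Rex", TYPE, "ex:Animal") in closure)
```

A custom rule set extends this idea with additional premise/conclusion patterns. A production reasoner uses indexes rather than the quadratic scan shown here.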
Q: How do you recommend converting the content of non-RDF documents, such as Word and PDF files, into RDF format?
MD: That depends on the format of the document and the target RDF schema that you need to conform to. A variety of frameworks support RDF-ization from structured and semi-structured formats like XML and CSV. You can also directly analyse Word documents with the S4 text analytics services (PDF support will be available soon as well) and the result will be the JSON (JSON-LD) output with RDF data for the entities extracted from the text. See the S4 documentation for details on processing MS Office document formats with the S4 text analytics services.
Q: Does S4 distinguish between UK English and US English?
MD: The training data (articles) is a mixture of British and American English content, and external resources such as DBpedia contain aliases for both, so in most cases such variations should be successfully handled by the text analytics services.
Q: Can S4 be used to analyse content from log files?
MD: Yes, as long as the files contain some entities of interest (people, locations, organisations, etc.), those entities will be identified by the S4 text analytics services.
Q: When utilizing text analytics services in S4 is it possible to link to data sources other than the ones currently available?
MD: At present, S4 text analytics services are tailored to provide mappings (interlinking) to predefined data sources such as DBpedia, Freebase, GeoNames or a set of biomedical data sources (see LinkedLifeData for details). The text analytics services can provide mappings and interlinking to custom datasets and knowledge bases, but this requires adaptation (updates to the rules, machine learning models, training data, etc.), and such custom solutions are currently provided via the Ontotext solutions for media & publishing or healthcare & life sciences.
Q: Regarding extraction of relations between entities using S4: Is the set of possible relations pre-defined, e.g. for the news and biomed domain? Or is there a way to add or train custom relations?
MD: The current set of relations extracted by the S4 text analytics services is pre-defined and includes the most generic and basic relations, which are likely to be useful for a large number of use cases. In our custom solutions there is indeed a training and customisation phase, so that relations specific to the customer's domain and needs are extracted as well (for example: economic sector indicators, investor/asset relations, etc.).
Q: Is S4 only available on Amazon Web Services (AWS) or could an organization host it in their own cloud environment?
MD: S4 is designed to utilise the AWS cloud platform for maximum scalability, performance and reliability. Individual components of the S4 platform (GraphDB, text analytics services, knowledge bases) can of course be deployed on-premises or on private clouds, but aspects such as guaranteed availability, reliability and scalability (for processing large volumes of data and serving lots of concurrent requests) are not available off-the-shelf in that case.
Q: Do you handle entities such as retail/consumer products?
MD: At present S4 text analytics services do not identify product mentions in text, though such capability may be added in future releases.
Q: Are there any plans to have text analytics for physics, history or science available?
MD: S4 does not provide text analytics for these domains, but such capabilities may be available from other (research) projects. The upcoming Extended Semantic Web Conference (2015) has a workshop related to the topic: Workshop on Semantic Web for Scientific Heritage.
Q: Does GraphDB™, the RDF database in S4, support SPIN queries? If not, would you consider supporting it in the near future?
MD: GraphDB does not support SPIN at present, though SPIN support is on the long-term product roadmap.
Q: Does S4 have the ability to summarize the content of a document, paragraph or sentence and categorize it to its proper topic?
MD: S4 text analytics does not provide a text summarisation capability at present, though it will be available in the near future. The text classifier service of S4 can also be used on the paragraph/sentence level, though its precision will be higher on longer text snippets and documents.
Q: How does one build an ontology?
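MD: The ontology modelling process is no different from any schema / entity-relationship modelling process: the main goal is to identify all the relevant classes, their properties and their relations to other classes, for the scope and the purpose of the specific problem being solved. There are various alternative approaches to ontology modelling, but the simplest one is probably from Noy & McGuinness (2001): “Ontology Development 101: A Guide to Creating Your First Ontology”. The main steps of the modelling process described there are: (1) determine the domain and scope of the ontology; (2) consider reusing existing ontologies; (3) enumerate the important terms in the ontology; (4) define the classes and the class hierarchy; (5) define the properties of the classes (slots); (6) define the facets of the slots; and (7) create instances.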
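It is also important to note the three fundamental rules guiding the ontology design process according to Noy & McGuinness: (1) there is no one correct way to model a domain, and the best solution almost always depends on the intended application and the extensions you anticipate; (2) ontology development is necessarily an iterative process; and (3) concepts in the ontology should be close to the objects (physical or logical) and relationships in your domain of interest.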
An additional useful resource for ontology modelling is the Ontology Design Patterns website.
Q: Can S4 handle typos in the input text?
MD: Yes, to some extent, since the gazetteers and dictionaries are enriched with the most common misspellings for important entities.
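The idea of a gazetteer enriched with common misspellings can be sketched as a simple alias lookup. The entries, the single-word matching and the function name find_entities below are assumptions made for illustration and do not reflect how the S4 pipelines are actually implemented.

```python
# Hypothetical gazetteer: canonical entity -> known surface forms,
# including a few common misspellings.
GAZETTEER = {
    "Mississippi": ["mississippi", "missisippi", "mississipi"],
    "Massachusetts": ["massachusetts", "massachusets"],
}

# Invert the gazetteer into a lookup table from surface form to canonical entity.
LOOKUP = {alias: entity for entity, aliases in GAZETTEER.items() for alias in aliases}

PUNCTUATION = ".,;:!?()\"'"


def find_entities(text):
    """Return (surface form, canonical entity) pairs for single-word matches."""
    matches = []
    for token in text.split():
        surface = token.strip(PUNCTUATION)
        entity = LOOKUP.get(surface.lower())
        if entity is not None:
            matches.append((surface, entity))
    return matches


print(find_entities("Flooding along the Missisippi, officials said."))
```

Only misspellings listed in the gazetteer are recognised, which is consistent with typos being handled "to some extent" rather than in general.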
Q: S4 stands for “Self-Service Semantic Suite.” Can you please talk about the “self-service” aspect?
MD: The goal of S4 is to make various capabilities for text analytics and metadata management instantly accessible, available on demand and affordable to developers in SMEs. With S4, developers get more freedom to experiment with new approaches to data management, and quickly develop proof-of-concept prototypes without being restricted by budgeting, planning, licensing and operations constraints. The various capabilities for semantic data management are accessible via simple RESTful services and a developer can quickly start prototyping.
Q: Is there a limit on the number of documents that can be processed?
MD: Currently S4 provides a free quota of 250 MB of text processed. Depending on the types of documents processed this averages out to around 1.5 million tweets, or to 50,000 web pages (assuming a 5 KB average size of text). The pricing plans will be introduced in Q2’2015 and will offer flexible pay-per-use options for the various text analytics services of S4. Beyond the free quota there’s really no limit to the volume of data that S4 can process, since it’s designed to quickly scale up the computing infrastructure and accommodate large data volumes for processing.
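The quota figures can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 MB = 1,000,000 bytes, 1 KB = 1,000 bytes), which is the reading under which the quoted numbers work out exactly; the implied tweet size is derived here and is not stated in the answer.

```python
QUOTA_BYTES = 250 * 1000 * 1000  # 250 MB free quota, assuming decimal megabytes
PAGE_BYTES = 5 * 1000            # assumed 5 KB of text per web page
TWEETS = 1_500_000               # tweet count quoted for the same quota

pages = QUOTA_BYTES // PAGE_BYTES
bytes_per_tweet = QUOTA_BYTES / TWEETS

print(pages)                   # 50000
print(round(bytes_per_tweet))  # 167
```

In other words, the tweet estimate corresponds to roughly 167 bytes of text per tweet.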
Q: We saw the biomedical annotation pipeline. Has it been used in production, and for what types of applications?
MD: Yes, the biomedical text analytics service of S4 is based on the text analytics solutions that Ontotext has provided to customers in the healthcare & life sciences domain. The AstraZeneca use case was mentioned during the webinar: the goal there was to analyse a large number of clinical trial documents and extract and interlink entities of interest, so that powerful semantic search can replace the traditional full-text search over the document repositories.