(Semi-)automated semantic market analysis
Traditional market analysis is a tedious process, customized and carried out manually for each client. Currently, keyword analysis of competitor websites determines the key concepts around which the competition tries to position itself on the online market. Research analysts in a company gather market-relevant information from a variety of sources, including government agencies and primary and secondary research.
Semantic technologies can automate much of the market research workflow. Given a set of resources identified as relevant to the company, our tools can extract the entities that denote the big players in the industry, their relations, their dynamics in the news, and so on. This was a non-technical project, but it provided a good use case for a platform like FactForge-News, used in the Today’s News project (see below).
Social Media Monitor
Ontotext’s focus on Social Media analysis has been secondary to the traditional publishing channels. We are involved in the Pheme project, which lays the groundwork for future development of Social Media analysis. This small toy project analyzed our own Twitter accounts and revealed the main topics that Ontotext cares and talks about.
Stop accidental consumption of banned substances
Every year, each sport updates and publishes its list of substances banned under anti-doping rules. Ontotext’s Linked Life Data (LLD) can be used for automated recognition of banned substances in text, be that a list of ingredients, a list of active substances in a drug, etc. The team managed to augment LLD with additional data from PubChem and successfully implemented such a checker during the hackathon. As a test, the ITF prohibited drug list was normalized to 11 955 distinct compounds with an overall of 97 330 literals. The implemented pipeline was able to identify both meldonium and mildronate, which were recently added to the ITF banned drugs list and because of which Maria Sharapova was suspended.
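To give a rough idea of how such a checker can work, here is a minimal dictionary-lookup sketch. The synonym table and compound IDs below are illustrative stand-ins for the roughly 12 000 compounds extracted from LLD and PubChem; the real pipeline uses full text analysis rather than naive tokenization.

```python
# A minimal sketch of dictionary-based substance recognition.
# The data below is illustrative, not the actual LLD/PubChem dictionary.

# Map every known synonym (lower-cased) to a canonical compound identifier.
# Note that "meldonium" and "mildronate" name the same compound, so both
# surface forms resolve to one ID.
SYNONYMS = {
    "meldonium": "compound:0001",
    "mildronate": "compound:0001",
    "salbutamol": "compound:0002",
}

def find_banned(text):
    """Return the set of canonical compound IDs mentioned in the text."""
    tokens = text.lower().replace(",", " ").split()
    return {SYNONYMS[t] for t in tokens if t in SYNONYMS}

print(find_banned("Ingredients: mildronate, caffeine"))  # → {'compound:0001'}
```

Because the lookup is keyed on canonical IDs rather than surface forms, the checker flags a product whichever synonym its label happens to use.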
Word vectors to improve text analysis models
Word vectors are the promising first steps of deep learning in the field of Natural Language Processing. We want to take Ontotext’s text analysis tools to a higher level of performance by adding word-vector features and deep neural networks to document classification (including sentiment analysis, topic classification and keyword assignment), Named Entity Recognition, document clustering, entity clustering, topic modelling and content recommendation.
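To illustrate what a word-vector feature looks like, here is a toy sketch that represents a document as the average of its word vectors, a common way to turn embeddings into classifier features. The two-dimensional vectors below are invented; real ones come from pre-trained embedding models with hundreds of dimensions.

```python
# Toy word vectors (invented); in practice these come from word2vec- or
# GloVe-style training on a large corpus.
VECTORS = {
    "doping": [0.9, 0.1],
    "substance": [0.8, 0.2],
    "football": [0.1, 0.9],
    "match": [0.2, 0.8],
}

def doc_vector(tokens, dim=2):
    """Average the vectors of known tokens into one dense document feature."""
    vecs = [VECTORS[t] for t in tokens if t in VECTORS]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

print(doc_vector(["doping", "substance"]))
```

The resulting dense vector can be fed to any downstream classifier, which is what lets the same features serve classification, clustering and recommendation alike.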
At the end of the day, we improved the current models for multi-label classification by 4%! And they were pretty good to start with, showing more than 80% F1 on a challenging use case that we developed for The IET. We also managed to cluster named entities occurring in interviews with Holocaust victims (we are part of the EHRI project), which should give us a starting point for discovering rare spellings of locations, people and organizations involved in the Holocaust.
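The actual entity clustering used embedding-based models, but the idea of grouping variant spellings can be sketched with plain string similarity. The names and threshold below are illustrative only:

```python
import difflib

# Group entity surface forms whose string similarity exceeds a threshold --
# a cheap proxy for embedding-based entity clustering. Names are illustrative.
names = ["Auschwitz", "Aushwitz", "Oswiecim", "Warsaw", "Warszawa"]

def cluster(names, threshold=0.7):
    """Greedy single-pass clustering by similarity to each cluster's seed."""
    clusters = []
    for name in names:
        for c in clusters:
            ratio = difflib.SequenceMatcher(None, name.lower(), c[0].lower()).ratio()
            if ratio >= threshold:
                c.append(name)
                break
        else:
            clusters.append([name])
    return clusters

print(cluster(names))
```

Even this crude approach groups "Aushwitz" with "Auschwitz", which is exactly the kind of rare-spelling discovery the EHRI material calls for; the embedding-based variant additionally catches translations like "Warsaw"/"Warszawa" that share little surface form.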
Today’s News Map
Ontotext’s news showcase NOW.ontotext.com has been accumulating semantically annotated news for more than a year, linking the articles to a rich collection of Linked Open Data such as DBpedia, Geonames, etc. Think of 120 thousand news articles, linked with 7 million tags to a knowledge graph of 500 million statements, describing more than 7 million entities and resources. On average, each article carries 70 annotations pointing to specific concepts in the knowledge graph. The integrated resources reveal subtle and rich information, provided you know SPARQL well enough. And if you don’t, it’s definitely worth learning!
With a SPARQL query and some normalization of frequencies using z-scores, one can find the most relevant entity mentions for the day in the news. Here’s what was going on in the news between Feb 14th and Feb 20th, 2016 (the picture below). With a proper UI, all these entities would be clickable, leading to news and DBpedia articles.
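The z-score trick itself is easy to sketch outside SPARQL: an entity trends when today’s mention count is far above its historical mean, measured in standard deviations. The entities and daily counts below are invented for illustration:

```python
from statistics import mean, pstdev

# Invented daily mention counts over the past week (last value = today).
history = {
    "dbr:Zika_virus":   [2, 3, 2, 4, 3, 2, 40],
    "dbr:Barack_Obama": [50, 48, 52, 49, 51, 50, 53],
}

def z_score_today(counts):
    """How unusual is today's count relative to the preceding days?"""
    past, today = counts[:-1], counts[-1]
    sd = pstdev(past) or 1.0  # guard against zero variance
    return (today - mean(past)) / sd

trending = sorted(history, key=lambda e: z_score_today(history[e]), reverse=True)
print(trending[0])  # → dbr:Zika_virus
```

The normalization is what keeps perennially frequent entities like heads of state from drowning out a genuinely new story: an entity mentioned 53 times against a steady baseline of 50 scores far lower than one jumping from 3 mentions to 40.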
This platform will be released for public access and experimentation during the March 24 webinar “Boost Your Data Analytics with Open Data and Public News Content”.
SPARQL Tree Matcher
We have been writing SPARQL queries at Ontotext for a long time. It’s about time we analyzed them, too! We love the recursive logic here: let’s data mine our data mining tools.
GraphDB gives us access to log files that contain the submitted queries and their parse trees. We wanted to visualize the population of queries and to cluster them. So we used a trick: we represented SPARQL statements as amino-acid sequences, in order to take advantage of the many existing algorithms for computing phylogenetic trees. Thus we had our own population of queries, with their sisters, cousins and distant relatives. On a practical note, this kind of tooling can help our support team analyze the logs of our enterprise clients’ production systems faster and be much more efficient in identifying problematic query patterns.
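A hedged sketch of the "amino acid" encoding: map each parse-tree node type to a single letter, so that standard sequence-comparison tooling can measure how related two queries are. The alphabet and node types below are assumptions for illustration, not the actual encoding used in the project.

```python
import difflib

# Illustrative mapping from SPARQL algebra node types to one-letter codes,
# by analogy with the 20-letter amino-acid alphabet.
ALPHABET = {"SELECT": "S", "BGP": "B", "FILTER": "F", "OPTIONAL": "O", "UNION": "U"}

def encode(node_types):
    """Flatten a pre-order walk of a query's parse tree into a letter sequence."""
    return "".join(ALPHABET[n] for n in node_types)

q1 = encode(["SELECT", "BGP", "FILTER"])
q2 = encode(["SELECT", "BGP", "OPTIONAL", "FILTER"])

# Pairwise sequence similarity plays the role of the distance matrix
# that phylogenetic-tree algorithms consume.
sim = difflib.SequenceMatcher(None, q1, q2).ratio()
print(q1, q2, round(sim, 2))
```

Once every logged query is a short sequence, off-the-shelf alignment and tree-building tools can group near-identical query shapes, which is what makes spotting a recurring problematic pattern in a client’s logs fast.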
NOW.ontotext.com in Bulgarian
Ontotext’s semantic analysis and search technologies are in a process of diversification. One important direction is multi-language support. We already have solid experience with German, Italian, Dutch, Bulgarian and French. This project took the necessary steps towards publishing Ontotext’s NOW.ontotext.com semantic news showcase in Bulgarian. So far, it only contains news from one big Bulgarian online publisher (a client of ours) and it is not public. Here’s an early preview: