Smart Ideas Proven at Ontotext Hackathon in March 2016

What better to do on a rainy working Saturday in Sofia than a Hackathon? Teams are ready, ideas are ambitious, and pizzas are hardly enough for the high calorie consumption that thinking out of the box requires.

Needless to say, we learned a lot. We teamed up with colleagues we hadn’t worked with closely before. We were goal-oriented and pragmatic enough that, in a day and a half, we managed to deliver meaningful results, convincing the audience that there was real value in the prototypes developed. And maybe the most valuable takeaway was that Ontotext technologies are not rigid or limited to a set of commercial products, but can be exploited in new, innovative ways. Expect to see some of those put into practice soon!

Here’s an overview of the projects:

(Semi-)automated semantic market analysis

Traditional market analysis is a tedious process that is customized and carried out manually for each beneficiary. Currently, keyword analysis of competitors’ websites determines the key concepts around which the competition is trying to position itself on the online market. Research analysts in a company gather information relevant to the market from a variety of sources, including government agencies and primary and secondary research.

Semantic technologies can automate much of the market research workflow. Given a set of resources identified as relevant to the company, our tools can extract the entities that denote the big players in the industry, their relations, their dynamics in the news, etc. This was a non-technical project, but it provided a good use case for a platform like FactForge-News, used in the Today’s News project (see below).

Social Media Monitor

Ontotext’s focus on Social Media analysis has been secondary to its work on traditional publishing channels. We are involved in the Pheme project, which lays the groundwork for future development of Social Media analysis. This small toy project analyzed our own Twitter accounts and revealed the main topics that Ontotext cares and speaks about.

Stop accidental consumption of banned substances

Every year, each sport updates and publishes its list of banned substances. Ontotext’s Linked Life Data (LLD) can be used for automated recognition of banned substances in text, be it a list of ingredients, a list of active substances in a drug, etc. The team augmented LLD with additional data from PubChem and successfully implemented such a checker during the hackathon. As a test, the ITF prohibited drug list was normalized to 11 955 distinct compounds with a total of 97 330 literals. The implemented pipeline identified both meldonium and mildronate, which were recently added to the ITF banned drugs list and over which Maria Sharapova was suspended.
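To make the idea concrete, here is a minimal sketch of a dictionary-based checker. It is not the hackathon pipeline: the tiny `SYNONYMS` table and the `find_banned` function are hypothetical stand-ins for the labels that LLD and PubChem would supply, and the matching is a plain single-token lookup.

```python
import re

# Hypothetical synonym table: each surface form maps to one canonical compound.
# A real checker would load these labels from LLD/PubChem instead of hard-coding them.
SYNONYMS = {
    "meldonium": "meldonium",
    "mildronate": "meldonium",  # brand name of the same compound
}


def find_banned(text, synonyms=SYNONYMS):
    """Return the sorted canonical names of banned compounds mentioned in text."""
    # Lowercase and split into simple word-like tokens (single-word labels only).
    tokens = re.findall(r"[a-z0-9\-]+", text.lower())
    return sorted({synonyms[t] for t in tokens if t in synonyms})


if __name__ == "__main__":
    print(find_banned("Active substance: Mildronate 250 mg, lactose"))
```

Normalizing every synonym to one canonical compound is what lets a brand name on a package be matched against a list that names only the substance. Multi-word labels would need phrase matching, which this sketch leaves out.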

Word vectors to improve text analysis models

Word vectors are the promising first steps of deep learning in the field of Natural Language Processing. We want to take Ontotext’s text analysis tools to a higher level of performance by implementing word-vector features and deep neural networks for document classification (including sentiment analysis, topic classification and keyword assignment), Named Entity Recognition, document clustering, entity clustering, topic modelling and content recommendation.

At the end of the day, we improved the current models for multi-label classification by 4%! And they were pretty good to start with, showing more than 80% F1 on a challenging use case that we developed for The IET. We also managed to cluster named entities occurring in interviews with Holocaust victims (we are part of the EHRI project), which should give us a starting point for discovering rare spellings of locations, people and organizations involved in the Holocaust.
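As a rough illustration of why word vectors help with spelling variants, the sketch below finds the nearest neighbour of a word by cosine similarity. The three-dimensional vectors are invented for the example; real vectors would be trained on a corpus and have hundreds of dimensions, and `nearest` is a hypothetical helper, not part of any Ontotext tool.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
VECTORS = {
    "krakow": [0.9, 0.1, 0.0],
    "cracow": [0.8, 0.2, 0.1],
    "train": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, vectors=VECTORS):
    """Return the other word whose vector is most similar to that of `word`."""
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine(vectors[word], vectors[w]))


if __name__ == "__main__":
    print(nearest("krakow"))
```

Words that appear in similar contexts end up with similar vectors, so variant spellings of the same place tend to land close together, which is what makes clustering a plausible way to surface them.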

Today’s News Map

Ontotext’s news showcase NOW.ontotext.com has been accumulating semantically annotated news for more than a year, linking it to a rich collection of Linked Open Data such as DBpedia and Geonames. Think of 120 thousand news articles, linked by 7 million tags to a knowledge graph of 500 million statements that describe more than 7 million entities and resources. On average, there are 70 annotations per news article, each carrying the identifier of a specific concept in the knowledge graph. The integrated resources reveal subtle and rich information, provided you know SPARQL well enough! And if you don’t, it’s definitely worth learning!

With a SPARQL query and some normalization of frequencies using z-scores, one can find the most relevant entity mentions for the day in the news. The picture below shows what was going on in the news between Feb 14th and Feb 20th 2016. With a proper UI, all these entities would be clickable and would lead to news and DBpedia articles.
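The normalization step can be sketched as follows. This assumes the SPARQL query has already returned mention counts per entity, and that each entity’s count for the day is standardized against its own past daily counts; the post does not say which baseline was actually used. The function names and the sample counts are made up for illustration.

```python
from statistics import mean, pstdev


def z_score(today, history):
    """Standardize today's count against the entity's past daily counts."""
    mu = mean(history)
    sigma = pstdev(history)
    # An entity with a perfectly flat history gets a neutral score.
    return (today - mu) / sigma if sigma else 0.0


def top_entities(counts_today, counts_history, k=3):
    """Rank entities by how unusual today's mention count is."""
    scores = {
        entity: z_score(count, counts_history[entity])
        for entity, count in counts_today.items()
        if entity in counts_history
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    history = {"Entity A": [10, 12, 11, 9], "Entity B": [100, 110, 90, 100]}
    today = {"Entity A": 30, "Entity B": 105}
    print(top_entities(today, history))
```

The point of the z-score is that an entity mentioned 30 times when it usually gets about 10 ranks above one mentioned 105 times when it usually gets about 100, so raw popularity does not drown out what is actually new.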

This platform will be released for public access and experimentation during the March 24 webinar “Boost Your Data Analytics with Open Data and Public News Content”.

 

[Image: Top line news]

SPARQL Tree Matcher

We have been writing SPARQL queries for a long time at Ontotext. It’s about time we analyze them, too! We love the recursive logic here: let’s data mine our data mining tools.

GraphDB gives us access to log files that represent submitted queries and their parse trees. We wanted to visualize the population of queries and to cluster them. So we used a trick: we represented SPARQL statements as sequences of protein amino acids, in order to take advantage of the many existing algorithms for computing phylogenetic trees. So, we had our own population of queries, with their sisters and cousins and distant relatives. On a practical note, this kind of tooling can help our support team analyze the logs of our enterprise clients’ production systems faster and be much more efficient in identifying problematic query patterns.
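The encoding trick might look roughly like the sketch below. The actual hackathon mapping worked on GraphDB parse trees and is not described in this post, so the keyword-to-letter table here is purely hypothetical; it only shows how a query can be turned into a FASTA record that standard phylogenetics tools accept.

```python
import re

# Hypothetical mapping from SPARQL keywords to one-letter amino acid codes.
KEYWORD_TO_RESIDUE = {
    "SELECT": "S", "WHERE": "W", "FILTER": "F", "OPTIONAL": "P",
    "UNION": "N", "GROUP": "G", "ORDER": "R", "LIMIT": "L", "DISTINCT": "D",
}


def encode_query(query):
    """Turn a SPARQL query into a pseudo-protein sequence, one letter per keyword."""
    words = re.findall(r"[A-Za-z]+", query)
    return "".join(
        KEYWORD_TO_RESIDUE[w.upper()] for w in words if w.upper() in KEYWORD_TO_RESIDUE
    )


def to_fasta(queries):
    """Format the encoded queries as FASTA records (header line, then sequence)."""
    return "\n".join(
        f">query_{i}\n{encode_query(q)}" for i, q in enumerate(queries, 1)
    )


if __name__ == "__main__":
    print(to_fasta([
        "SELECT DISTINCT ?s WHERE { ?s ?p ?o FILTER(?o > 1) } LIMIT 10",
        "SELECT ?s WHERE { ?s ?p ?o OPTIONAL { ?s ?q ?r } }",
    ]))
```

Once queries are plain letter sequences, alignment and tree-building tools can treat structurally similar queries as close relatives. A keyword scan like this one ignores nesting and can be fooled by variables or prefixes named after keywords, which is one reason to work from the parse trees instead.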

BG NOW

Ontotext’s semantic analysis and search technologies are diversifying. One important direction is multi-language support. We already have solid experience with German, Italian, Dutch, Bulgarian and French. This project took the necessary steps towards publishing Ontotext’s NOW.ontotext.com semantic news showcase in Bulgarian. So far, it only contains news from one big Bulgarian online publisher (a client of ours) and it is not public. Here’s an early preview:
[Image: BG news analytics]

[Image: text analytics in Bulgarian]

 

Laura Tolosi

Senior Data Scientist at Ontotext
Laura is an enthusiastic data scientist, always looking to improve semantic technologies with the latest machine learning tools. She believes that merely extracting patterns from big data should not be the ultimate goal of predictive modelling; the goal should be to understand why certain patterns occur, and thus to gain an understanding of causality.
