AstraZeneca: Causality Data Mining

Early Hypothesis Testing Through Linked Data

AstraZeneca is a global research-based bio-pharmaceutical company with skills and resources focused on discovering, developing and marketing medicines for some of the world’s most serious illnesses, including cancer, heart disease, neurological disorders such as schizophrenia, respiratory disease and infection.

The Goal

Success in the pharmaceutical research and discovery process depends heavily on the availability and accessibility of high-quality research data. Data quality can be assessed by its accuracy, correctness, completeness, currency and relevance. While accuracy and correctness are determined solely by the methods used to generate the data, the latter three (completeness, currency and relevance) can be partially or completely determined by an effective semantic data integration approach, which:

  • Aggregates all relevant information;
  • Removes redundancy and ambiguities in the data;
  • Interlinks the related entities.

Researchers gather information iteratively from a broad range of biomedical data sources in order to generate or expand a theory, to test hypotheses, and to make informed assertions about which relationships are causal and exactly how they are causal. They need a mechanism that allows them to mine the data scattered across the relevant resources and to identify visible (direct) and invisible (distant) relations between the biomedical entities studied during the pharmaceutical research and discovery process.

The Challenge

Develop a platform for Interactive Relationship Discovery that allows the identification of long causal relationship chains between biomedical objects in the Linked Life Data cloud. The platform will be used for early hypothesis testing, which requires identifying direct and indirect relations between biomedical entities and hinting at a possible mechanism that would otherwise remain hidden.

To facilitate relationship discovery, the platform should provide an easy, intuitive tool that allows researchers to mine and explore the causal relations interactively.
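As a rough illustration of what identifying relationship chains can mean in practice, the sketch below runs a bounded breadth-first search over a handful of subject/relation/object triples. Everything in it is an assumption made for illustration: the entity identifiers, the relation names, and the `find_chains` helper are hypothetical and do not reflect the actual Linked Life Data schema or the platform's implementation.

```python
from collections import deque

# Hypothetical toy data: (subject, relation, object) triples.
# Identifiers and relation names are invented for this example.
TRIPLES = [
    ("drug:D1", "inhibits", "protein:P1"),
    ("protein:P1", "participates_in", "pathway:W1"),
    ("pathway:W1", "implicated_in", "disease:X1"),
    ("drug:D1", "has_side_effect", "symptom:S1"),
    ("symptom:S1", "symptom_of", "disease:X1"),
]


def build_index(triples):
    """Map each subject to its outgoing (relation, object) pairs."""
    index = {}
    for subj, rel, obj in triples:
        index.setdefault(subj, []).append((rel, obj))
    return index


def find_chains(triples, source, target, max_hops=4):
    """Return all cycle-free chains from source to target, shortest first.

    Each chain is a list of (subject, relation, object) steps and has
    at most max_hops steps.
    """
    index = build_index(triples)
    chains = []
    queue = deque([(source, [])])
    while queue:
        node, path = queue.popleft()
        if node == target and path:
            chains.append(path)
            continue
        if len(path) >= max_hops:
            continue
        # Entities already on this chain; skipping them avoids cycles.
        visited = {source} | {step[2] for step in path}
        for rel, nxt in index.get(node, []):
            if nxt not in visited:
                queue.append((nxt, path + [(node, rel, nxt)]))
    return chains


if __name__ == "__main__":
    for chain in find_chains(TRIPLES, "drug:D1", "disease:X1"):
        print(" -> ".join(f"{s} [{r}] {o}" for s, r, o in chain))
```

The `max_hops` bound matters because, in a densely interlinked graph, the number of possible chains grows quickly with length; a real system would likely also need ranking or filtering on top of plain enumeration.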

The Solution: Linked Life Data Cloud

Semantic warehousing is a suitable approach for giving researchers an overview of the existing relationships within scientific and clinical data by means of causality data mining. Linked Life Data is used as the platform for Interactive Relationship Discovery between biomedical entities because it:

  • Integrates over 25 diverse data sources;
  • Aligns the data to more than 17 different types of biomedical objects (genes, proteins, molecular functions, biological processes/pathways, molecular interactions, cell localization, organisms, organs/tissues, cell lines, cell types, diseases, symptoms, drugs, drug side effects, small chemical compounds, clinical trials, scientific publications, etc.);
  • Identifies explicit relationships between entities locked in the original data sets and categorizes them according to a causality relation ontology;
  • Mines unstructured data in order to identify relations hidden within text (such as inclusion/exclusion criteria for clinical studies).


Since the entities in Linked Life Data are usually strongly interlinked, simply crawling or querying the repository for relationships and listing them is not sufficient in most cases. For this reason, Linked Life Data also defines a user-centered process and provides interactive tools that assist in the discovery of even very large numbers of causal relations.

Users can efficiently get an overview of the relationships found, explore them interactively, and easily spot and separate the relationships that are relevant to a given use case.
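One simple way to turn a long list of discovered chains into an overview is to group them by the sequence of relation types they follow and then keep only the groups relevant to the question at hand. The sketch below shows that idea on invented data; the chain contents, relation names, and both helper functions are hypothetical and are not a description of the interactive tools that Linked Life Data provides.

```python
from collections import defaultdict

# Hypothetical chains, each a list of (subject, relation, object) steps.
CHAINS = [
    [("drug:D1", "inhibits", "protein:P1"),
     ("protein:P1", "implicated_in", "disease:X1")],
    [("drug:D1", "inhibits", "protein:P2"),
     ("protein:P2", "implicated_in", "disease:X1")],
    [("drug:D1", "has_side_effect", "symptom:S1"),
     ("symptom:S1", "symptom_of", "disease:X1")],
]


def group_by_pattern(chains):
    """Group chains by their ordered sequence of relation names."""
    groups = defaultdict(list)
    for chain in chains:
        pattern = tuple(rel for _, rel, _ in chain)
        groups[pattern].append(chain)
    return dict(groups)


def keep_patterns(groups, required_relation):
    """Keep only the groups whose pattern contains the given relation."""
    return {
        pattern: members
        for pattern, members in groups.items()
        if required_relation in pattern
    }


if __name__ == "__main__":
    groups = group_by_pattern(CHAINS)
    # Overview: one line per relation pattern with its chain count.
    for pattern, members in groups.items():
        print(" / ".join(pattern), "->", len(members), "chain(s)")
    # Narrow down to chains that pass through an inhibition step.
    relevant = keep_patterns(groups, "inhibits")
    print(len(relevant), "pattern(s) involve 'inhibits'")
```

Grouping by pattern shrinks the display from one row per chain to one row per kind of chain, which is one plausible reading of "getting an overview" when the number of relations is very large.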
