The Ontotext Case – Automated Detection of Anomalous Industry Classification in Linked Data
Business problem formulation
Classification of companies into industry sectors is a fundamental task for unlocking advanced business intelligence capabilities. However, different data sources rarely use the same classification system, if they use one at all. This is a major obstacle to taking advantage of the details available in Open Data and in niche commercial data sources that either lack industry classifications or use inconsistent ones.
Industry information is often recorded as part of the initial manual data collection, and if no appropriate industry is available in the classification, the field is often left empty. On the other hand, when data is mined automatically, e.g. from a textual document, information about the company's industry is often missing from the original source and is therefore not assigned.
In this case, we aim to develop an automated and standardized classification model that can be applied to any source to enrich the originally available data with industry sector information.
Industry classifications vary widely, ranging from flat 15-class schemes to very complex taxonomies. At one extreme are bottom-up, application-centric approaches to collecting and using data, with limited ambition for sharing and reuse. At the other are top-down government or institutional statistical classifications, which tend to be too abstract and too complex to be practically useful.
Besides explicit classifications made by experts, a number of other clues can point to a company’s particular line of business. Textual company descriptions, information about product lines and news mentions are all valid indicators of a company’s potential classification. The objective of this case is to leverage such clues in order to improve the existing explicit classification.
What is the business problem?
High-quality, commercially available company data may be unaffordable for many data analytics and business analytics users. Many niche but highly valuable data sources fall short on details about industry sector. At the same time, the amount of Open Data (official or crowdsourced) is growing, but it often lacks a standardized yet practical approach to industry classification.
Why does the business need to solve the problem?
A standardized industry classification is an enabling feature for various data processing and advanced analytical tasks, such as:
- Reconciliation of company records from different sources
- Measuring similarity between companies
- Calculating a company ranking score, e.g. Popularity Rank of Global Banks
What are the important problem specifics (from a business perspective) that have to be accounted for in the solution?
The case is well suited to experimenting with different approaches and techniques and comparing their precision, all the way from classification based solely on text descriptions to more advanced settings where richer context is derived from a knowledge base.
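As a point of reference for the simplest end of that range, here is a minimal sketch of a text-only baseline: a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The training examples are invented for illustration and the "Finance" label is an assumption ("Transport" is the only class name mentioned in this case); this is not the method prescribed for the case, only one possible starting point.

```python
import math
from collections import Counter, defaultdict

# Toy training data: (description, industry label). Purely illustrative,
# NOT taken from the FactForge dataset; "Finance" is a hypothetical class.
TRAIN = [
    ("regional airline operating passenger flights", "Transport"),
    ("rail freight and logistics operator", "Transport"),
    ("retail bank offering loans and savings accounts", "Finance"),
    ("investment bank and asset management firm", "Finance"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothed word likelihood.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAIN)
print(predict(model, "commercial bank providing loans"))  # prints "Finance"
```

A baseline of this kind gives a precision figure against which richer, knowledge-base-derived features can be compared.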
All the data for the task can be found in FactForge. Currently the dataset contains 260152 organisations classified in 32 top-level categories. We will provide a CSV dump with simple features for each organisation and a number of SPARQL queries that can be used for more complicated graph-based feature extraction from FactForge.
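A first step with a dump like this is usually to check how the organisations are distributed over the categories. The sketch below assumes a hypothetical column named `industry` and uses a tiny invented sample in place of the real file, whose actual layout is not described here.

```python
import csv
import io
from collections import Counter

# Invented sample standing in for the real CSV dump; the column names
# ("org", "industry") and the rows are assumptions for illustration only.
SAMPLE_CSV = """org,industry
Acme Air,Transport
Blue Rail,Transport
First Bank,Finance
"""


def class_distribution(csv_file, label_column="industry"):
    """Count how many rows fall into each value of the label column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


counts = class_distribution(io.StringIO(SAMPLE_CSV))
for label, n in counts.most_common():
    print(label, n)
```

With the real dump, the `io.StringIO` object would be replaced by an opened file, and a strongly skewed distribution would suggest using per-class metrics rather than overall accuracy.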
Unlike many established industry classifications, the one we will be using was partly crowdsourced and partly derived in a bottom-up manner from information in DBpedia. When a user creates an entry in Wikipedia, they specify an “industry” attribute. We then manually normalised these attributes in order to derive a stable classification. This query shows all the attributes that are currently used to infer the “Transport” top-level class.
Extensive additional instructions for the case can be found here: https://gitlab.ontotext.com/trainings/global_datathon
Technical Challenges around the experiments:
It will be interesting to experiment with methods that tolerate noise in the classifications that come from the training data. Whatever training set we can think of, there will always be bad examples. In rare cases, this is because someone made a mistake. But very often it is because a sector is only marginally relevant to a specific company, in which case the relationship is weak in both directions: the company is not a typical representative of the sector, and the sector classification is not an important part of the description of the business.
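One simple way to surface such bad examples, sketched here under invented data, is to flag every organisation whose label disagrees with the majority label of its most similar neighbours. The function name `flag_suspicious`, the Jaccard similarity over description tokens, and the choice of `k=3` are all assumptions made for illustration; many other noise-tolerant techniques would fit the case equally well.

```python
from collections import Counter

# Invented examples: (description, label). Index 3 is deliberately
# mislabelled to play the role of a noisy training example.
EXAMPLES = [
    ("airline passenger flights", "Transport"),
    ("airline cargo flights", "Transport"),
    ("airline charter flights", "Transport"),
    ("airline regional flights", "Finance"),
    ("bank loans savings", "Finance"),
    ("bank loans mortgages", "Finance"),
    ("bank loans insurance", "Finance"),
]


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flag_suspicious(examples, k=3):
    """Return indices whose label disagrees with the majority label of
    their k most similar other examples (leave-one-out)."""
    token_sets = [set(text.lower().split()) for text, _ in examples]
    flagged = []
    for i, (_, label) in enumerate(examples):
        # Rank all other examples by similarity to example i.
        neighbours = sorted(
            (j for j in range(len(examples)) if j != i),
            key=lambda j: jaccard(token_sets[i], token_sets[j]),
            reverse=True,
        )[:k]
        votes = Counter(examples[j][1] for j in neighbours)
        majority, _ = votes.most_common(1)[0]
        if majority != label:
            flagged.append(i)
    return flagged


print(flag_suspicious(EXAMPLES))  # prints [3]
```

Flagged examples could be dropped, down-weighted, or reviewed manually. Which of these is appropriate depends on whether the disagreement reflects a genuine error or a sector that is only marginally relevant to the company.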
See the discussion for this case in the Data.Chat HERE
The Ontotext experts for the Datathon
Nikola Tulechki, Data Scientist and Semantic consultant at Ontotext AD
Nikola Tulechki has a PhD from Université Fédérale de Toulouse where he worked on natural language processing of incident and accident reports with applications to risk management in the aviation safety sector. He moved to Bulgaria in 2017 and works at Ontotext AD as a data scientist and semantic consultant.
Atanas Kiryakov – founder and CEO of Ontotext – Semantic Web pioneer and vendor of semantic technology.
Atanas is a member of the board of the Linked Data Benchmarking Council, a standardization body whose members include all major graph database vendors.
Atanas is an expert in semantic graph databases, reasoning, knowledge graph design, text mining, semantic tagging, linking and search. He is the author of signature academic publications with more than 2500 citations. Until 2010 Atanas was product manager of GraphDB (formerly OWLIM) – a leading RDF graph database serving mission-critical projects in organizations like BBC, Financial Times, Nikkei, S&P, Springer Nature, John Wiley, Elsevier, UK Parliament, Kadaster.NL, The National Gallery of US and top-5 US Banks.
As CEO of Ontotext, Atanas has supervised multiple high-profile semantic technology projects across different sectors: media and publishing, market and investment information, cultural heritage and government.