
Monthly Challenge – Ontotext case – Solution – Team epistemi


Week 1. Data Understanding + Feature Extraction

We were provided with a dataset of 277419 records extracted from DBpedia. It has 6 features and 1 label with 32 classes, as summarised in the table below:

ID | Item Name    | Item Type | Observations
1  | org          | Feature   | URL; may be useful for extraction of additional features
2  | names        | Feature   | Literal name of Wikipedia organisations
3  | types        | Feature   | Suspect this is extracted from property “rdf:type”
4  | descriptions | Feature   | Extracted from property “dbo:abstract”
5  | locations    | Feature   | Extracted from property “dbo:country”
6  | categories   | Feature   | Extracted from property “dct:subject”
7  | industries   | Label     | Suspect this is manually classified

Jupyter Notebook used for EDA


Initial Understanding of the Target Label “industries”

Of the 32 industry classes, just three account for two thirds of the dataset, while nearly half of the classes each make up less than 0.5% of all instances. We will have to keep this imbalance in mind, since it will certainly affect how we measure performance and select models. For this case study in particular, where the focus is on Type I and Type II errors, the F1 score and ROC curves spring to mind.
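The imbalance and its effect on metrics can be checked with a few lines of plain Python. The sketch below is only illustrative: it assumes that multi-label records join their classes with ";" (as the ‘types’ values do) and that a missing label is None or an empty string. The function names `class_shares` and `macro_f1` and the toy labels are hypothetical and do not come from the notebook.

```python
from collections import Counter

def class_shares(labels, sep=";"):
    """Share of label assignments per class, largest first.

    Assumes multi-label records join their classes with `sep`
    and that a missing label is None or an empty string.
    """
    counts = Counter()
    for raw in labels:
        if not raw:  # skip records without a label
            continue
        counts.update(part.strip() for part in raw.split(sep))
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.most_common()}

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (single-label case)."""
    classes = set(y_true) | set(y_pred)
    scores = []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels, not taken from the real dataset.
toy_labels = ["A", "A", "A;B", "B", None, "C"]
shares = class_shares(toy_labels)

# A model that always predicts the majority class looks fine on accuracy...
y_true = ["A"] * 8 + ["B", "C"]
y_pred = ["A"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# ...but macro F1 shows that the minority classes are never found.
macro = macro_f1(y_true, y_pred)
```

On the toy predictions, accuracy is 0.8 while macro F1 is roughly 0.30, which is why accuracy alone would be a misleading measure on a label distribution like ours.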

While 21874 companies have 2 or more industry categories, 17942 companies have no industry category at all. Among the companies with missing labels, I believe those whose ‘types’ feature is “Company;LawFirm” could be classified under “Justice_and_law”. I might need to drop the rest, unless I can extract the label from the existing dataset.
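A minimal sketch of this label audit and of the law-firm rule is given below. It assumes each record is a dict keyed by the column names from the table, that multiple industries are joined with ";", and that a missing label is None or an empty string. The helper names `audit_labels` and `fill_law_firms` and the toy records are hypothetical.

```python
def audit_labels(records, sep=";"):
    """Return (multi_label_count, missing_label_count)."""
    multi = sum(
        1 for r in records
        if r.get("industries") and sep in r["industries"]
    )
    missing = sum(1 for r in records if not r.get("industries"))
    return multi, missing

def fill_law_firms(records):
    """Return copies of the records in which unlabelled
    'Company;LawFirm' rows are assigned 'Justice_and_law'."""
    filled = []
    for r in records:
        r = dict(r)  # copy, so the input is left untouched
        if not r.get("industries") and r.get("types") == "Company;LawFirm":
            r["industries"] = "Justice_and_law"
        filled.append(r)
    return filled

# Toy records, not taken from the real dataset.
toy = [
    {"types": "Company", "industries": "Media;Finance"},
    {"types": "Company;LawFirm", "industries": None},
    {"types": "Company", "industries": ""},
    {"types": "Company", "industries": "Media"},
]
multi_before, missing_before = audit_labels(toy)
filled = fill_law_firms(toy)
multi_after, missing_after = audit_labels(filled)
```

The rule only touches rows that have no label, so existing labels are never overwritten. Rows that remain unlabelled afterwards are the candidates for dropping.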


Initial Understanding of the Features

Similarities between ‘types’ and ‘categories’

A preliminary check suggests that ‘types’ and ‘categories’ are very similar features. Using spaCy, I computed the similarity between the two on a 1% sample of the data, and the results supported the idea. The ECDF graph shows that about 60% of the sampled similarities are 70% or more.
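For readers unfamiliar with the ECDF, the sketch below shows how such a figure can be read from a list of similarity scores. It assumes the scores were already computed for each sampled record (for instance with spaCy's `Doc.similarity`, which needs a model that ships word vectors). The scores in the example are made up for illustration, and `ecdf` and `share_at_least` are hypothetical helper names.

```python
def ecdf(values):
    """Return sorted values and the cumulative share of data at or below each."""
    xs = sorted(values)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys

def share_at_least(values, threshold):
    """Fraction of values that are greater than or equal to the threshold."""
    return sum(1 for v in values if v >= threshold) / len(values)

# Made-up similarity scores, one per sampled record.
scores = [0.2, 0.5, 0.7, 0.8, 0.9]
xs, ys = ecdf(scores)
high_share = share_at_least(scores, 0.7)
```

Plotting `ys` against `xs` as a step curve gives the ECDF, and `share_at_least(scores, 0.7)` is one minus the curve's height just below 0.7.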

Missing features in ‘names’ and ‘descriptions’

There are 875 missing ‘names’ and 12522 missing ‘descriptions’ in the dataset. As long as other features such as ‘types’ and ‘categories’ are available for the affected organisations, I believe we still have enough information to work with.
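Whether the affected organisations really have other features to fall back on can be verified directly. The sketch below assumes records are dicts keyed by the column names and that a missing value is None or an empty string. The helper names and the toy records are hypothetical.

```python
def missing_counts(records, fields):
    """Number of records with a missing value, per field."""
    return {f: sum(1 for r in records if not r.get(f)) for f in fields}

def lacks_fallback(records, primary, fallback=("types", "categories")):
    """Records that miss `primary` and also miss every fallback field."""
    return [
        r for r in records
        if not r.get(primary) and not any(r.get(f) for f in fallback)
    ]

# Toy records, not taken from the real dataset.
toy = [
    {"names": "Acme", "descriptions": None, "types": "Company", "categories": ""},
    {"names": None, "descriptions": None, "types": "", "categories": None},
    {"names": "Beta", "descriptions": "A firm.", "types": "Company", "categories": "Firms"},
]
counts = missing_counts(toy, ["names", "descriptions"])
stranded = lacks_fallback(toy, "descriptions")
```

If `lacks_fallback` returns an empty list on the real data, every organisation without a description can still be classified from ‘types’ or ‘categories’.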


Relevance of ‘locations’

A quick glance at the data suggests that ‘locations’ is not very relevant for classifying the industry label.


