Week 1. Data Understanding + Feature Extraction
We have been provided with a dataset of 277419 records extracted from DBpedia. It has 6 features and 1 label with 32 classes, as summarised in the table below:
| ID | Item Name | Item Type | Observations |
|----|-----------|-----------|--------------|
| 1 | org | Feature | URL; may be useful for extraction of additional features |
| 2 | names | Feature | Literal name of the organisation on Wikipedia |
| 3 | types | Feature | Suspect this is extracted from property “rdf:type” |
| 4 | descriptions | Feature | Extracted from property “dbo:abstract” |
| 5 | locations | Feature | Extracted from property “dbo:country” |
| 6 | categories | Feature | Extracted from property “dct:subject” |
| 7 | industries | Label | Suspect this is manually classified |
Initial Understanding of the Target Label “industries”
Of the 32 industry classes, just three account for two thirds of the dataset, while nearly half of the classes each make up less than 0.5% of all instances. We will have to keep this imbalance in mind, since it will certainly affect our choice of performance measures and models. For this case study in particular, where we are focusing on Type I and Type II errors, the F1 score and ROC curves spring to mind.
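A minimal sketch of how the class shares could be computed, using only the Python standard library. The function names `class_shares` and `rare_classes` are hypothetical, the `;` separator for records with several industries is an assumption (it mirrors the “Company;LawFirm” format seen in ‘types’), and the toy labels `A`, `B`, `C` are placeholders rather than real classes from the dataset.

```python
from collections import Counter


def class_shares(labels, sep=";"):
    """Return {class: share of labelled records}, largest share first.

    `labels` is an iterable of raw 'industries' strings. Empty or None
    values are treated as missing and skipped. The ';' separator is an
    assumption about how multiple industries are stored.
    """
    counts = Counter()
    n_labelled = 0
    for raw in labels:
        classes = {p.strip() for p in (raw or "").split(sep) if p.strip()}
        if not classes:
            continue  # missing label, not part of the denominator
        n_labelled += 1
        # A record with several industries counts once for each of them.
        counts.update(classes)
    return {cls: n / n_labelled for cls, n in counts.most_common()}


def rare_classes(shares, threshold=0.005):
    """List the classes whose share is below `threshold` (0.5% by default)."""
    return [cls for cls, share in shares.items() if share < threshold]


# Toy labels only; these are placeholders, not values from the dataset.
toy_labels = ["A", "A", "A;B", "B", "C", None]
shares = class_shares(toy_labels)
print(shares)
print(rare_classes(shares, threshold=0.25))
```

Because multi-label records count towards every class they carry, the shares can sum to more than 1; whether that is the right convention depends on how those records end up being handled.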
While 21874 companies have two or more industry categories, 17942 companies have no industry category at all. Among the companies with missing labels, I believe that those whose ‘types’ value is “Company;LawFirm” could be classified under “Justice_and_law”. I might need to drop the rest, unless I can derive the label from the existing data.
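The label checks described here could be sketched as follows. This is an illustration under assumptions, not the code used for the counts above: records are modelled as plain dicts keyed by the column names from the table, multiple industries are assumed to be separated by `;`, and the helper names `label_status` and `fill_law_firm_label` are hypothetical.

```python
def label_status(raw, sep=";"):
    """Classify a raw 'industries' value as 'missing', 'single' or 'multi'."""
    parts = [p.strip() for p in (raw or "").split(sep) if p.strip()]
    if not parts:
        return "missing"
    return "single" if len(parts) == 1 else "multi"


def fill_law_firm_label(record):
    """Return a copy of `record` with a proposed label for law firms.

    Only records with a missing label and a 'types' value of exactly
    'Company;LawFirm' are changed; everything else is returned as-is.
    """
    filled = dict(record)
    is_unlabelled = label_status(filled.get("industries")) == "missing"
    if is_unlabelled and filled.get("types") == "Company;LawFirm":
        filled["industries"] = "Justice_and_law"
    return filled


# Toy records only; the values are placeholders, not rows from the dataset.
toy_records = [
    {"types": "Company;LawFirm", "industries": None},
    {"types": "Company", "industries": ""},
    {"types": "Company", "industries": "X;Y"},
]
for rec in toy_records:
    print(label_status(rec["industries"]), fill_law_firm_label(rec)["industries"])
```

Records that are still `missing` after this step are the candidates for dropping.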
Initial Understanding of the Features
Similarities between ‘types’ and ‘categories’
A preliminary comparison of ‘types’ and ‘categories’ suggests that the two features are very similar. Using spaCy, I computed the similarity between them on a 1% sample of the records, and the results supported the idea. In the ECDF graph, about 60% of the similarity scores are 70% or higher.
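For readers unfamiliar with the ECDF, the sketch below shows how such a reading can be taken from a list of similarity scores. It assumes the scores have already been computed (with spaCy in the analysis above) and are scaled to the range 0 to 1; the functions `ecdf` and `share_at_least` are hypothetical helpers, and the four toy scores are made up purely for illustration.

```python
from bisect import bisect_left


def ecdf(values):
    """Return (sorted values, cumulative proportions) for plotting an ECDF."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def share_at_least(values, threshold):
    """Fraction of values that are greater than or equal to `threshold`."""
    xs = sorted(values)
    # bisect_left gives the number of values strictly below the threshold.
    return (len(xs) - bisect_left(xs, threshold)) / len(xs)


# Toy similarity scores, not taken from the dataset.
toy_scores = [0.2, 0.5, 0.7, 0.9]
xs, ps = ecdf(toy_scores)
print(list(zip(xs, ps)))
print(share_at_least(toy_scores, 0.7))
```

On an ECDF plot, the share of scores at or above a threshold is one minus the curve's height just below that threshold, which is what `share_at_least` computes directly.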
Missing features in ‘names’ and ‘descriptions’
There are 875 records with a missing ‘names’ value and 12522 with a missing ‘descriptions’ value. As long as other features such as ‘types’ and ‘categories’ are available for the affected organisations, I believe we still have enough information to work with.
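One way to check this assumption is to count the missing values per feature and then look for records that lack both fallback features. The sketch below again models records as plain dicts keyed by the column names; `is_missing`, `missing_counts` and `lacks_fallback` are hypothetical helpers, and treating empty or whitespace-only strings as missing is an assumption about how gaps appear in the data.

```python
def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or not str(value).strip()


def missing_counts(records, fields):
    """Count, for each field, how many records have no usable value."""
    return {f: sum(is_missing(r.get(f)) for r in records) for f in fields}


def lacks_fallback(record, fallback=("types", "categories")):
    """True when none of the fallback features is available for a record."""
    return all(is_missing(record.get(f)) for f in fallback)


# Toy records only; the values are placeholders, not rows from the dataset.
toy_records = [
    {"names": None, "descriptions": "text", "types": "Company", "categories": ""},
    {"names": "n1", "descriptions": " ", "types": None, "categories": None},
    {"names": "n2", "descriptions": "text", "types": "Company", "categories": "c"},
]
print(missing_counts(toy_records, ["names", "descriptions"]))
print([lacks_fallback(r) for r in toy_records])
```

Records flagged by `lacks_fallback` that also have no name or description are the ones with too little information to classify.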
Relevance of ‘locations’
A quick glance at the data suggests that ‘locations’ is not very relevant for classifying the industry label.