Monthly Challenge: https://www.datasciencesociety.net/events/text-mining-data-science-monthly-challenge/
Mentors’ Weekly Instructions: https://www.datasciencesociety.net/text-mining-data-science-monthly-challenge/
Real Business Problem
Classification of companies into industry sectors is a fundamental task for unlocking advanced business intelligence capabilities. However, different data sources rarely use the same classification system, if they use one at all. This is a major obstacle to taking advantage of the details available in Open Data and in niche commercial data sources, which often lack industry classifications or use inconsistent ones.
Industry information is often recorded as part of the initial manual data collection, and if no appropriate industry is available in the classification, the field is often left empty. On the other hand, when data is mined automatically, e.g. from a textual document, the company's industry is often missing from the original source and is therefore not assigned.
Industry classifications vary a lot, ranging from flat 15-class schemes to very complex taxonomies. The extremes are bottom-up, application-centric approaches built to collect and use data with little ambition for sharing and reuse, and top-down government or institutional statistics, which tend to be too abstract and too complex to be practically useful.
Besides explicit classifications made by experts, a number of other clues can point to a company’s particular line of business. Textual company descriptions, information about product lines and news mentions are all valid indicators of a company’s potential classification. The objective of this case is to leverage such clues in order to improve the existing explicit classification.
High-quality commercially available company data may be unaffordable for many data analytics and business analytics projects. Many of the niche but highly valuable data sources fall short on details about industry sector. At the same time, the amount of Open Data (official or crowdsourced) is growing, but it often lacks a standardized yet practical approach to industry classification.
In this case, we aim to develop an automated and standardized classification model that can be applied to any source to enrich the originally available data with industry sector information.
This is the presentation from the video. You’re probably better off watching the video to get an additional explanation but it is included here for completeness’ sake.
Why does the business need to solve the problem?
Standardized industry classification is an enabling feature for various data processing and advanced analytical tasks such as:
- Reconciliation of company records from different sources
- Measuring similarity between companies
- Calculating a company ranking score, e.g. the Popularity Rank of Global Banks
- etc.
Model
A (partial) description of the data model surrounding Wikipedia Organisations. You can use the GraphDB visualization tools to explore the whole model surrounding Organisations, but this is a description of the features that seem most salient to our task.
Monthly Challenge task
Ultimately, the task is to classify companies into industries using rich and complex graph-based features. However, given the constraints of the Monthly Challenge format, the task can be framed as an error/anomaly detection task. At its core, it is still a classification problem, but the output should not be the ultimate result; rather, it should be discussable qualitative results in the form of false positives and false negatives identified in the data. This should be possible because any classifier will learn the “essence” of a class, and afterwards one can easily look for the cases where the data and the classifier disagree the most (one minimal way to do this is sketched after the expected output below).
Expected Output
For each of the 32 top-level classes, we expect to see:
- 10 false positives from the data
- 10 false negatives from the data
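One minimal sketch of producing such lists, assuming you have already built a feature matrix X and an array y with the currently assigned top-level classes; the model choice and all names below are illustrative, not part of the case description.
# Sketch: surface the strongest data/classifier disagreements per top-level industry.
# X (feature matrix) and y (currently assigned classes, as a numpy array) are assumed to exist.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
classes = np.unique(y)
# Out-of-fold probabilities, so every company is scored by a model that never saw it during training.
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba")
for k, cls in enumerate(classes):
    has_label = (y == cls)
    p = proba[:, k]
    # Candidate false positives in the data: labelled with cls, but the model is least convinced.
    fp = np.where(has_label)[0][np.argsort(p[has_label])[:10]]
    # Candidate false negatives in the data: not labelled with cls, but the model is most convinced.
    fn = np.where(~has_label)[0][np.argsort(-p[~has_label])[:10]]
    print(cls, fp, fn)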
The Data
Simple features csv dump
In this folder, you will find enclosed a CSV file containing the “basic” features for all Wikipedia organisations. As discussed in the demonstration, these are not all the potentially valuable features, but they are a strong initial set to start developing your algorithm on, and potentially more than sufficient to create a strongly competitive solution. We call them basic simply because they are immediately adjacent to the organisation in the graph and don't require moving through multiple nodes.
In the folder, you will also find the current mapping of entities and literals to the normalized set of industries, as well as several statements defining the semantics of their relations.
You can find additional info in the GitLab project by Ontotext.
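A minimal sketch of loading that dump with pandas; the file name and the column names used below (e.g. industry) are assumptions, so check them against the actual CSV header.
# Sketch: load the "basic" features dump and take a first look at the label distribution.
import pandas as pd
df = pd.read_csv("organisations_basic_features.csv")  # assumed file name
print(df.columns.tolist())              # which features are actually present
print(df["industry"].value_counts())    # assumed label column; rough class balance over the top-level industries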
SPARQL Queries for additional features
A list of example queries that can be used to extract more complex features from the graph. These are not all the possible features, but they are promising ones, and the queries should give you a good idea of how you can extract other complex features should you wish to. First, let's go over how you can use those queries.
Through workbench
The FactForge workbench is available at http://factforge.net/sparql. You can just paste the query into the editor and run it straight away. You can also export the results of queries in any number of formats, including CSV, TSV, XML, Turtle, TriG, TriX, etc.
API
If you want to access the endpoint programmatically, read up on the API at http://factforge.net/webapi
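If you would rather skip a dedicated client, the /sparql endpoint also speaks the standard SPARQL 1.1 Protocol, so a plain HTTP client is enough; the snippet below is only a sketch under that assumption.
# Sketch: run a SPARQL query against FactForge over the standard SPARQL 1.1 Protocol.
import requests
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?org WHERE { ?org a dbo:Organisation . } LIMIT 10
"""
resp = requests.get(
    "http://factforge.net/sparql",
    params={"query": QUERY},
    headers={"Accept": "text/csv"},  # ask for CSV; other result formats work as well
    timeout=60,
)
resp.raise_for_status()
print(resp.text)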
RDF4J API
There is a client-side RDF4J implementation of the API. You can read up at http://graphdb.ontotext.com/free/using-graphdb-with-the-rdf4j-api.html or check a demo project using the API at https://gitlab.ontotext.com/trainings/demo-app
An example of the code for reading data from the endpoint is in https://gitlab.ontotext.com/trainings/demo-app/blob/master/src/main/java/com/ontotext/demonstrators/training/services/GenericService.java
Sample Queries
Now let's look at some specific queries. You will notice most of them have limits in one or more locations. These make it easier to develop the query and to make sure it is giving you the information in the desired format, but they should be removed when extracting the full dataset. Some of these queries can be quite slow to complete, so be careful when removing limits.
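When you do remove an inner limit, it is usually safer to page through the organisations in chunks (some of the queries below already use limit/offset for exactly this reason) than to pull everything in a single request. A sketch of that pattern, reusing the plain-HTTP access shown above; the page size is an arbitrary assumption.
# Sketch: extract a large result set in chunks by paging the organisation subselect with LIMIT/OFFSET.
import requests
TEMPLATE = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?org WHERE {{ ?org a dbo:Organisation . }} LIMIT {limit} OFFSET {offset}
"""
rows, offset, page = [], 0, 1000
while True:
    resp = requests.get(
        "http://factforge.net/sparql",
        params={"query": TEMPLATE.format(limit=page, offset=offset)},
        headers={"Accept": "text/csv"},
        timeout=300,
    )
    resp.raise_for_status()
    chunk = resp.text.strip().splitlines()[1:]  # drop the CSV header line
    if not chunk:
        break
    rows.extend(chunk)
    offset += page
print(len(rows), "organisations fetched")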
isMemberOf
A property introducing a hierarchy of organisations.
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dul: <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>
PREFIX dbr: <http://dbpedia.org/resource/>
select ?org ?org_parent (group_concat(?ind_parent_str ; separator=";") as ?industries_parent)
where {
?org a dbo:Organisation ; dul:isMemberOf ?org_parent .
?org_parent a dbo:Organisation .
optional {
?org_parent ff-map:hasTopLevelIndustry ?ind_parent .
bind(replace(str(?ind_parent), str(dbr:), "") as ?ind_parent_str)
}
} group by ?org ?org_parent limit 100
Click here to run query on FactForge
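Once you export such a result (e.g. as CSV from the workbench), the ?industries_parent column packs several values into one ";"-separated string. A hedged sketch of unpacking it into binary indicator features; the file name is an assumption and the column names mirror the SELECT above.
# Sketch: expand the ";"-separated industries_parent column into binary indicator columns.
import pandas as pd
df = pd.read_csv("is_member_of.csv")  # assumed export of the isMemberOf query
indicators = df["industries_parent"].fillna("").str.get_dummies(sep=";")
features = pd.concat([df[["org", "org_parent"]], indicators.add_prefix("parent_ind_")], axis=1)
print(features.head())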
News Bodies
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>
PREFIX dbo: <http://dbpedia.org/ontology/>
select *
where {
{
select ?org {
?org a dbo:Organisation .
} limit 1000
}
?news a pubo:Document ; ff-map:mentionsEntity ?org ; pubo:title ?title ; pubo:content ?body .
OPTIONAL { ?news pubo:source ?source . }
} limit 100
Click here to run query on FactForge
News Entity Co-occurrence
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>
PREFIX dbo: <http://dbpedia.org/ontology/>
select ?org ?e2 (count(distinct ?news) as ?cnt)
where {
{
# This subselect is just here so you can work with chunks of that if necessary. Remove the limit to get all orgs.
select ?org {
?org a dbo:Organisation .
} limit 100
}
?news a pubo:Document ; ff-map:mentionsEntity ?org, ?e2 .
# We limit the other entity to the Person, Location and Organisation types. You can easily change this list.
?e2 a ?t2 . filter(?t2 in (dbo:Person, dbo:Location, dbo:Organisation))
filter(?e2 != ?org)
} group by ?org ?e2 limit 100
[Click here to run query on FactForge](http://factforge.net/sparql?savedQueryName=Z%20-%20DT18%20-%20News%20Entity%20Co-occurrence&owner=admin)
WikiLink Link Text
Collects the text of all outgoing links from the Organisation’s article.
PREFIX dbo: <http://dbpedia.org/ontology/>
select ?org (group_concat(?text ; separator=", ") as ?links)
where {
?org a dbo:Organisation ; dbo:wikiPageWikiLinkText ?text .
} group by ?org limit 100
Click here to run query on FactForge
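The concatenated link texts make a natural text feature for a classifier. A minimal sketch of vectorising them with scikit-learn; the file name is an assumption and the column names mirror the SELECT above.
# Sketch: turn the concatenated wiki-link texts into TF-IDF features.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.read_csv("wikilink_link_text.csv")  # assumed export of the WikiLink Link Text query
vectorizer = TfidfVectorizer(max_features=5000)
X_links = vectorizer.fit_transform(df["links"].fillna(""))
print(X_links.shape)  # (number of organisations, vocabulary size)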
WikiLink Article Bodies
Collects the bodies of all articles pointed to by links outgoing from the Organisation’s article.
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select distinct ?org ?page ?name ?abstract
where {
{
select ?org {
?org a dbo:Organisation .
} limit 1000
}
# The way this query is structured will duplicate a lot of text.
# In practice, you will probably want one file with page names and bodies
# and a second file with mappings from orgs to linked pages.
?org dbo:wikiPageWikiLink ?page .
OPTIONAL { ?page rdfs:label ?name . filter(lang(?name) = "en") }
?page dbo:abstract ?abstract .
} limit 100
Click here to run query on FactForge
WikiLink Entity Connections
Collects the IRIs of all PLO articles mentioned in articles pointed to by links outgoing from the Organisation’s article.
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?org ?e2 (count(distinct ?page) as ?cnt)
where {
{
select ?org {
?org a dbo:Organisation .
} limit 100
}
?org dbo:wikiPageWikiLink ?page .
OPTIONAL { ?page rdfs:label ?name . filter(lang(?name) = "en") }
# We are combining two ways for wiki pages to express a connection.
# There are other ways too (e.g. ?page dbo:wikiPageWikiLink ?e2) but
# we are limiting ourselves a bit. You can play around with different ways to express this.
{ ?page dct:subject ?e2 . } UNION { ?e2 dbo:wikiPageWikiLink ?page . }
?e2 a ?t2 .
filter(?t2 in (dbo:Person, dbo:Location, dbo:Organisation))
} group by ?org ?e2 limit 100
Click here to run query on FactForge
Varying levels of abstraction based on GeoNames
Fetches the parent features of the location of an organisation.
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
PREFIX onto: <http://www.ontotext.com/>
PREFIX geodata: <http://sws.geonames.org/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gn: <http://www.geonames.org/ontology#>
select *
FROM onto:disable-sameAs
where {
{
select ?org {
?org a dbo:Organisation .
} limit 1000 offset 50000 #limit for demo
}
?org dbo:city ?city .
?city gn:name ?name .
optional{
?city gn:parentFeature ?par1 .
?par1 gn:name ?par1_name .
optional {
?par1 gn:parentFeature ?par2.
?par2 gn:name ?par2_name .
optional {
?par2 gn:parentFeature ?par3.
?par3 gn:name ?par3_name .
}
}
}
}
Click here to run query on FactForge
Collect organisation country and top-level administrative region
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
PREFIX onto: <http://www.ontotext.com/>
PREFIX geodata: <http://sws.geonames.org/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gn: <http://www.geonames.org/ontology#>
select distinct ?org ?adm1 ?country
FROM onto:disable-sameAs
where {
{
select ?org {
?org a dbo:Organisation .
} limit 100 offset 1000 #limit for demo
}
?org dbo:city ?city .
?city gn:name ?name .
optional {
?city gn:parentFeature* ?adm1 .
?adm1 gn:featureCode gn:A.ADM1 ; gn:name ?adm1_name .
}
optional {
?city gn:parentFeature* ?country .
?country gn:featureCode ?countryCode ; gn:name ?country_name .
filter(?countryCode in (gn:A.PCL, gn:A.PCLI))
}
}
Click here to run query on FactForge
Parent Subsidiary relations
Fetches all the industry classifications of all the subsidiaries of an organisation.
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>
PREFIX onto: <http://www.ontotext.com/>
PREFIX geodata: <http://sws.geonames.org/>
PREFIX dbr: <http://dbpedia.org/resource/>
select ?org
(group_concat(distinct ?ind_str;separator=";") as ?inds)
(group_concat(distinct ?sub_ind_str;separator=";") as ?sub_inds)
FROM onto:disable-sameAs
where {
bind(dbr:Microsoft as ?org)
?org ff-map:hasTopLevelIndustry ?ind .
bind(replace(str(?ind),str(dbr:),"") as ?ind_str )
?org dbo:subsidiary ?sub .
filter(?org != ?sub)
?sub ff-map:hasTopLevelIndustry ?sub_ind .
bind(replace(str(?sub_ind),str(dbr:),"") as ?sub_ind_str )
} group by ?org
Click here to run query on FactForge
Using the industry category mappings to generate relevant text/keywords
Retrieve all the labels of the entities mapped to a particular industry
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbr: <http://dbpedia.org/resource/>
select ?ind (group_concat(distinct ?label;separator=";") as ?labels) {
?ind a ff-map:topLevelIndustry .
values ?alias_prop {ff-map:industrySubsector ff-map:industryVariant ff-map:industryDuplicate}
?var ?alias_prop ?ind .
values ?label_prop {dbo:wikiPageWikiLinkText rdfs:label}
?var ?label_prop ?label .
} group by ?ind
Click here to run query on FactForge
Generate an insane list of all the wikipages pointing to a given industry category (or one of its aliases)
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbr: <http://dbpedia.org/resource/>
select ?ind (group_concat(distinct ?label;separator=";") as ?labels) {
bind(dbr:Information_technology as ?ind)
?ind a ff-map:topLevelIndustry .
values ?alias_prop {ff-map:industrySubsector ff-map:industryVariant ff-map:industryDuplicate}
?var ?alias_prop ?ind .
?res ?p ?var .
values ?label_prop {dbo:wikiPageWikiLinkText rdfs:label}
?res ?label_prop ?label .
} group by ?ind
Click here to run query on FactForge
Contact
If you have questions, comments or a successful solution, please write in the Data.Chat to @nikola, @gloria, @boryana or @andreytagarev.