
Datathon winner - Revealing hidden links through open data


From 24 to 26 March, Data Science Society organized a Datathon, the first-of-its-kind data analysis competition in Central and Eastern Europe. The event was held on the premises of Software University in Sofia with the support of partner companies and organisations such as Kaufland, Telenor, Experian, HyperScience, ReceiptBank, SAP, ShopUp, A4E, GemSeek, Ontotext, Helecloud, VMware, NSI and Open Government from the Council of Ministers.

The Data Science Society team and the partner companies provided various business cases in the field of data science, posing challenges that the participants set out to solve in less than 48 hours.

 

At the end of the event, 16 teams presented the results of a weekend of work. Over the next few months Data Science Society plans to present each team, with its solutions and results, at its regular online and offline events.

 

The first such event is planned for late May and will feature the winning team: Iva Delcheva, Nikolay Petrov, Yasen Kiprov, and Viktor Senderov. They worked on two cases: (1) the Bulgarian Commercial Register task, to RDF-ize publicly accessible data about companies in Bulgaria (a case provided by Ontotext), and (2) the analysis and exploration of data about public procurements in Bulgaria (a case provided by the Open Data portal of the Bulgarian government). They decided to join the two datasets, generating a Linked Open Data dataset in RDF that could then be queried and analyzed. Below is a summary by Viktor Senderov of how they accomplished this with the help of GraphDB.

 

The Bulgarian Commercial Register (Търговски регистър) is available online as a set of XML files and it covers deeds from 2008 onwards. A deed is a legal term describing the entering into the register of data pertaining to a company such as address, managers, legal status, etc. The data for one company is distributed amongst several deeds and needs to be aggregated. Ontotext recognized the need and offered both a data model for commercial register data and a Java program to RDF-ize the data (see Fig. 1).

 


Fig. 1 Simplified data model for the Bulgarian Commercial Register.

 

In addition to the model, mentors suggested a scheme for issuing URIs to companies that allows for easy merging of the data. Each company gets the identifier `:Company_UIC`, where UIC is the Unified Identification Code of the company.
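The identifier scheme can be illustrated with a minimal Python sketch. The namespace is taken from the `:` prefix of the SPARQL query in this article; the function name `company_uri`, the whitespace stripping, the digits-only check, and the sample UIC are assumptions made for illustration, not the mentors' code.

```python
import re

# Namespace taken from the `:` prefix of the SPARQL query in this article.
NAMESPACE = "http://datathon.com#"


def company_uri(uic: str) -> str:
    """Build a company URI following the :Company_UIC scheme (hypothetical helper)."""
    # Assumption: the same UIC may arrive with stray whitespace from different sources.
    cleaned = uic.strip()
    # Assumption: a UIC consists of digits only.
    if not re.fullmatch(r"\d+", cleaned):
        raise ValueError(f"not a numeric UIC: {uic!r}")
    return f"{NAMESPACE}Company_{cleaned}"


# "123456789" is a made-up UIC used only for illustration.
register_uri = company_uri("123456789")        # as seen in one dataset
procurement_uri = company_uri(" 123456789 ")   # as seen in another dataset
print(register_uri)
print(register_uri == procurement_uri)
```

Because both datasets derive the URI from the UIC alone, statements about the same company from different sources end up attached to the same node once loaded into one repository.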

 

Data on Bulgarian public procurement from the Public Procurement Agency was also made available. It covers the period from 2007 to mid-2016. The data is in CSV format, with columns for the principal and the contractors of the procurement, the procurement objective, the value and its currency, etc. (the SPARQL query later in this article spells the corresponding property `:principle`). The team modeled these entities as in Fig. 2.

 


Fig. 2. Simplified data model for the public procurements.

 

RDF-ization of this CSV set can be done with OpenRefine, which is included in GraphDB, but in this case it was done with a custom-made R script that Viktor Senderov had written for a bioinformatics project. By using the same identification scheme for the companies participating in public procurement as for the Commercial Register, the team was able to link the two datasets.
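To make the CSV-to-RDF step concrete, here is a minimal Python sketch. It is not the team's R script. The class and property names come from the SPARQL query in this article; the CSV column names, the `:Procurement_<id>` naming, the choice of `xsd:decimal` for the value and of a plain literal for the currency, and the sample row are all assumptions. Literal escaping is omitted for brevity.

```python
import csv
import io

PREFIXES = (
    "@prefix : <http://datathon.com#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)

# Made-up sample row; the column names are hypothetical.
SAMPLE_CSV = """procurement_id,principal_uic,contractor_uic,value,currency
42,111111111,222222222,1500.00,BGN
"""


def row_to_turtle(row: dict) -> str:
    """Serialize one CSV row as a Turtle description of a procurement."""
    proc = ":Procurement_" + row["procurement_id"]      # hypothetical naming scheme
    principal = ":Company_" + row["principal_uic"]      # the :Company_UIC scheme
    contractor = ":Company_" + row["contractor_uic"]
    value = row["value"]
    currency = row["currency"]
    return (
        f"{proc} a :GovernmentProcurement ;\n"
        f"    :principle {principal} ;\n"
        f"    :contractor {contractor} ;\n"
        f'    :currencyValue "{value}"^^xsd:decimal ;\n'
        f'    :currency "{currency}" .\n'
    )


def csv_to_turtle(text: str) -> str:
    """Convert CSV text into a Turtle document, one procurement per row."""
    reader = csv.DictReader(io.StringIO(text))
    return PREFIXES + "\n" + "\n".join(row_to_turtle(row) for row in reader)


turtle = csv_to_turtle(SAMPLE_CSV)
print(turtle)
```

The key design point is that the principal and the contractor are emitted as `:Company_UIC` resources rather than as name strings, so they coincide with the company nodes produced from the Commercial Register.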

 

In addition to these two major datasets, they interlinked companies to their geo-coordinates by utilizing the Google API.

 

The resulting RDF dataset, a set of Turtle files, was uploaded to a GraphDB 8 installation running on the Amazon cloud. The size of the uploaded data is approximately 12.5 million triples (more than 2 GB of uncompressed data). The data was not only aggregated and formatted for easy querying, but also connected to previously disconnected information. One interesting question that can be explored with this linked dataset is the question about conflicts of interest.

A conflict of interest may arise if a person A managing a government entity is also a related party (such as, for example, owner) of a private contractor of the government entity. The SPARQL query answering this question is:

 

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX : <http://datathon.com#>
SELECT ?p_name ?p ?proc ?c_name ?c ?c2_name ?c2 ?val ?cur
WHERE {
    # A procurement with its principal, contractor, value and currency.
    ?proc a :GovernmentProcurement ;
          :principle ?c ;
          :contractor ?c2 ;
          :currencyValue ?val ;
          :currency ?cur .
    # The same person ?p influences both the principal and the contractor.
    ?c  :hasInfluencingPerson ?p ;
        skos:prefLabel ?c_name .
    ?c2 :hasInfluencingPerson ?p ;
        skos:prefLabel ?c2_name .
    ?p skos:prefLabel ?p_name .
    FILTER (?c != ?c2)
    # Skip "persons" whose name contains municipality, the state, or ministry.
    FILTER NOT EXISTS {
        ?p skos:prefLabel ?p_name .
        FILTER (regex(lcase(?p_name), "(.*община.*)|(.*държавата.*)|(.*министерство.*)"))
    }
} ORDER BY DESC(?val)

 

The query above returns at least 455 potential issues for a total of 348,468,109 Bulgarian leva in sectors such as energy and forestry (Fig. 3).

 


Fig. 3. Query results
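A total like the one quoted above can be derived from the query results. The sketch below assumes the results were fetched in the standard SPARQL JSON results format and sums `?val` per `?cur`, counting each procurement once even when it appears in several rows (for example, because more than one person links the two parties). The function name and the sample bindings are made up for illustration; how the article's own total was computed is not stated.

```python
from collections import defaultdict
from decimal import Decimal


def total_by_currency(results: dict) -> dict:
    """Sum ?val per ?cur, counting each ?proc only once (hypothetical helper)."""
    seen = set()
    totals = defaultdict(Decimal)  # Decimal() is zero
    for binding in results["results"]["bindings"]:
        proc = binding["proc"]["value"]
        if proc in seen:
            continue  # same procurement already counted via another row
        seen.add(proc)
        totals[binding["cur"]["value"]] += Decimal(binding["val"]["value"])
    return dict(totals)


# Made-up bindings in the SPARQL JSON results layout; not real data.
sample = {
    "head": {"vars": ["proc", "val", "cur"]},
    "results": {"bindings": [
        {"proc": {"type": "uri", "value": "http://datathon.com#Procurement_1"},
         "val": {"type": "literal", "value": "1000"},
         "cur": {"type": "literal", "value": "BGN"}},
        {"proc": {"type": "uri", "value": "http://datathon.com#Procurement_1"},
         "val": {"type": "literal", "value": "1000"},
         "cur": {"type": "literal", "value": "BGN"}},
        {"proc": {"type": "uri", "value": "http://datathon.com#Procurement_2"},
         "val": {"type": "literal", "value": "500"},
         "cur": {"type": "literal", "value": "BGN"}},
    ]},
}

totals = total_by_currency(sample)
print(totals)
```

Using `Decimal` rather than floats avoids rounding drift when many monetary values are added together.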

 

 

The team approached the case as data scientists and does not have the legal or commercial expertise to interpret this large number of potential conflicts of interest. It may in fact be the case that none of these potential conflicts of interest is illegal or even unethical. The team does believe, however, that someone with expertise in legal and commercial matters may benefit from using this linked dataset. This person could be an investigative journalist, a public representative or simply a concerned citizen.

 

In another example, based on the Commercial Register data, one can do a “board walk”, i.e. jump between companies that share board members, and discover cliques of companies. This reveals that certain individuals sit on the boards of many dozens of companies. Could this be done for tax evasion? Another idea the team had is to see which persons are most successful in receiving EU funds. For this task, one would have to add more publicly available information about EU procurements to the dataset. The possibilities seem endless.
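One way to picture the board walk is as a breadth-first traversal over (company, person) pairs. The sketch below is an in-memory Python illustration, not the team's implementation; the function names and the toy data are invented. It groups companies that are reachable from one another through shared board members, which is a looser notion than a clique in the strict graph-theoretic sense.

```python
from collections import defaultdict, deque


def company_groups(memberships):
    """Group companies connected through shared board members.

    memberships: iterable of (company, person) pairs.
    """
    boards = defaultdict(set)  # company -> persons on its board
    seats = defaultdict(set)   # person -> companies where they sit
    for company, person in memberships:
        boards[company].add(person)
        seats[person].add(company)

    seen = set()
    groups = []
    for start in list(boards):
        if start in seen:
            continue
        seen.add(start)
        group = []
        queue = deque([start])
        while queue:
            company = queue.popleft()
            group.append(company)
            # "Walk" to every company that shares a board member.
            for person in boards[company]:
                for other in seats[person]:
                    if other not in seen:
                        seen.add(other)
                        queue.append(other)
        groups.append(sorted(group))
    return groups


def seat_counts(memberships):
    """Count the number of distinct boards each person sits on."""
    seats = defaultdict(set)
    for company, person in memberships:
        seats[person].add(company)
    return {person: len(companies) for person, companies in seats.items()}


# Toy data, invented for illustration only.
toy = [("A", "p1"), ("B", "p1"), ("B", "p2"), ("C", "p2"), ("D", "p3")]
print(company_groups(toy))
print(seat_counts(toy))
```

In the linked dataset the same pairs could be pulled from the triple store, for instance through the `:hasInfluencingPerson` property used in the query above, and sorting the seat counts would surface the individuals who sit on the most boards.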


In all of its efforts the team used only publicly available data. However, by putting the data into a database the team has vastly increased its searchability and usefulness, which is in the interest of society. The effort required to go from loosely structured public data posted online to Linked Open Data stored in a database is worth it, given the public service. The dataset will be put online in the near future, and the team will gladly answer questions pertaining to its use.

Prepared by Viktor Senderov and Angel Marchev.
