From 24 to 26 March, the Data Science Society organized a Datathon, the first data analysis competition of its kind in Central and Eastern Europe. The event was held on the premises of Software University in Sofia with the support of partner companies and organisations such as Kaufland, Telenor, Experian, HyperScience, ReceiptBank, SAP, ShopUp, A4E, GemSeek, Ontotext, Helecloud, VMware, NSI and Open Government from the Council of Ministers.
Together with the partner companies, the Data Science Society team provided various business cases in the field of data science, challenging the participants to solve them in less than 48 hours.
At the end of the event, 16 teams presented the results of their weekend of work. Over the next few months, the Data Science Society plans to present each team, with its solutions and results, at its regular online and offline events.
The first such event is planned for late May and will feature the winning team: Iva Delcheva, Nikolay Petrov, Yasen Kiprov, and Viktor Senderov. They worked on both cases: (1) the Bulgarian Commercial Register task, to RDF-ize publicly accessible data about companies in Bulgaria (a case provided by Ontotext), and (2) analyzing and exploring data about public procurements in Bulgaria (a case provided by the Open Data portal of the Bulgarian government). They decided to join the two datasets, generating a Linked Open Data dataset in RDF that could then be queried and analyzed. Below is a summary by Viktor Senderov of how they accomplished this task with the help of GraphDB.
The Bulgarian Commercial Register (Търговски регистър) is available online as a set of XML files and it covers deeds from 2008 onwards. A deed is a legal term describing the entering into the register of data pertaining to a company such as address, managers, legal status, etc. The data for one company is distributed amongst several deeds and needs to be aggregated. Ontotext recognized the need and offered both a data model for commercial register data and a Java program to RDF-ize the data (see Fig. 1).
Fig. 1. Simplified data model for the Bulgarian Commercial Register.
In addition to the model, the mentors suggested a scheme for issuing URIs to companies that allows the data to be merged easily. The identifier each company gets is `:Company_UIC`, where UIC is the Unified Identification Code of the company.
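As an illustration of why a shared identifier scheme makes merging easy, here is a minimal Python sketch that builds a company URI from a UIC. The base namespace `http://example.org/resource/` and the function name `company_uri` are assumptions made for this example; the article only specifies the `:Company_UIC` pattern. The whitespace clean-up is likewise an assumption about how raw values might need to be normalized.

```python
# Hypothetical namespace: the article gives only the local name pattern ":Company_UIC".
BASE = "http://example.org/resource/"


def company_uri(uic: str, base: str = BASE) -> str:
    """Build a company URI following the ':Company_UIC' pattern."""
    cleaned = uic.strip()  # assumption: raw values may carry stray whitespace
    if not cleaned:
        raise ValueError("empty UIC")
    return f"{base}Company_{cleaned}"


# The same UIC seen in two different sources yields the same URI,
# so statements about the company from both sources attach to one node.
uri_from_register = company_uri("123456789")
uri_from_procurements = company_uri(" 123456789 ")
print(uri_from_register == uri_from_procurements)
```

Because the URI depends only on the UIC, no lookup table or fuzzy name matching is needed to link records about the same company.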
Data on Bulgarian public procurement from the Public Procurement Agency was also made available. It covers the period from 2007 to the middle of 2016. The data is in CSV format and has columns for the principal and the contractors of the procurement, the procurement objective, the value in a given currency, etc. The team modeled these entities as in Fig. 2.
Fig. 2. Simplified data model for the public procurements.
RDF-ization of this CSV set can be done with OpenRefine, which is included in GraphDB, but in this case it was done with a custom-made R script that Viktor Senderov had originally written for a bioinformatics project. By using the same identification scheme for the companies participating in public procurement as for the Commercial Register, they were able to link the two datasets.
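To make the RDF-ization step concrete, the following Python sketch converts CSV rows into Turtle. It is not the team's R script. The column names (`principal_uic`, `contractor_uic`, `objective`, `value`), the predicates and the namespace are all hypothetical, chosen only to mirror the entities mentioned above.

```python
import csv
import io

# Hypothetical namespace; the real dataset's prefix is not given in the article.
PREFIX = "@prefix : <http://example.org/resource/> .\n"


def turtle_literal(text: str) -> str:
    """Quote a string as a Turtle literal, escaping backslashes, quotes and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def rows_to_turtle(csv_text: str) -> str:
    """Turn procurement rows into Turtle, reusing the ':Company_UIC' scheme."""
    blocks = [PREFIX]
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        blocks.append(
            f":Procurement_{i} :principal :Company_{row['principal_uic'].strip()} ;\n"
            f"    :contractor :Company_{row['contractor_uic'].strip()} ;\n"
            f"    :objective {turtle_literal(row['objective'])} ;\n"
            f"    :value {float(row['value']):.2f} .\n"
        )
    return "\n".join(blocks)


# Invented sample row, for illustration only.
sample = (
    "principal_uic,contractor_uic,objective,value\n"
    "000000001,000000002,Road repair,1500.5\n"
)
turtle = rows_to_turtle(sample)
print(turtle)
```

The key design point is that both the principal and the contractor are written as `:Company_UIC` resources, so they resolve to the same nodes as the companies from the Commercial Register.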
In addition to these two major datasets, they interlinked companies to their geo-coordinates by utilizing the Google API.
The resulting RDF dataset, a set of Turtle files, was uploaded to a GraphDB 8 installation running on the Amazon cloud. The size of the uploaded data is approximately 12.5 million triples (more than 2 GB of uncompressed data). The data was not only aggregated and formatted for easy querying, but also connected to previously disconnected information. One interesting question that can be explored with this linked dataset concerns conflicts of interest.
A conflict of interest may arise if a person A managing a government entity is also a related party (for example, an owner) of a private contractor of that government entity. The SPARQL query answering this question is:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
The query above returns at least 455 potential issues for a total of 348,468,109 Bulgarian Leva in sectors such as energy and forestry (Fig. 3).
Fig. 3. Query results
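The conflict-of-interest condition described above can also be illustrated outside SPARQL. The Python sketch below is not the team's query. It restates the same condition over small in-memory structures, and every name and figure in it is invented for the example.

```python
def potential_conflicts(contracts, managers, related_parties):
    """Find contracts where a manager of the principal is a related party of the contractor.

    contracts:       list of (principal, contractor, value) tuples
    managers:        dict mapping a government entity to the set of people managing it
    related_parties: dict mapping a company to the set of people related to it (e.g. owners)
    """
    hits = []
    for principal, contractor, value in contracts:
        shared = managers.get(principal, set()) & related_parties.get(contractor, set())
        for person in sorted(shared):
            hits.append((person, principal, contractor, value))
    return hits


# Invented toy data, for illustration only.
managers = {"Agency X": {"Person A", "Person B"}}
related_parties = {"Company 1": {"Person A"}, "Company 2": {"Person C"}}
contracts = [
    ("Agency X", "Company 1", 1000.0),
    ("Agency X", "Company 2", 2500.0),
]

hits = potential_conflicts(contracts, managers, related_parties)
total_value = sum(value for _, _, _, value in hits)
print(hits, total_value)
```

In the linked dataset the same join happens across graph edges: a person node connected both to a government entity and to one of that entity's contractors.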
The team approached the case as data scientists and does not have the legal or commercial expertise to interpret this large number of potential conflicts of interest. It may in fact be the case that none of these potential conflicts of interest is illegal or even unethical. The team does believe, however, that someone with expertise in legal and commercial matters may benefit from using this linked dataset. This person could be an investigative journalist, a public representative or simply a concerned citizen.
In another example, based on the Commercial Register data, one can do a “board walk”, i.e. jump between companies that share board members and discover cliques of companies. This leads to the finding that certain individuals sit on the boards of many dozens of companies. Could this be done for tax evasion? Another idea the team had is to see which persons are most successful in receiving EU funds. For this task, one would have to add more publicly available information about EU procurements to the dataset. So the possibilities seem endless.
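The board walk described above can be sketched as a breadth-first traversal. This Python example is an illustration under assumptions, not the team's implementation: the input is a hypothetical mapping from company to board members, and the result is the group of companies reachable through shared members (a connected component, which is a looser notion than a strict graph clique).

```python
from collections import defaultdict, deque


def board_walk(boards, start):
    """Return all companies reachable from `start` by hopping over shared board members."""
    # Invert the mapping: person -> companies on whose boards they sit.
    companies_of = defaultdict(set)
    for company, members in boards.items():
        for person in members:
            companies_of[person].add(company)

    seen = {start}
    queue = deque([start])
    while queue:
        company = queue.popleft()
        for person in boards.get(company, ()):
            for neighbour in companies_of[person]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return seen


def board_counts(boards):
    """Count how many boards each person sits on."""
    counts = defaultdict(int)
    for members in boards.values():
        for person in members:
            counts[person] += 1
    return dict(counts)


# Invented toy data, for illustration only.
boards = {
    "Company A": {"p1", "p2"},
    "Company B": {"p2", "p3"},
    "Company C": {"p3"},
    "Company D": {"p4"},
}
group = board_walk(boards, "Company A")
counts = board_counts(boards)
print(sorted(group), counts)
```

Sorting `board_counts` by value would surface the individuals who sit on unusually many boards.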
Throughout this work, the team used only publicly available data. However, by putting the data into a database, the team has vastly increased its searchability and usefulness, which is in the interest of society. The effort required to go from loosely structured public data posted online to Linked Open Data stored in a database is worth it, given the public service. The dataset will be put online in the near future, and the team will gladly answer questions pertaining to its use.
Prepared by Viktor Senderov and Angel Marchev.