Datathon winner – Revealing hidden links through open data

From 24 to 26 March, Data Science Society organized a Datathon, the first data analysis competition of its kind in Central and Eastern Europe. The event was held on the premises of Software University in Sofia with the support of partner companies and organisations such as Kaufland, Telenor, Experian, HyperScience, ReceiptBank, SAP, ShopUp, A4E, GemSeek, Ontotext, Helecloud, VMware, NSI and Open Government from the Council of Ministers.

The Data Science Society team and the partner companies provided various business cases in the field of data science, which the participants set out to solve in less than 48 hours.


At the end of the event, 16 teams presented the results of their weekend of work. Over the next few months, Data Science Society plans to present each team, with its solutions and results, at its regular online and offline events.


The first such event is planned for late May and will feature the winning team: Iva Delcheva, Nikolay Petrov, Yasen Kiprov, and Viktor Senderov. They worked on both cases: (1) the Bulgarian Commercial Register task, to RDF-ize publicly accessible data about companies in Bulgaria (a case provided by Ontotext), and (2) the analysis and exploration of data about public procurements in Bulgaria (a case provided by the Open Data portal of the Bulgarian government). They decided to join the two datasets, generating a Linked Open Data dataset in RDF that they could then query and analyze. Below is a summary, by Viktor Senderov, of how they accomplished this task with the help of GraphDB.


The Bulgarian Commercial Register (Търговски регистър) is available online as a set of XML files and it covers deeds from 2008 onwards. A deed is a legal term describing the entering into the register of data pertaining to a company such as address, managers, legal status, etc. The data for one company is distributed amongst several deeds and needs to be aggregated. Ontotext recognized the need and offered both a data model for commercial register data and a Java program to RDF-ize the data (see Fig. 1).


Fig. 1. Simplified data model for the Bulgarian Commercial Register.


In addition to the model, the mentors suggested a scheme for issuing URIs to companies that allows the data to be merged easily. Each company gets the identifier `:Company_UIC`, where UIC is the Unified Identification Code of the company.


Data on Bulgarian public procurement was made available by the Public Procurement Agency. It covers the period from 2007 to the middle of 2016. The data is in CSV format and has columns for the principal and the contractors of the procurement, the procurement objective, the value in a given currency, etc. The team modeled these entities as in Fig. 2.



Fig. 2. Simplified data model for the public procurements.


RDF-ization of this CSV set can be done with OpenRefine, which is included in GraphDB, but in the team's case it was done with a custom R script that Viktor Senderov had written for a bioinformatics project. By using the same identification scheme for the companies participating in public procurement as for the Commercial Register, they were able to link the two datasets.
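The linking step can be illustrated with a minimal Python sketch. It is not the team's R script. The column names (`id`, `principal_uic`, `contractor_uic`, `value`, `currency`), the `:Procurement_<id>` naming, the `http://example.org/datathon/` namespace and the digits-only UIC check are all assumptions made for illustration. The property names, including the spelling `:principle`, are taken from the SPARQL query in this article.

```python
import csv
import io
from decimal import Decimal

# Hypothetical namespace: the article does not give the real prefix IRI.
PREFIXES = (
    "@prefix : <http://example.org/datathon/> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)


def company_name(uic: str) -> str:
    """Build the local name Company_<UIC> shared by both datasets."""
    uic = uic.strip()
    if not uic.isdigit():  # assumption: a UIC consists of digits only
        raise ValueError(f"unexpected UIC: {uic!r}")
    return f"Company_{uic}"


def procurement_to_turtle(row: dict) -> str:
    """Serialize one CSV row as a Turtle description of a procurement."""
    # Decimal rejects non-numeric amounts; "f" forces plain decimal notation.
    value = format(Decimal(row["value"]), "f")
    return (
        f":Procurement_{row['id'].strip()} a :GovernmentProcurement ;\n"
        f"    :principle :{company_name(row['principal_uic'])} ;\n"
        f"    :contractor :{company_name(row['contractor_uic'])} ;\n"
        f'    :currencyValue "{value}"^^xsd:decimal ;\n'
        f'    :currency "{row["currency"].strip()}" .\n'
    )


def csv_to_turtle(text: str) -> str:
    """Convert a whole CSV document (header row included) to Turtle."""
    rows = csv.DictReader(io.StringIO(text))
    return PREFIXES + "\n" + "\n".join(procurement_to_turtle(r) for r in rows)
```

Because both the procurement data and the Commercial Register data refer to a company as `:Company_<UIC>`, loading the two Turtle outputs into the same repository merges the descriptions of each company without any further matching step.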


In addition to these two major datasets, they interlinked companies to their geo-coordinates by utilizing the Google API.


The resulting RDF dataset, a set of Turtle files, was uploaded to a GraphDB 8 installation running on the Amazon cloud. The uploaded data amounts to approximately 12.5 million triples (more than 2 GB of uncompressed data). The data was not only aggregated and formatted for easy querying, but also linked to information from which it had previously been disconnected. One interesting question that can be explored with this linked dataset is that of conflicts of interest.

A conflict of interest may arise if a person A managing a government entity is also a related party (such as, for example, owner) of a private contractor of the government entity. The SPARQL query answering this question is:


PREFIX skos: <>
PREFIX : <>
SELECT ?p_name ?p ?proc ?c_name ?c ?c2_name ?c2 ?val ?cur
WHERE {
   ?proc a :GovernmentProcurement ;
         :principle ?c ;
         :contractor ?c2 ;
         :currencyValue ?val ;
         :currency ?cur .
   ?c  :hasInfluencingPerson ?p ;
       skos:prefLabel ?c_name .
   ?c2 :hasInfluencingPerson ?p ;
       skos:prefLabel ?c2_name .
   ?p skos:prefLabel ?p_name .
   FILTER (?c != ?c2)
   FILTER (regex(lcase(?p_name), "(.*община.*)|(.*държавата.*)|(.*министерство.*)"))
} ORDER BY DESC(?val)

The query above returns at least 455 potential issues for a total of 348,468,109 Bulgarian leva in sectors such as energy and forestry (Fig. 3).

Fig. 3. Query results
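For readers unfamiliar with SPARQL, the core join of the query above can be sketched in plain Python. This is only an illustration over in-memory data, not how GraphDB evaluates the query. The `Procurement` tuple, the `influencers` mapping and the function name are hypothetical, and the name filter on `?p_name` is left out for brevity.

```python
from collections import namedtuple

# Hypothetical in-memory stand-in for a :GovernmentProcurement resource.
Procurement = namedtuple("Procurement", "id principal contractor value currency")


def shared_influence(procurements, influencers):
    """Find persons who influence both parties of a procurement.

    influencers maps a company identifier to the set of persons linked to it,
    which is the role :hasInfluencingPerson plays in the query.
    Returns (person, procurement) pairs, highest procurement value first.
    """
    hits = []
    for proc in procurements:
        if proc.principal == proc.contractor:  # mirrors FILTER (?c != ?c2)
            continue
        common = influencers.get(proc.principal, set()) & influencers.get(
            proc.contractor, set()
        )
        for person in sorted(common):
            hits.append((person, proc))
    # Mirrors ORDER BY DESC(?val); the sort is stable, so ties keep name order.
    hits.sort(key=lambda hit: hit[1].value, reverse=True)
    return hits
```

Each returned pair corresponds to one row of the query result: a person linked to both the principal and the contractor of the same procurement.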

The team approached the case as data scientists and does not have the legal or commercial expertise to interpret this large number of potential conflicts of interest. It may in fact be the case that none of these potential conflicts of interest is illegal or even unethical. The team does believe, however, that someone with expertise in legal and commercial matters may benefit from using this linked dataset. This person could be an investigative journalist, a public representative or simply a concerned citizen.


In another example, based on the Commercial Register data, one can do a “board walk”, i.e. jump between companies that share board members, and discover cliques of companies. This reveals that certain individuals sit on the boards of many dozens of companies. Could this be done for tax evasion? Another idea the team had is to see which persons are most successful in receiving EU funds. For this task, one would have to add more publicly available information about EU procurements to the dataset. The possibilities seem endless.
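The "board walk" can be sketched as a breadth-first traversal. The sketch below assumes the board data has already been extracted as hypothetical `(company, person)` pairs, and it groups companies into connected components, which is a looser notion than a clique in the strict graph-theoretic sense. The function names are illustrative and not part of the team's code.

```python
from collections import defaultdict, deque


def company_groups(board_seats):
    """Group companies that are connected through shared board members.

    board_seats is an iterable of (company, person) pairs.
    Returns a list of sorted company lists, one per connected group.
    """
    members = defaultdict(set)  # company -> persons on its board
    seats = defaultdict(set)    # person -> companies they sit on
    for company, person in board_seats:
        members[company].add(person)
        seats[person].add(company)

    seen = set()
    groups = []
    for start in sorted(members):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        group = []
        while queue:
            company = queue.popleft()
            group.append(company)
            # Walk from the company to its board members, then on to
            # every other company those members sit on.
            for person in members[company]:
                for neighbour in seats[person]:
                    if neighbour not in seen:
                        seen.add(neighbour)
                        queue.append(neighbour)
        groups.append(sorted(group))
    return groups


def busiest_board_members(board_seats, minimum=2):
    """Rank persons by the number of distinct boards they sit on."""
    seats = defaultdict(set)
    for company, person in board_seats:
        seats[person].add(company)
    ranked = [(p, len(c)) for p, c in seats.items() if len(c) >= minimum]
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked
```

Large groups point to clusters of interlinked companies, while the ranking surfaces the individuals who hold seats on many boards.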

In all of its efforts the team used only publicly available data. However, by putting the data into a database, the team has vastly increased its searchability and usefulness, which is in the interest of society. The effort required to go from loosely structured public data posted online to Linked Open Data stored in a database is worth it, given the public service it provides. The dataset will be put online in the near future, and the team will gladly answer questions about its use.

Prepared by Viktor Senderov and Angel Marchev.
