How to Turn Wikipedia into a Structured Database


Last Tuesday we at the Data Science Society were delighted to organize the 3rd Sofia Open Data and Linked Data meetup. In line with our tradition of trying new venues, the event was held at the Telerik Academy thanks to our hosts from Telerik, a Progress Company. The other sponsors of the event were Ontotext and the DaPaaS research project, funded by the EC. Once again we reached several milestones:

– we streamed the event live to Beehive in Varna, Business Incubator Burgas and Pechatnitsata in Plovdiv;

– this was our first presentation in English, and we will strive to present in English from now on;

– we gave our guests the opportunity to ask questions online from their smartphones, tablets or laptops and to vote on the questions, all in real time, using http://sli.do;

– we had two speakers coming from two different organisations but sharing a single passion – to organize the wealth of data from Wikipedia into a searchable database.

Our first speaker was Dimitris Kontokostas – one of the leading researchers in the area of Linked Data and knowledge graphs, CTO and member of the executive team at DBpedia Association and part of the Agile Knowledge Engineering and Semantic Web (AKSW) Research Group in Germany. He is currently finishing his PhD at the University of Leipzig.

Dimitris introduced the idea behind DBpedia: to transform the unstructured knowledge in Wikipedia articles into a structured database. Information in Wikipedia consists of text, images and links, but every article can be summarized in an infobox. Unfortunately, infoboxes do not share a common format, and one of the challenges for the DBpedia project, which started in 2006, is to extract the information from this heterogeneous source and map it to a knowledge base. The database is organised as an RDF graph, published as Linked Data, and can be queried via the RDF query language SPARQL.
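To make the infobox idea concrete, here is a minimal Python sketch of how `| key = value` lines in an infobox could be turned into subject-predicate-object triples, the basic shape of an RDF graph. The wikitext, the article name "Exampleville" and the function `infobox_to_triples` are hypothetical and only for illustration; the real DBpedia extraction relies on community-maintained mappings and handles far messier input.

```python
import re

# Hypothetical wikitext for illustration; not taken from a real article.
WIKITEXT = """{{Infobox settlement
| name = Exampleville
| country = Bulgaria
| population_total = 1234
| area_total_km2 = 12.5
}}"""


def infobox_to_triples(subject, wikitext):
    """Turn `| key = value` infobox lines into (subject, predicate, object) triples."""
    triples = []
    for line in wikitext.splitlines():
        # Match lines of the form "| key = value", trimming surrounding whitespace.
        match = re.match(r"^\|\s*([^=|]+?)\s*=\s*(.*?)\s*$", line)
        if match and match.group(2):  # skip non-property lines and empty values
            triples.append((subject, match.group(1), match.group(2)))
    return triples


triples = infobox_to_triples("Exampleville", WIKITEXT)
for triple in triples:
    print(triple)
```

Each tuple plays the role of one edge in the graph. In a real setup the keys would be mapped to ontology properties and the values typed, which is where the heterogeneity of infobox formats becomes the hard part.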

You can find more information in Dimitris’ presentation here.

Our next speaker was Vladimir Alexiev, PhD, PMP – a leading expert at Ontotext in the area of ontology engineering and Linked Open Data, with over 15 years as head of R&D teams developing cutting-edge software technologies. Vladimir is currently leading projects related to the use of Semantic Technology in the cultural heritage and digital libraries domain, with organisations such as the British Museum, Europeana and Getty.

Vladimir made the audience aware of the unique challenges that Wikipedia data poses. For example, a wrong decimal mark put little-known villages in Bulgaria at the top of the ranking of the largest settlements by area. Some important information is not available in the templates but only in the main text of the articles. Information about musicians was also hard to categorize: most musicians were tagged as bands. The Bulgarian DBpedia team succeeded in improving the templates so that the correct information could be extracted. Vladimir also gave us a practical tour of mapping data from an unstructured article into the database of bg.dbpedia.org.
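The decimal-mark problem is easy to reproduce. The following Python sketch assumes one plausible mechanism, namely a value written with a decimal comma being read as if the comma were a thousands separator; the talk summary does not say exactly how the error arose. The value `"12,345"` and the helper `parse_area` are hypothetical.

```python
def parse_area(text, decimal_mark):
    """Parse a numeric string, given which character acts as the decimal mark."""
    # The other common separator is treated as a thousands separator and dropped.
    thousands = "," if decimal_mark == "." else "."
    cleaned = text.replace(thousands, "").replace(decimal_mark, ".")
    return float(cleaned)


raw = "12,345"  # hypothetical infobox value meaning 12.345 km² (decimal comma)

correct = parse_area(raw, decimal_mark=",")  # comma read as the decimal mark
wrong = parse_area(raw, decimal_mark=".")    # comma read as a thousands separator
print(correct, wrong)
```

Under this assumption the same string yields an area a thousand times too large, which is enough to push a small village to the top of an area ranking.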

Take a look at Vladimir's general presentation or his presentation about mapping if you want to learn more. A complete video of the event is also available here.

The lectures were followed by networking drinks and snacks generously supplied by our friends at Ontotext. Don’t miss our upcoming events and projects – stay tuned by visiting our website or following our Facebook page, LinkedIn page or Twitter account.

Author: Vladimir Labov
