With this article we are continuing our collaboration with Toptal. Toptal is an exclusive network that aims to connect the top freelance software developers, designers, and finance experts in the world to top companies for their most important projects.
The article is authored by Anthony Sistilli and was originally published in Toptal's blog.
Big data is everywhere. Period. In the process of running a successful business in today’s day and age, you’re likely going to run into it whether you like it or not.
Whether you’re a businessman trying to catch up to the times or a coding prodigy looking for their next project, this tutorial will give you a brief overview of what big data is. You will learn how it’s applicable to you, and how you can get started quickly through the Twitter API and Python.
What Is Big Data?
Big data is exactly what it sounds like—a lot of data. Alone, a single point of data can’t give you much insight. But terabytes of data, combined with complex mathematical models and serious computing power, can create insights human beings aren’t capable of producing. The value that big data analytics provides to a business is hard to overstate, and it surpasses human capabilities each and every day.
The first step to big data analytics is gathering the data itself. This is known as “data mining.” Data can come from anywhere. Most businesses deal with gigabytes of user, product, and location data. In this tutorial, we’ll be exploring how we can use data mining techniques to gather Twitter data, which can be more useful than you might think.
For example, let’s say you run Facebook, and want to use Messenger data to provide insights on how you can advertise to your audience better. Messenger has 1.2 billion monthly active users. In this case, the big data are conversations between users. If you were to individually read the conversations of each user, you would be able to get a good sense of what they like, and be able to recommend products to them accordingly. Using a machine learning technique known as Natural Language Processing (NLP), you can do this on a large scale with the entire process automated and left up to machines.
This is just one of the countless examples of how machine learning and big data analytics can add value to your company.
Why Twitter data?
Twitter is a gold mine of data. Unlike other social platforms, almost every user’s tweets are completely public and pullable. This is a huge plus if you’re trying to get a large amount of data to run analytics on. Twitter data is also pretty specific. Twitter’s API allows you to do complex queries like pulling every tweet about a certain topic within the last twenty minutes, or pull a certain user’s non-retweeted tweets.
A simple application of this could be analyzing how your company is received in the general public. You could collect the last 2,000 tweets that mention your company (or any term you like), and run a sentiment analysis algorithm over it.
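As a minimal sketch of what that last step might look like, here is a naive lexicon-based scorer. The word lists and scoring are purely illustrative stand-ins, not a real sentiment model:

```python
# Naive lexicon-based sentiment scoring -- purely illustrative,
# not a substitute for a trained sentiment analysis model.
POSITIVE = {"great", "love", "excellent", "happy", "good"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "poor"}

def sentiment_score(tweet_text):
    """Return (#positive words - #negative words) for one tweet."""
    words = tweet_text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def overall_sentiment(tweets):
    """Average the per-tweet scores over a collection of tweets."""
    if not tweets:
        return 0.0
    return sum(sentiment_score(t) for t in tweets) / len(tweets)

tweets = [
    "I love this company, great service!",
    "Terrible support, really bad experience.",
]
print(overall_sentiment(tweets))  # 0.0 -- the two tweets cancel out
```

In practice you would replace the word sets with a proper sentiment lexicon or a trained classifier, but the shape of the computation over your 2,000 collected tweets is the same.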
We can also target users that specifically live in a certain location, which is known as spatial data. Another application of this could be to map the areas on the globe where your company has been mentioned the most.
As you can see, Twitter data can be a large door into the insights of the general public, and how they receive a topic. That, combined with the openness and the generous rate limiting of Twitter’s API, can produce powerful results.
To connect to Twitter’s API, we will be using a Python library called Tweepy, which we’ll install in a bit.
Twitter Developer Account
In order to use Twitter’s API, we have to create a developer account on the Twitter apps site.
- Log in or make a Twitter account at https://apps.twitter.com/.
- Create a new app (button on the top right).
- Fill in the app creation page with a unique name, a website name (use a placeholder website if you don’t have one), and a project description. Accept the terms and conditions and proceed to the next page.
- Once your project has been created, click on the “Keys and Access Tokens” tab. You should now be able to see your consumer secret and consumer key.
- You’ll also need a pair of access tokens. Scroll down and request those tokens. The page should refresh, and you should now have an access token and access token secret.
We’ll need all of these later, so make sure you keep this tab open.
Tweepy is an excellently supported tool for accessing the Twitter API. It supports Python 2.6, 2.7, 3.3, 3.4, 3.5, and 3.6. There are a couple of different ways to install Tweepy. The easiest way is typing

pip install tweepy

into your terminal.
You can follow the instructions on Tweepy’s GitHub repository. The basic steps are as follows:
git clone https://github.com/tweepy/tweepy.git
cd tweepy
python setup.py install
You can troubleshoot any installation issues there as well.
Now that we have the necessary tools ready, we can start coding! The baseline of each application we’ll build today requires using Tweepy to create an API object which we can call functions with. In order to create the API object, however, we must first authenticate ourselves with our developer information.
First, let’s import Tweepy and add our own authentication information.
import tweepy

consumer_key = "wXXXXXXXXXXXXXXXXXXXXXXX1"
consumer_secret = "qXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXh"
access_token = "9XXXXXXXX-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXi"
access_token_secret = "kXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXT"
Now it’s time to create our API object.
# Creating the authentication object
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# Setting your access token and secret
auth.set_access_token(access_token, access_token_secret)
# Creating the API object while passing in auth information
api = tweepy.API(auth)
This will be the basis of every application we build, so make sure you don’t delete it.
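For example, with the api object in hand, you could wrap Twitter’s search endpoint in a small helper. fetch_mentions is a hypothetical name of ours; api.search is Tweepy’s search call, whose exact signature may vary between Tweepy versions:

```python
def fetch_mentions(api, query, count=100):
    """Pull the text of recent tweets matching `query`.

    `api` is any object exposing a Tweepy-style search(q=..., count=...)
    method, which also makes the helper easy to test with a stub.
    """
    return [tweet.text for tweet in api.search(q=query, count=count)]
```

Calling fetch_mentions(api, "your company", 200) would then hand you a list of tweet texts ready for the sentiment analysis described earlier.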
This article marks the beginning of our collaboration with Toptal. Toptal is an exclusive network that aims to connect the top freelance software developers, designers, and finance experts in the world to top companies for their most important projects.
The article is authored by Rogelio Nicolas Mengual and was originally published in Toptal's blog.
As you may know, the Foreign Exchange (Forex) market is used for trading between currency pairs. But you might not be aware that it’s the most liquid market in the world.
A few years ago, driven by my curiosity, I took my first steps into the world of Forex trading algorithms by creating a demo account and playing out simulations (with fake money) on the Meta Trader 4 trading platform.
After a week of ‘trading’, I’d almost doubled my money. Spurred on by my own success, I dug deeper and eventually signed up for a number of forums. Soon, I was spending hours reading about algorithmic trading systems (rule sets that determine whether you should buy or sell), custom indicators, market moods, and more.
My First Client
Around this time, coincidentally, I heard that someone was trying to find a software developer to automate a simple trading system. This was back in my college days when I was learning about concurrent programming in Java (threads, semaphores, and all that junk). I thought that this automated system couldn’t be much more complicated than my advanced data science coursework, so I inquired about the job and came on board.
The client wanted the system built with MQL4, a functional programming language used by the Meta Trader 4 platform for performing stock-related actions.
The role of the trading platform (Meta Trader 4, in this case) is to provide a connection to a Forex broker. The broker then provides a platform with real-time information about the market and executes your buy/sell orders. For readers unfamiliar with Forex trading, here’s the information that is provided by the data feed:
Through Meta Trader 4, you can access all this data with internal functions, accessible in various timeframes: every minute (M1), every five minutes (M5), M15, M30, every hour (H1), H4, D1, W1, MN.
The movement of the Current Price is called a tick. In other words, a tick is a change in the Bid or Ask price for a currency pair. During active markets, there may be numerous ticks per second. During slow markets, there can be minutes without a tick. The tick is the heartbeat of a Forex robot.
When you place an order through such a platform, you buy or sell a certain volume of a certain currency. You also set stop-loss and take-profit limits. The stop-loss limit is the maximum amount of pips (price variations) that you can afford to lose before giving up on a trade. The take-profit limit is the amount of pips that you’ll accumulate in your favor before cashing out.
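As a rough illustration of the arithmetic, here is a sketch of how those limits translate into price levels. The 0.0001 pip size applies to most, but not all, currency pairs (JPY pairs, for example, use 0.01):

```python
PIP = 0.0001  # pip size for most major pairs (JPY pairs use 0.01)

def exit_levels(entry_price, stop_loss_pips, take_profit_pips, buy=True):
    """Price levels at which a trade is closed at a loss or at a profit."""
    direction = 1 if buy else -1
    stop_loss = entry_price - direction * stop_loss_pips * PIP
    take_profit = entry_price + direction * take_profit_pips * PIP
    return stop_loss, take_profit

# Buying EUR/USD at 1.1000 with a 50-pip stop-loss and 100-pip take-profit:
sl, tp = exit_levels(1.1000, 50, 100)
print(round(sl, 4), round(tp, 4))  # 1.095 1.11
```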
The client’s algorithmic trading specifications were simple: they wanted a robot based on two indicators. For background, indicators are very helpful when trying to define a market state and make trading decisions, as they’re based on past data (e.g., highest price value in the last n days). Many come built-in to Meta Trader 4. However, the indicators that my client was interested in came from a custom trading system.
They wanted to trade every time two of these custom indicators intersected, and only at a certain angle.
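The crossing condition can be sketched in Python (the actual robot was written in MQL4; the 30-degree default here is an arbitrary stand-in for the client’s angle requirement, and the slope-to-angle conversion is one simple way to interpret “angle” between indicator lines):

```python
import math

def crossed_at_angle(fast_prev, fast_now, slow_prev, slow_now, min_angle_deg=30.0):
    """True if the fast indicator crossed the slow one between two readings
    and the angle between the two lines exceeds the threshold."""
    # A sign change in the difference means the lines crossed.
    crossed = (fast_prev - slow_prev) * (fast_now - slow_now) < 0
    if not crossed:
        return False
    # Slope of each line over one time step, converted to an angle.
    angle_fast = math.degrees(math.atan(fast_now - fast_prev))
    angle_slow = math.degrees(math.atan(slow_now - slow_prev))
    return abs(angle_fast - angle_slow) >= min_angle_deg

print(crossed_at_angle(1.0, 2.0, 1.5, 1.4))  # True: steep crossing
```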
As I got my hands dirty, I learned that MQL4 programs have the following structure:
· [Preprocessor Directives]
· [External Parameters]
· [Global Variables]
· [Init Function]
· [Deinit Function]
· [Start Function]
· [Custom Functions]
The start function is the heart of every MQL4 program since it is executed every time the market moves (ergo, this function will execute once per tick). This is the case regardless of the timeframe you’re using. For example, you could be operating on the H1 (one hour) timeframe, yet the start function would execute many thousands of times per timeframe.
To work around this, I forced the function to execute once per period unit:
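The original MQL4 snippet is in the GitHub repository linked below; the idea, sketched here in Python, is to remember the timestamp of the last bar you acted on and return early until a new bar opens (bar_open_time and on_tick are illustrative names of ours):

```python
last_bar_time = None  # open time of the last bar we acted on

def on_tick(bar_open_time):
    """Called on every tick; does real work only once per bar."""
    global last_bar_time
    if bar_open_time == last_bar_time:
        return False  # still inside the same period unit -> skip
    last_bar_time = bar_open_time
    # ... indicator reads, decision logic and order sending go here ...
    return True

# Three ticks inside one H1 bar, then the first tick of the next bar:
print([on_tick(t) for t in ("09:00", "09:00", "09:00", "10:00")])
# [True, False, False, True]
```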
From there, the program gets the values of the indicators, runs the decision logic (including the intersection of the indicators and their angles), and sends the orders.
If you’re interested, you can find the complete, runnable code on GitHub.
Once I built my algorithmic trading system, I wanted to know: 1) if it was behaving appropriately, and 2) if it was any good.
Back-testing is the process of testing a particular (automated or not) system under the events of the past. In other words, you test your system using the past as a proxy for the present.
MT4 comes with an acceptable tool for back-testing a Forex trading system (nowadays, there are more professional tools that offer greater functionality). To start, you set up your timeframes and run your program under a simulation; the tool will simulate each tick, knowing that for each unit it should open at a certain price, close at a certain price, and reach specified highs and lows.
After comparing the actions of the program against historic prices, you’ll have a good sense for whether or not it’s executing correctly.
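Stripped to its essence, a back-test is just a replay loop. This toy version is our own simplification, ignoring spread, slippage, and lot sizing, but it shows the shape of the computation:

```python
def backtest(closes, signal):
    """Replay historical closing prices through a signal function.

    `signal(prev, now)` returns +1 (long), -1 (short) or 0 (flat) for the
    next bar; profit is the price change times the position held.
    Ignores spread, slippage and position sizing -- a toy, not a simulator.
    """
    balance = 0.0
    for prev, now, nxt in zip(closes, closes[1:], closes[2:]):
        position = signal(prev, now)
        balance += position * (nxt - now)
    return balance

# Trend-following toy signal: go long if the last bar rose, short if it fell.
trend = lambda prev, now: 1 if now > prev else -1
print(round(backtest([1.0, 1.1, 1.2, 1.1, 1.0], trend), 2))  # 0.1
```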
From back-testing, I’d checked out the robot’s return ratio for some random time intervals; needless to say, I knew that my client wasn’t going to get rich with it—the indicators that he’d chosen, along with the decision logic, were not profitable. As a sample, here are the results of running the program over the M15 window for 164 operations:
Note that our balance (the blue line) finishes below its starting point.
One caveat: saying that a system is "profitable" or "unprofitable" isn't always genuine. Often, systems are (un)profitable for periods of time based on the market's "mood":
Parameter Optimization, and its Lies
Although back-testing had made me wary of this robot’s usefulness, I was intrigued when I started playing around with its external parameters and noticed big differences in the overall Return Ratio. This particular science is known as Parameter Optimization.
I did some rough testing to try and infer the significance of the external parameters on the Return Ratio and came up with something like this:
Or, cleaned up:
You may think (as I did) that you should use the Parameter A. But the decision isn’t as straightforward as it may appear. Specifically, note the unpredictability of Parameter A: for small error values, its return changes dramatically. In other words, Parameter A is very likely to over-predict future results since any uncertainty, any shift at all will result in worse performance.
But indeed, the future is uncertain! And so the return of Parameter A is also uncertain. The best choice, in fact, is to rely on predictability. Often, a parameter with a lower maximum return but superior predictability (less fluctuation) will be preferable to a parameter with a high return but poor predictability.
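One crude way to encode that preference (entirely our own illustration, not the article’s original analysis) is to score each parameter setting by its worst observed outcome across nearby, perturbed parameter values, rather than by its best one:

```python
def worst_case_score(returns):
    """Score a parameter setting by its worst observed return across
    nearby (perturbed) parameter values, rather than its best."""
    return min(returns)

# Hypothetical return ratios measured around each parameter value:
param_a = [9.0, 2.0, -3.0]   # high peak that collapses with small shifts
param_b = [4.0, 3.8, 3.5]    # lower peak, but a stable neighbourhood

best = max(["A", "B"],
           key=lambda p: worst_case_score(param_a if p == "A" else param_b))
print(best)  # B
```

Maximizing the worst case picks Parameter B here even though Parameter A has the higher peak, which is exactly the trade-off discussed above.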
The only thing you can be sure of is that you don’t know the future of the market, and thinking you know how the market is going to perform based on past data is a mistake. In turn, you must acknowledge this unpredictability.
This does not necessarily mean we should use Parameter B, because even the lower returns of Parameter A perform better than Parameter B; this is just to show you that optimizing parameters can result in tests that overstate likely future results, a subtlety that is easy to miss.
Overall Forex Algorithmic Trading Considerations
Since that first algorithmic Forex trading experience, I’ve built several automated trading systems for clients, and I can tell you that there’s always room to explore. For example, I recently built a system based on finding so-called “Big Fish” movements; that is, huge pips variations in tiny, tiny units of time. This is a subject that fascinates me.
Building your own simulation system is an excellent option to learn more about the Forex market, and the possibilities are endless. For example, you could try to decipher the probability distribution of the price variations as a function of volatility in one market (EUR/USD for example), and maybe build a Monte Carlo simulation model using the distribution per volatility state, with whatever degree of accuracy you want. I’ll leave this as an exercise for the eager reader.
The Forex world can be overwhelming at times, but I hope that this write-up has given you some points on how to get going.
I’ve read extensively about the mysterious world that is the Forex market. Here are a few write-ups that I recommend for programmers and enthusiastic readers:
· BabyPips: This is the starting point if you don’t know squat about Forex trading.
· The Way of the Turtle, by Curtis Faith: This one, in my opinion, is the Forex Bible. Read it once you have some experience trading.
· Expert Advisor Programming – Creating Automated Trading Systems in MQL for Meta Trader 4, by Andrew R. Young
· Trading Systems – A New Approach to System Development and Portfolio Optimisation, by Urban Jeckle and Emilio Tomasini: Very technical, very focused on testing.
· A Step-By-Step Implementation of a Multi-Agent Currency Trading System, by Rui Pedro Barbosa and Orlando Belo: This one is very professional, describing how you might create a trading system and testing platform.
In the past few months we completed a lot of exciting projects, and now we are growing! It’s time to take it one step further! That’s why we are looking for somebody proactive, enthusiastic, and outgoing to take up the role of leader of the Data Science Society community and main events organiser.
What will you do?
In short, you will ensure the daily operations go smoothly. You will be the main administrative and operations point-of-contact for partnering organisations, sponsoring companies, new members and key activists:
Day-to-day handling of event & partnership inquiries
Support partner & customer relations - maintain an updated database of contacts and stay in regular touch
Organise DSS monthly meetups/seminars (logistics, promotion, on-site coordination)
Write blog posts, articles, newsletters, communications materials
Create rich media for social platforms: pictures, video and written content
Support the core of key activists in organising regular internal meetings, keep to-dos updated, and follow up with responsible parties
Liaise with external partners for the building & update of website, strategic marketing & PR activities
Administer the organisation's documents and finances
Why should you apply?
You will have the chance to work with some of the brightest minds in data science in Bulgaria and cooperate with them on visionary projects, all aimed at increasing the expertise in data processing and analytics throughout the local tech community. You will have the freedom to develop ideas and processes on your own. You will help tech people to apply innovations and cutting-edge technologies in data science. And you will have lots of fun!
Any skills you need to bring to the table?
Just excellent spoken and written English, some experience with event management, a keen interest in the tech industry, and an outgoing personality.
How to apply?
If you feel creative and want to send your application in another format, we’ll welcome it.
From 24 to 26 March, Data Science Society organized a Datathon - the first-of-its-kind data analysis competition in Central and Eastern Europe. The event was held on the grounds of Software University in Sofia with the support of partner companies and organisations such as Kaufland, Telenor, Experian, HyperScience, ReceiptBank, SAP, ShopUp, A4E, GemSeek, Ontotext, Helecloud, VMware, NSI, and Open Government from the Council of Ministers.
The Data Science Society team and the partner companies provided various business cases in the field of data science, offering challenges to the participants, who set out to solve them in less than 48 hours.
At the end of the event, 16 teams presented the results of a weekend of work. Within the next few months, Data Science Society plans to present each team, with their solutions and results, at its regular online and offline events.
The first such event is planned for late May and will feature the winning team: Iva Delcheva, Nikolay Petrov, Yasen Kiprov, and Viktor Senderov. They worked on both cases: (1) the Bulgarian Commercial Register task to RDF-ize publicly accessible data about companies in Bulgaria (a case provided by Ontotext), and (2) analyzing and exploring data about public procurements in Bulgaria (a case provided by the Open Data portal of the Bulgarian government). They decided to join the two datasets together, generating a Linked Open Data dataset in RDF which they could then query and analyze. Below is a summarized description by Viktor Senderov of how they fulfilled this task with the help of GraphDB.
The Bulgarian Commercial Register (Търговски регистър) is available online as a set of XML files and it covers deeds from 2008 onwards. A deed is a legal term describing the entering into the register of data pertaining to a company such as address, managers, legal status, etc. The data for one company is distributed amongst several deeds and needs to be aggregated. Ontotext recognized the need and offered both a data model for commercial register data and a Java program to RDF-ize the data (see Fig. 1).
Fig. 1 Simplified data model for the Bulgarian Commercial Register.
In addition to the model, mentors suggested a scheme for issuing URIs to companies, allowing for easy merging of the data. The identifier each company gets is `:Company_UIC`, where UIC is the Unified Identification Code of the company.
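Minting those identifiers is a one-liner; the base namespace below is a placeholder of ours, not Ontotext's actual base URI:

```python
BASE = "http://example.org/resource/"  # placeholder namespace, not the real one

def company_uri(uic):
    """Build the URI for a company from its Unified Identification Code,
    following the :Company_UIC naming scheme."""
    return f"{BASE}Company_{uic}"

print(company_uri("123456789"))
# http://example.org/resource/Company_123456789
```

Because the UIC is unique per company, any dataset that records UICs can mint the same URI and thereby link to the same node in the graph.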
Data for Bulgarian public procurement from the Public Procurement Agency was also made available. It has information from 2007 to the middle of 2016. The data is in CSV format, with columns for the principal and the contractors of the procurement, the procurement objective, its value in a currency, etc. They modeled these entities as in Fig. 2.
Fig. 2. Simplified data model for the public procurements.
RDF-ization of this CSV set can be done with OpenRefine, which is included in GraphDB, but in this case it was done with a custom R script originally written for a bioinformatics project of Viktor Senderov. By using the same identification scheme for the companies participating in public procurement as for the Commercial Register, they were able to link the two datasets.
In addition to these two major datasets, they interlinked companies to their geo-coordinates by utilizing the Google API.
The resulting RDF dataset, a set of Turtle files, was uploaded to a GraphDB 8 installation running on the Amazon cloud. The size of the uploaded data is approximately 12.5 million triples (more than 2 GB of uncompressed data). The data was not only aggregated and formatted for easy querying, but also connected to previously disconnected information. One interesting question that can be explored with this linked dataset is the question about conflicts of interest.
A conflict of interest may arise if a person A managing a government entity is also a related party (such as, for example, owner) of a private contractor of the government entity. The SPARQL query answering this question is:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
The query above returns at least 455 potential issues for a total of 348,468,109 Bulgarian Leva in sectors such as energy and forestry (Fig. 3).
Fig. 3. Query results
The team started solving the case as data scientists, and they do not have the legal or commercial expertise to interpret this vast number of potential conflicts of interest. It may in fact be the case that none of these potential conflicts of interest is illegal or even unethical. The team does believe, however, that someone with expertise in legal and commercial matters may benefit from using this linked dataset. This person could be an investigative journalist, a public representative, or simply a concerned citizen.
In another example, based on the Commercial Register data, one can do a “board walk”, i.e. jump from company to company that share board members and discover cliques of companies. This reveals that certain individuals sit on the boards of many dozens of companies. Could this be a vehicle for tax evasion? Another idea the team had is to see which persons are most successful in receiving EU funds. For this task, one would have to include more publicly available information about EU procurements in the dataset. The possibilities seem endless.
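The "board walk" itself needs nothing more than a graph traversal over shared board members. A minimal sketch, with invented company and person names:

```python
# Invented example data: company -> set of board members.
boards = {
    "AlphaCo": {"Ivanov", "Petrov"},
    "BetaCo":  {"Petrov", "Georgiev"},
    "GammaCo": {"Georgiev"},
    "DeltaCo": {"Stoyanov"},
}

def board_walk(start, boards):
    """All companies reachable from `start` by hopping between
    companies that share at least one board member."""
    reachable, frontier = {start}, [start]
    while frontier:
        company = frontier.pop()
        for other, members in boards.items():
            if other not in reachable and boards[company] & members:
                reachable.add(other)
                frontier.append(other)
    return reachable

print(sorted(board_walk("AlphaCo", boards)))
# ['AlphaCo', 'BetaCo', 'GammaCo']
```

DeltaCo shares no board member with the others, so it stays outside the clique; on the real register this same walk surfaces the multi-board individuals mentioned above.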
Throughout their efforts, the team used only publicly available data. However, by putting the data into a database, they have vastly increased its searchability and usefulness, which is in the interest of society. The effort required to go from loosely structured public data posted online to Linked Open Data stored in a database is worth it, given the public service. The dataset will be put online in the near future, and the team will gladly answer questions pertaining to its use.
Prepared by Viktor Senderov and Angel Marchev.
Don't miss the unique chance to hear from Dr. Preslav Nakov how Natural Language Processing (NLP) may expose paid opinion manipulation trolls on internet community forums.
The Department of Computational Linguistics (DCL) is proud to announce the Second Conference on Computational Linguistics in Bulgaria.
CLIB is an international conference that aims to foster the NLP community in Bulgaria and further the cooperation with Bulgarian researchers working in NLP around the world through establishing a forum for sharing high-quality scientific work in all areas of computational linguistics and NLP.
The Second CLIB Conference will be held in Sofia, Bulgaria, on September 9, 2016. It is organised by the team of the Department of Computational Linguistics at the Institute for Bulgarian Language in conjunction with the Faculty of Slavic Studies and the Faculty of Mathematics and Informatics at Sofia University.
Learn more here.
Interested in how biometrics works and how it is used against terrorism? Or just curious how smart machines have become and how long before they turn into Skynet? In both cases, we at the Data Science Society have great news for you! At the beginning of September you can participate in two major events organized by the IEEE Young Professionals AG of Bulgaria and supported by the Data Science Society.
1) Joint meeting on "Computational Intelligence"
On September 3, you can participate in our Joint Meeting on "Computational Intelligence", organized in partnership with the Bulgarian joint chapter of Computational Intelligence/Systems, Man and Cybernetics. The event is a specialized seminar within the "Distinguished Lecturer" program of the IEEE CIS Society. We will have as a guest Prof. James Bezdek, IEEE Fellow and pioneer in the field of fuzzy clustering.
ABSTRACT: Anomalies in Wireless Sensor Networks (WSNs): (i) isolated and epoch anomalies internal to a node, aberrant behavior of an entire node, and anomalous subtrees; (ii) models that use data capture by level sets of ellipsoids; (iii) models that use visual assessment of elliptical summaries; (iv) measures of (dis)similarity on sets of ellipsoids; (v) visual evidence for cluster tendency in sets of ellipsoids; (vi) numerical examples using single linkage clustering on real WSN data from the IBRL network, the Great Barrier Reef Ocean Observation System, and the Grand St. Bernard pass.
The meeting will be hosted at Software University in Sofia and starts at 11.00 on September 3, 2016.
You can find more information about the talk and the lecturer, and register for the event, here.
2) IEEE Summer School on "Systems, Man and Cybernetics"
On September 7, you can participate in our IEEE Summer School on "Systems, Man and Cybernetics", a specialized event within the official program of the International IEEE conference on "Intelligent Systems'16", which is to be held in Sofia between 4 and 6 September (http://ieee-is.org).
The Summer School on SMC is an accompanying event to the conference and is open to all: the conference participants, the local IEEE community, as well as students and non-members, as a seminar for professional development.
3D surface reconstruction - Prof. Vincenzo Piuri (IT)
Linguistic Geometry: Constructing Strategies for Adversarial Games - Prof. Boris Stilman (USA)
How big is too big? (Mostly) c-Means Clustering in Big Data - Prof. James Bezdek (USA)
New Development of Bio-metrics and Forensics, AI, PR and Big Data in Interactive Learning Environment - Prof. Patrick Wang (USA)
Switched Fuzzy Systems: New Directions for Intelligent Control and Decision - Prof. Georgi Dimirovski (TR/MK)
The event is sponsored by the IEEE SMC society and organized by the local IEEE CIS/SMC chapters and the IEEE Young Professionals AG of Bulgaria. The seminar is scheduled as half day event, where distinguished lecturers and IEEE members of higher level will deliver inspirational talks on subjects related to Cybernetics, Intelligent Systems and Computational Intelligence!
The meeting will be hosted at Software University in Sofia and starts at 10.00 on September 7, 2016.
You can find more information about the talks and the lecturers, and register for the event, here.
We are glad to announce the final results of one of our ambitious data science projects - a study of higher education public data in Bulgaria. Since the topic is of interest primarily to the Bulgarian audience, the study was originally published in Bulgarian. We would like to express our gratitude once again to our dedicated members Anton Nenov and Tony Getova for their excellent work on this project.
There are currently over 3,900 active degree programmes in Bulgaria's higher education institutions.
This is shown by the results of Ministry of Education and Science data, collected and summarised by Data Science Society for the period 2005-2015.
On its website, the Ministry of Education provides a detailed breakdown by university, programme, and professional field, but it can only be traced within the current year and for a specific university.
The goal of the Data Science Society project was to consolidate the data into a historical archive, so that the resulting database could be analysed easily and accessibly.
The idea is for students and prospective students to be able to orient themselves quickly and easily about the situation in the particular universities, cities, or types of programmes they are considering or already attending.
The initial processing of the data revealed several interesting trends. Although the overall number of programmes is increasing, this does not follow from, and cannot be linked to, growth in student numbers as a whole, even though individual universities have managed to increase their enrolment over the years. One such university is UNWE (УНСС). Sofia University (СУ), by contrast, has not managed to increase its enrolment.
The reasons for the growth in UNWE's student numbers could be many and varied, from the opening of new programmes to rising interest among prospective students. The current data, however, cannot give a definitive answer to this question.
If we look at the number of full-time students by city, it is notable that Sofia and Varna show a slight decline in recent years, while Plovdiv's numbers are progressively increasing:
An explanation may lie in the city's growing business opportunities, as well as in the activity of the local higher education institutions themselves.
On this map we have positioned all universities in Bulgaria according to the growth in their numbers of students and programmes.
Besides the universities in Plovdiv, the military and medical higher education institutions also stand out, having attracted quite a few students over the last several years.
Of course, these are only some of the results that could be drawn from this data. With the appropriate tooling, the data could be examined at an even more detailed level. Data Science Society is currently developing such tooling. More results are forthcoming, and one of the future goals is to pay special attention to master's programmes, which account for the bulk of the increase in the number of programmes over the last 10 years.
Video of the presentation at Betahaus can be found below:
One of the most important business databases in Bulgaria, the Commercial Register, has finally joined the list of open databases! The published database covers the period between January 2008 and March 2016 and can be downloaded here as a 1.6 GB zip file. Data dictionaries and more information are available here. The data is presented as XML files and contains the changes in company files at the Registry Agency for each day in the period. This opens a lot of exciting opportunities for data mining, for example using graph analysis to find clusters of related companies by common owners, representatives, or addresses. We will keep you posted about any interesting developments.
Spring is in full bloom and so are our activities.
Come have a beer with us and let's talk data science trends and consumer demand forecasting on April 20th!
DATA Talk & Beer held by A4Everyone
When: 20/04/2016 19:00
Where: Sofia Tech Park Gallery, Incubator Building
The event, as usual, is free, but we would kindly ask you to register in advance.
Non-formal talk & drink dedicated to:
• A4Everyone – marketplace for data science
• Data science - new trends
• Consumer demand forecasting – challenges, applications, solutions
Analytics for everyone (A4Everyone) will host DATA Talk & Beer on April 20th, 2016 at Sofia Tech Park. All members of Data Science Society are welcome, as is everyone interested in this specific scientific field.
In the non-formal environment of the Sofia Tech Park Gallery, located in the Incubator building, we'll present the opportunities A4Everyone is opening as a marketplace for data scientists. To make the DATA Talk & Beer not just pleasant but useful, our chief scientist Alexander Efremov will share in-depth professional experience of creating a consumer demand forecasting algorithm. We are going to cover not just the challenges, but also possible solutions and applications.
When: April 8, 2016, 19:00
Where: Betahaus - 58 Krum Popov Str, 1421 Sofia, Bulgaria
Free tickets available: via Eventbrite
In 2014 there were 290,000 university students in Bulgaria according to statistics by the Ministry of Education and Science, taking classes in 82 universities and colleges around the country. The numbers speak for themselves - you can easily check which are the biggest universities or which offer the widest array of programmes.
Our data science experts worked with over 35,000 publicly available data records to search for answers to the more interesting questions -
Which universities are growing the fastest?
What trends can be spotted for the different professional fields?
What is the general direction higher education in Bulgaria is taking?
Come visit the Data Science Society workshop on the 8th of April at Betahaus, where we will present the data and the directions our research has taken. Let's try to find together the answers about higher education that we are all looking for.
In another milestone for our community, we are delighted to announce that we have become an official media partner of The Chief Analytics Officer Forum. As we are growing and developing, we are receiving more and more recognition from other organisations, which allows us to offer even more learning and development opportunities to our members and friends. One such opportunity is the chance to meet fellow analytics professionals from leading international companies, including the new breed of executives - chief analytics officers - managers responsible for the analysis of data.
The Chief Analytics Officer Forum Series is the premier event for Chief Analytical Officers and senior analytics professionals, providing top-level strategic advice and discussion. The CAO Forum brings the senior analytics community together to discuss the most critical data and analytics challenges faced by their organisations and the wider industry as a whole.
If you want to explore and examine the rise of the Chief Analytics Officer and analytics leadership in depth, visit the forum on 7th-9th March 2016 in London. Hear analytics thought leaders from the likes of UNILEVER, JUST GIVING, RBS, EBAY and VODAFONE discuss strategies for building an analytically driven enterprise.
One of the hottest topics in the financial markets recently has been the ability of Deutsche Bank to meet the payments on its CoCo bonds, as evidenced by Bloomberg's coverage of the topic. CoCo stands for contingent convertible, a mixture between a bond and a stock. The contingent part means that the issuer may temporarily stop interest payments when it runs into trouble, or even deny them altogether by turning the bonds into stocks (hence the convertible part). Of course, all this risk bears a higher return - to learn how high, according to a Bulgarian PhD student and risk practitioner, join this seminar at the Bulgarian Academy of Sciences!
When: February 12, 2016, 18.00
Where: Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, room 403
Who: Krasimir Milanov (Finanalytica, PhD student at IMI)
Topic: "A complete CoCo bond pricing methodology"
Here is a short version of the abstract from SSRN:
The aim of the present research is to provide a new CoCo bond pricing method to assist analyses of both equity investors and fixed income investors. For this reason, we develop models in terms of PDEs where the spatial variable is the underlying stock. By using these approaches, one will be able to calculate delta, gamma, and any kind of duration and convexity for CoCo bonds including the callability feature. Two groups of approaches are developed. The first group is based on the primary market assumptions of Black-Scholes, and the second one involves credit risk modeling by means of jump to default stock dynamics.
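For readers less familiar with the framework the abstract mentions, the first group of approaches builds on the classical Black-Scholes setting, where the price \(V(S,t)\) of a claim on the underlying stock \(S\) satisfies the standard textbook PDE (given here for orientation only, not taken from the paper itself):

```latex
\frac{\partial V}{\partial t}
+ \frac{1}{2}\sigma^2 S^2 \frac{\partial^2 V}{\partial S^2}
+ r S \frac{\partial V}{\partial S}
- r V = 0
```

with \(r\) the risk-free rate and \(\sigma\) the stock volatility. The CoCo-specific features (coupon cancellation, conversion, callability) enter, roughly speaking, through additional boundary and jump conditions on top of this baseline, and the second group of models replaces the pure diffusion with jump-to-default stock dynamics.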
The Institute of Information and Communication Technologies (IICT) at BAS, together with Data Science Society, is pleased to invite you to the Open Days of the AComIn project (Advanced Computing for Innovation), funded under the EU Seventh Framework Programme in 2012-2016. The event marks the conclusion of the project and will be held on 15-16 January 2016.
The purpose of AComIn is to strengthen the scientific and innovative potential of IICT, both through increasing the knowledge, skills and international contacts of its scientists from various fields and by purchasing modern equipment. The project's scientific achievements will be presented through posters, including the results of foreign postdocs in AComIn, visiting professors and young scientists from the IICT institute, as well as patent proposals. Innovative prototypes will be demonstrated using the appliances in the "Smart Lab": an industrial tomography 3D scanner, a 3D printer, a 3D video wall, a thermal camera, a fast camera, an acoustic chamber, a laser granulometer and a recording studio.
These developments contribute to the growth of areas such as new materials, improved quality of life, three-dimensional technology and digitised prototyping, efficient transport, non-destructive testing, Bulgarian language technologies and others.
The AComIn Open Days will be held on Friday and Saturday, 15-16 January 2016, in the two buildings of the Institute of Information and Communication Technologies, BAS, Sofia, "Acad. Georgi Bonchev" Str., Block 2 and Block 25A. Start: 10:00.
We will be glad to see you at the event.
We, the members of Data Science Society, are looking for you - a skilled expert in the domains of statistics, data mining or predictive analysis. Join our efforts to accelerate innovation and promote open data! Become a speaker at one of our talks, a trainer in a course or even a lecturer in the master's degree programme we are preparing.
Data Science Society reaches over 600 industry experts, learners and enthusiasts and has organised a number of events with the best in the data science field.
Let us get to know you: answer a few questions in our special survey! Our team will review your application and will get in contact with you.
Still new to the field of data science, but equally passionate about the magic you can work with data?
Become a part of our team and help us set the foundations of data science collaboration between the business, educational and science organisations. We have a lot going on - from our regular meetups to a brand new online space to bring together the best global experts.
On the 7th of December at 14:00 in hall 403 of the Institute of Mathematics and Informatics (IMI), BAS:
Astroinformatics is a new field which connects astronomy, information technology and computer science. Astronomy is currently in transition from a data-scarce to a data-intensive science. The new transdisciplinary COST action TD1403, Big Data in Sky and Earth Observations, started in January 2015. Our goal is to bring together astronomers, geophysicists and computer scientists to exchange experience and collaborate, especially in data mining, database structure, visualization, education and outreach. Two projects are at the forefront of Big Data in astronomy - the Square Kilometre Array (SKA) and the Large Synoptic Survey Telescope (LSST). Using LSST as an example, the presenter will review several aspects, from collecting data and moving it around to reducing it and enabling its use in science. One problem they face is that currently available codes do not scale well when applied to large datasets. He will make a few suggestions on where additional effort is needed.
Friends of the Data Science Society had a couple of occasions to celebrate this week. First, after a longer than usual pause we are back in the event organizing business. Second, the topic our speaker Ekaterina presented stands at the fascinating intersection of machine learning, mobile application development and healthy lifestyle. As a bonus, our audience had the opportunity to explore a new venue - the cosy "Tell me bar" close to the National Theatre and the "Bulgarian Broadway" - Rakovski street.
Our speaker, Ekaterina, is the founder of Sugarwise, a young social business aimed at improving people's health. Sugarwise is the reason for Ekaterina to go into Image Processing. She has worked on practical machine translation and is a huge machine learning enthusiast. She is a co-organizer of the Lisbon Open Data Meetup, and in her spare time loves to pass her practical knowledge forward and learn. She will gladly chat with anyone who is interested in tech, the future of tech and the business, and explore together how to find solutions to current social problems.
The challenge Ekaterina embarked on tackling that evening was reading the nutrition information from a yoghurt cup. Our smart audience immediately suggested a clear-cut solution - read it from the bar code. As Ekaterina pointed out, this would only have worked for the USA, which doesn't fit the global scale on which she wants to solve the problem. Another common solution is to use an OCR (Optical Character Recognition) engine such as Tesseract. Unfortunately, OCR works well with black characters on a white background, which is not the case with food labels.
Next, Ekaterina led us on a journey through how to read the nutrition information. The first step is to convert the colour image to a grayscale one because of the desirable properties of grayscale images (dimensionality reduction). Then, the cardinal problem that needs solving is how to represent the image as black letters on a white background. To do that, Ekaterina clustered the pixel colours into a cluster of black colours and a cluster of white colours using k-means clustering. The output of this method is an image where the background is converted to white, while the text is turned black. Here Ekaterina shared what she learned the hard way - it's better to save an image as .png, because saving it as .jpg introduces noise: JPG is a lossy compression format which keeps pixel values only close to their original value, a change undetected by the human eye, but not missed by computers.
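The binarization step can be illustrated with a toy one-dimensional k-means with k=2 over the grayscale intensities. This is a simplified sketch of the approach described above, not Ekaterina's actual code; a real pipeline would run per-pixel over the image with a library such as OpenCV or scikit-learn.

```python
def binarize(gray_pixels, iterations=10):
    """Binarize grayscale intensities (0-255) with 1-D k-means, k=2."""
    # Initialise the two cluster centres at the extremes: dark and light
    dark, light = 0.0, 255.0
    for _ in range(iterations):
        # Assignment step: each pixel joins the nearer centre
        dark_px = [p for p in gray_pixels if abs(p - dark) <= abs(p - light)]
        light_px = [p for p in gray_pixels if abs(p - dark) > abs(p - light)]
        # Update step: recompute each centre as its cluster mean
        if dark_px:
            dark = sum(dark_px) / len(dark_px)
        if light_px:
            light = sum(light_px) / len(light_px)
    # Map every pixel to pure black (0, text) or pure white (255, background)
    return [0 if abs(p - dark) <= abs(p - light) else 255 for p in gray_pixels]

pixels = [12, 20, 230, 240, 15, 250, 35, 200]
print(binarize(pixels))
# → [0, 0, 255, 255, 0, 255, 0, 255]
```

The same idea scales to full images: run k-means on all pixel values (or colours), then repaint each pixel with its cluster's extreme value before handing the result to Tesseract.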
Unfortunately, the output of this step still confuses the Tesseract engine because of the lines in the nutrition table (usually all nutrient information is provided in a table with lines separating each column, row and cell). The challenge here lay in the fact that when Tesseract sees a vertical line, it assumes it has reached a line end and starts reading the next line, skipping all textual content after the first detected vertical line. Ekaterina then presented several ways to delete black lines. The first is to remove black regions with more than 400 pixels, relying on the fact that letters usually have fewer pixels. The results were not encouraging, though. Another solution she tried, detecting uninterrupted black regions, would cut some letters in half and was overall ineffective. Only after exhausting these options did the smart decision emerge: only the vertical lines need to be removed to enhance Tesseract's reading capabilities, not all the lines. To do that, she processed the image a little more to separate it into regions of pixel colours.
Imagine a binary (black and white) image with the letters “A” and “B” on it. Separating these regions - the background, letter “A” and letter “B” - means painting all pixels of the background with one pixel value (for simplicity 0 is used, even though in pixel colours 0 is black), all pixels of “A” with value 1, and all pixels of “B” with value 2. Note that in the original binary image all background pixels are set to 255, or white, and all letter pixels, of “A” and “B”, are set to 0, or black. Back to the problem - Ekaterina wants to identify all vertical lines in the segmented image. The solution is to separate the image into smaller parts, called windows, and identify the partial vertical lines in each window. She does this by searching for pixels of the same region (1 or 2 in the example above) that lie on the top and bottom frame of each window. Having found these pixels, she draws an imaginary line between the top and bottom ones and checks whether most of the pixels lying under this imaginary line belong to the same region. If yes, the algorithm has identified a partial vertical line that needs to be deleted from the image. An interesting alternative worth exploring is Fourier analysis, which should be a familiar topic to people in signal processing.
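The windowed check above can be sketched on a tiny binary window. This simplified stand-in skips the region labelling and treats a column as a partial vertical line when it is black at the window's top and bottom and mostly black in between - the same top-to-bottom criterion, minus the region bookkeeping.

```python
def vertical_line_columns(window, threshold=0.9):
    """Find columns of a binary window (0=black, 255=white) that look
    like partial vertical lines: black on the top and bottom rows, and
    mostly black in between."""
    rows, cols = len(window), len(window[0])
    lines = []
    for c in range(cols):
        column = [window[r][c] for r in range(rows)]
        # Must touch both the top and bottom frame of the window
        if column[0] == 0 and column[-1] == 0:
            black_ratio = column.count(0) / rows
            if black_ratio >= threshold:
                lines.append(c)
    return lines

# Toy 5x5 window: column 2 is a table line, the rest is text-like noise
window = [
    [255, 0,   0, 255, 255],
    [255, 255, 0, 0,   255],
    [255, 255, 0, 255, 255],
    [0,   255, 0, 255, 255],
    [255, 255, 0, 255, 0],
]
print(vertical_line_columns(window))
# → [2]
```

Deleting the line then amounts to repainting the detected columns white before the image is passed to Tesseract.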
After the traditional Q&A session, the talk naturally continued in the Tell me bar. Take a look at the video from the presentation if you want to learn more. Don't miss our great upcoming events and projects - stay tuned by visiting our website, following our Facebook page, LinkedIn page or following our twitter account.
Summary: A meetup for experts and non-experts in the fields of Self-Organizing Data Mining, Model Based Business Games, Predictive Analytics and Business Analytics. It is a chance for everyone to get to know multi-stage selection procedures and see how they are superior to evolutionary algorithms, genetic algorithms and neural networks. The lecture will finish with a special demonstration of the modelling software Insights (previously known as KnowledgeMiner).
About the presenter: Professor Mihail Motsev is a professor in Quantitative Methods and Information Systems at the Business Faculty, Walla Walla University, College Place, WA, USA. He has authored and co-authored over 90 publications and 5 books in the area. Since 2004 Professor Motsev has been leading small scientific projects in the areas of Self-Organizing Data Mining, Model Based Business Games and Predictive Analytics. One of his main research topics is management games with imitation models based on the principles of self-organization. In 2013 he was nominated at the annual general assembly meeting of ISAGA and elected as a member of the ISAGA International Advisory Council.
Time: 23rd November 2015, 19:00
Venue: University of National and World Economy (UNWE/УНСС), Hall P035
If you are a developer, designer, media representative, civic activist or an NGO and are interested in making public data work for better government and a more participatory society, our friends from NGO Links would like to invite you to a civic hacking contest to develop data-based visualizations, applications and other products. The goal is to make use of the datasets released on the Open Data Portal of Bulgaria in order to spur the creation of tools with significant social impact. Work will be done in teams comprised of people with different areas of expertise - developers, NGOs, journalists. The winning team will receive a prize of €2000.
Creating software solutions can be fun and easy, like playing with Lego. Anybody can do it if they are given clear instructions and guided by a good mentor.
In order to encourage teamwork and innovation, two preparatory workshops will be organized on the 17th and 19th of November, during which the teams, led by their mentors, will be able to generate ideas and select the one to work on. You are free to choose the date that is most convenient for you.
We are looking forward to seeing you at the event and don't forget to bring your laptop.
The workshops are free of charge and are part of the project “Civic Platform for Open Government” funded by the NGO Programme in Bulgaria under the European Economic Area Financial Mechanism 2009-2014.
You can find more information about the events here.
Members of Data Science Society are invited to the International Science Conference “Big Data, Knowledge and Control Systems Engineering” (BdKCSE’2015), which aims to provide an open forum for the dissemination of the current research progress, innovative approaches and original research results on all aspects of Big Data Management, Technologies, and Applications.
The organizer of the BdKCSE’2015 Conference is the Institute of Information and Communication Technologies of the Bulgarian Academy of Sciences; the co-organizer is the “John Atanasoff” Union of Automatics and Informatics, Bulgaria, in partnership with Data Science Society.
The main aim of the BdKCSE’2015 Conference is to bring together researchers, students, professionals and others interested in the topic to present their work, and to help the research community identify the important contributions and opportunities for research on innovative methodologies and techniques in the Big Data field.
More about conference and the topics can be found at http://conference.ott-iict.bas.bg/
You can find a preliminary agenda for the event at http://conference.ott-iict.bas.bg/scientific-programme/
We are pleased to invite you to the Conference BdKCSE’2015 that will be held in Sofia, Bulgaria on 5-6 November 2015.
On Thursday, September 10th 2015, in the Mirror Hall of Sofia University, Professor Sabine Bergler gave a talk at the invitation of Mozaika, The Humanizing Technologies Lab, in front of a select audience, including distinguished members of the Data Science Society. Dr Sabine Bergler is a Full Professor at the Department of Computer Science at Concordia University, Montreal, Canada. She holds a Ph.D. from Brandeis University, Boston, USA, on reported speech and has degrees from the University of Massachusetts at Amherst and the University of Stuttgart. In 2002 she founded the CLAC Laboratory at Concordia, where she conducts research on computational linguistics. Among the achievements of the CLAC Lab is groundbreaking work, carried out under the direction of Dr Bergler, on sentiment analysis, on embedding predicates as a unified theoretical foundation in semantics, and on computational aspects of bioinformatics. Her students consistently win competitions on speculative language at BioNLP, on negation focus and modality at worldwide shared task challenges such as *SEM and QA4MRE, and on sentiment analysis at SemEval.
Dr Bergler presented a forward-looking perspective on the extraction of knowledge from texts, outlining a series of use cases and techniques that allow mining more than just factual information from texts, and emphasizing the necessity and impact of employing standard linguistic knowledge to achieve this. Her examples demonstrated that texts convey non-factual information such as authors’ points of view, that negation can be expressed explicitly and implicitly, and that speculative and figurative language requires a combination of features to be accounted for, with their values sometimes interpreted in an inverse way. For example, the word “positive”, which has a positive connotation as a lexical unit, is obviously negative when uttered in a context such as “positive results for breast cancer”.
She outlined the basics of the unified account her group has adopted for employing linguistic knowledge in shallow processing pipelines, and the effects of applying this method to the different use cases. One convincing effect of these linguistic-knowledge-based approaches she reported is that her team’s algorithms ranked first in pilot worldwide shared challenges in different language processing areas where no prior task-specific text corpora were available for training and testing with machine learning methods - settings in which machine-learning-based algorithms fail.
The overall talk showed that linguistic principles form a solid baseline for modular, adaptable NLP modules, and that the trigger-linguistic-scope approach to speculative language, negation and modality has proved effective. Although it renders language processing tasks less scalable, relying on syntactic parsing is feasible, even for tweets, with appropriate preprocessing steps. Finally, extra-propositional parts of text prove effective in task-oriented evaluation.
Full of interesting examples from the bioinformatics domain and from tweets, featuring the work of several CLAC students and collaborators, and conveying highly expert content in a form accessible to a non-expert audience that did not lose concentration over close to two hours, the talk was followed by a vivid discussion during the social networking event at the Krivoto over a glass of wine and a pint of beer. Many had a lot of questions.
The slides of Dr Bergler’s presentation can be consulted for more details and information.
Author: Mariana Damova, PhD
The day 20/10/2010 was the first to be officially declared by the UN and UNESCO as World Statistics Day. It was then decided that the role of statistics would be celebrated worldwide every five years.
For the forthcoming 20/10/2015, the department of probability and statistics at the Institute of Mathematics and Informatics (IMI) at the Bulgarian Academy of Sciences is organizing a seminar and a small convivial gathering at the premises of IMI. The former will be held in room 403 at IMI from 14:00, and the speaker will be Mladen Savov (IMI), presenting the topic “Spectral theory of generalized Laguerre semigroups.” The content of the presentation, albeit introductory, will touch upon several modern areas of mathematics and will cover the current research of Pierre Patie (Cornell University) and Mladen Savov (IMI). It is therefore most suitable for a mathematical audience, but anyone with an interest in science might find it interesting to discover how the process of research actually happens and what topics are under scrutiny in contemporary probability theory. The gathering, with no clear programme and organization yet, as befits mathematicians, will take place on the ground floor in the canteen of IMI from 15:30. It will be a good opportunity for the interested parties to get in touch with some of the leading experts in statistics in Bulgaria and thus expand their network.
Everybody is warmly welcome to any of the events, with the prior warning that there is a very tiny chance some of them could run on a first come, first served basis.
This Wednesday, Data Science Society went back to its roots and held a meetup in the Technical University. This was, however, the only similarity to our previous meetups. The venue was the Experian modelling laboratory at the TU, there was much more interaction between the audience and the speaker, and the topic was an introduction to an area not covered before in our events.
Our speaker was Stanimir Kabaivanov, the missing link between academia and business. While studying macroeconomics at Plovdiv University, Stanimir started working at EUROS Bulgaria as a software engineer developing a real-time operating system. He followed up with an MSc in Financial Management and a PhD in Finance from his alma mater, all while simultaneously working as a developer at EUROS, where he has been Head of Software Development for 7 years. Another remarkable fact about Stanimir is that, like a Renaissance scholar, he is successfully specializing in several very different fields - he is an Assistant Professor in Finance at his home Plovdiv University, an expert in financial econometrics and a software developer! For his talk, however, he had another topic up his sleeve - the Internet of Things.
Stanimir started his presentation by introducing the terms in the headline - Big Data describes datasets so large and complex that traditional data applications are not adequate for handling them, while the Internet of Things (IoT) is the network of physical objects embedded with electronics, software and sensors which enable them to collect and exchange data. Afterwards, Stanimir focused on the major problems in gathering data from these devices. The first one he covered is interconnectivity - every device has its own communication protocol, for example LIN, FlexRay and DeviceNET in cars. Unifying these standards is not easy due to the difference in prices of the microcontrollers that support each protocol. Stanimir gave a striking example with a chip for coffee machines - a single unit might cost less than a dollar, but the monthly production amounts to 35 million units! And a modern car might need 150 microcontrollers.
Another problem such devices face is the increasing complexity of their software - it grows by 10% to 50% every year. Code for this hardware cannot be easily optimized and often contains dead code that is never used, simply because the manufacturers demand describing in code all the possible situations the device might face, even the impossible ones! Surprisingly, complexity does not necessarily demand powerful hardware - Stan gave an example of a controller for the water level in rivers and dams that has a 48 MHz CPU and is 20 to 50 times less powerful than a modern smartphone.
Next, Stanimir focused on the security issues of IoT devices. He summed up the main challenge in this regard as "appropriate data should reach the appropriate people." For example, encrypting data is the best security solution, but it makes data transfer take a lot more time - the problem boils down to "frequent data updates vs. limited computational power".
Another issue that Stanimir discussed is the power consumption of IoT devices. They consume less power when in static regime, and more when they transmit data in dynamic regime.
Finally, Stanimir presented a possible solution to the heterogeneity problem - the OPC UA protocol. He described the idea behind it as standardization - every device can be represented in a general scheme. Stan illustrated the approach with an analogy from NLP - representing sentences as a semantic network.
As a dessert, our speaker gave a live demonstration of how to test whether software similar to that used in modern cars works. It served as an illustration of yet another unsolved problem - what to do with all the collected data. If the data is transferred in real time and stored remotely, it puts a load on the transmitting network. If it is stored locally on an SSD or flash drive, it makes the device more expensive.
As a testament to the involvement of the audience, the presentation lasted 30 minutes, while the Q&A session and the discussions afterwards took an hour. Everybody was so engrossed in the topic that, had it not been for the suggestion of Prof. Marchev Jr. to continue over a beer at the Milenkata restaurant, it could well have dragged on until midnight.
Take a look at the video from the presentation if you want to learn more. Don't miss our great upcoming events and projects - stay tuned by visiting our website, following our Facebook page, LinkedIn page or following our twitter account.
We have great news! Two of our founding and most active members, Maria Mateva and Alexander Efremov, will be lecturers at the upcoming HackConf 2015, held on 19-20 September at the National Palace of Culture in Sofia! The event is completely free, and the goal of the lecturers, among whom are the co-founders of Telerik, is to motivate all the people who want to develop professionally in the IT field.
Alexander, who is an Associate Professor at the Technical University Sofia and a Senior Business Analyst at Experian, will speak about the importance of mathematics and data science in our world. Watch his introduction on Youtube!
Maria, who is a Software Engineer at Experian and a former teaching assistant at the Faculty of Mathematics and Informatics in Sofia University, will follow up by focusing on the role and applicability of algorithms.
The target audience of the event is beginner programmers, IT professionals, students, and enthusiasts from anywhere in Bulgaria.
HackConf has a very clear goal - to be a conference for motivation and direction for all who want to develop in the IT sector in Bulgaria - http://blog.hackbulgaria.com/hackconf-2015/
Motivation is key and the organizers believe it is mostly missing in the IT industry.
Do people who study computer science and want to become programmers know why? Do they have the energy to go all the way from "a PHP web developer" to "a talent headhunted by Google and Facebook"?
According to the people behind the conference, the education in Bulgaria cannot answer these questions and their primary aim in organizing it is motivation and direction.
The topics at the conference will cover:
Programming and software development
Internet of Things & Hardware
Mathematics and Algorithms
Learning how to learn
Marketing and how to sell things that we use
Open Source Technologies
The event is free but requires prior registration at http://hackconf.bg. Updates are published also on the Facebook page of the event.
When and where can you get the best price for your travel plans? And why are there different flight prices? Why is creating a meta-search engine for flights one of the hardest problems? At the peak of the summer travel frenzy, we tried to find answers to these questions together with two software engineers from the Sofia office of Skyscanner. The office of the search site, which serves over 40 million unique visitors every month, opened in October 2014 and is growing quickly. Our speakers were Principal Software Engineer Plamen Aleksandrov and Konstantin Halachev, PhD.
Konstantin Halachev graduated from Sofia University before moving to the Max Planck Institute for Informatics in Germany to do a PhD in bioinformatics. He then briefly worked on personalization in e-commerce before accepting the challenge to play with Skyscanner's data in their newest office in Sofia.
Plamen Aleksandrov also graduated from Sofia University, and specialised in Meta-search Heuristics for Discrete Optimization in JKU Linz, Austria. He later worked on a flight search engine and discovered the nitty-gritty specifics and complexities of the Airline Distribution Industry, before joining Skyscanner in Sofia.
In a room at Betahaus packed with audience members, including a group of guest students from the University of Warwick, Konstantin and Plamen first focused on what makes flight search complex. A typical flight meta-search engine such as Skyscanner sends the user query to the websites of airlines and online travel agencies (OTAs). The results are aggregated into a single list and ranked based on the preferences of the user, e.g. the price. The challenge comes from the problem's sheer dimensionality. For example, there are 10,000 possible ways to fly from San Francisco to Boston. If we constrain our example and search only for an American Airlines round trip changing in Chicago and Dallas, there are 25.4 million valid ticket prices, out of billions of combinations. And this is just for a particular airline and route, out of a much larger search space - there are 100,000 flights per day, and 15 million queries per second.
Our speakers revealed that one of the reasons why there are so many options is called variable pricing - the prices change according to demand and seat availability and the airlines offer a portfolio of prices for the same flight. For example, cheap fares may require round trip travel, prohibit non-stop flights and ticket refunds. That's why it's almost certain that your flight neighbour paid a different price.
After this introduction, Konstantin and Plamen delved into details about their search engine, Skyscanner. It gets 120 million visits per month and serves 13 million searches per day, and as you can imagine, these numbers result in some really big data - 200 GB zipped data per day (80 TB per year). What makes Skyscanner unique among the popular search engines are two features. First, the destination is flexible - you can get a list of possible destinations from a certain airport. Not only is the destination flexible, but also the departure and return dates.
After that, our speakers presented a few intriguing applications of the gathered data. First, you can see how the price on a certain direct one-way route (say London to Madrid) changes over time - it tends to get more expensive the closer the flight date gets. You can compare these dynamics across airlines and across days of the week (a flight on Wednesday is cheaper than a flight on Friday). One can also factor in the month of travel, or combine any of these factors to research the price dynamics further.
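The first analysis above boils down to grouping quoted prices by how far in advance the search happened. A minimal stdlib sketch, with hypothetical numbers standing in for real search-log data (a production version would use pandas over far larger volumes):

```python
from collections import defaultdict

def mean_price_by_days_out(quotes):
    """Average quoted price grouped by days before departure.

    `quotes` is a list of (days_before_departure, price) pairs.
    Returns {days: mean_price} sorted by days.
    """
    buckets = defaultdict(list)
    for days, price in quotes:
        buckets[days].append(price)
    return {d: sum(p) / len(p) for d, p in sorted(buckets.items())}

# Hypothetical quotes for one direct one-way route
quotes = [(30, 60.0), (30, 70.0), (14, 85.0), (14, 95.0), (3, 140.0)]
print(mean_price_by_days_out(quotes))
# → {3: 140.0, 14: 90.0, 30: 65.0}
```

Adding the airline, the day of the week or the month of travel to the grouping key gives the cross-airline and seasonal comparisons the speakers described.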
Second, airlines may use the data to track their sales, compare them to their competitors' and get an idea of which routes are searched for the most. And if you play with the data, you can also find the hottest destinations from each airport. For example, the most popular destination from Munich is Bangkok in the winter and in July and August. In the spring, June and the autumn, however, London becomes more attractive for Munich travellers. Another application of the data is to create a "deal navigator" that, based on a range of dates, a maximum price preference and the planned length of stay, may suggest the best destinations for you.
Finally, Konstantin and Plamen demonstrated how the demand for trips to Greece was influenced by yet another debt crisis. In short, it plunged throughout Europe, except in Denmark. Curiously, in 2014 the Danish were not so interested in travelling to Greece, unlike the British, Italians, Austrians, Latvians and Bulgarians.
Take a look at the presentation if you want to learn more.
Ten lucky guests from the audience won portable battery chargers for mobile phones from Skyscanner. The lecture was followed by networking over bottles of beer, generously provided by Skyscanner as well. Don't miss our great upcoming events and projects - stay tuned by visiting our website, following our Facebook page, LinkedIn page or our Twitter account.
Last Thursday all the friends of Data Science Society had the great pleasure of listening to a talk by Preslav Nakov, hosted for the first time at Vivacom Art Hall. Preslav Nakov is a Senior Scientist at the Qatar Computing Research Institute (QCRI). He received his Ph.D. in Computer Science from the University of California at Berkeley in 2007. Before joining QCRI, Preslav was a Research Fellow at the National University of Singapore. He has also spent a few months at the Bulgarian Academy of Sciences and Sofia University, where he was an honorary lecturer. Preslav's research interests include lexical semantics, machine translation, the Web as a corpus, and biomedical text processing, and he covered the first three topics in his presentation.
Preslav introduced the field of computational linguistics (or NLP) by sharing the big dream of the people behind it - to make computers capable of understanding human language, just like the HAL 9000 computer in the movie 2001: A Space Odyssey. After going over the different services that use NLP, such as Google Translate and Siri, and the history of NLP itself starting from the 1950s, he emphasized the role of big datasets in training algorithms (the so-called statistical revolution of the 90s). It turns out that getting more data improves the accuracy of algorithms far more than fine-tuning them.
The largest available database is the Web itself, providing access to quadrillions of words. Preslav pointed out that research based on the Web has so far been restricted to using page hit counts as an estimate of n-gram word frequencies, which has led some researchers to conclude that the Web should only be used as a baseline (in the words of the late Adam Kilgarriff, "Googleology is bad science"). Preslav advocates a "smart" usage of the Web that goes beyond the n-gram, which he demonstrated by focusing on the syntax and semantics of English noun compounds. Noun compounds are sequences of nouns that function as a single noun, such as "healthcare reform". He presented several ways of uncovering their internal syntactic structure (e.g., "[plastic water] bottle" vs. "plastic [water bottle]").
Next, Preslav revealed a simple unsupervised method for determining the semantic relationship between the nouns in a noun compound - using multiple paraphrasing verbs. For example, "malaria mosquito" is a "mosquito that carries/spreads/causes/transmits/brings/infects with/... malaria". The verbs are extracted from Google results. Our lecturer showed various applications of noun compounds, the most important being machine translation. Machine translation nowadays works with phrases, and is considerably aided by the paraphrasing techniques.
Preslav discussed another related topic - different sources of data such as Google N-grams, Google Books N-grams and the Microsoft Web N-gram Service. Even though they have their shortcomings, they are superior to simply using a search engine. Yet text is not the only useful database out there - images can also help. If searching for a word in English and Spanish returns similar images, then the word likely has the same meaning in both languages.
Afterwards, our presenter shared his thoughts about the future. According to him, the next revolution in NLP should come from semantics and there are already signs for this - the number of scientific papers in the field has recently exploded. Using syntax and semantics may be a big step over the current phrase-based machine translation. Another big improvement might come from implementing deep neural networks.
Finally, Preslav involved the audience by challenging them to recognize whether a short text was written by a computer or by a human. Our audience got the author of 6 out of 8 texts right, beating even Preslav's score of 4-4. You can take the test here. The idea behind this demonstration is that computers are already capable of writing text that may be indistinguishable from text written by people. Preslav also touched on a topic that has recently become quite popular again - AI as a threat to humanity - citing the opinions of Stephen Hawking, Bill Gates and Elon Musk on the subject.
The lecture was followed by networking drinks next to the National Theater. Don't miss our great upcoming events and projects - stay tuned by visiting our website, following our Facebook page, LinkedIn page or following our Twitter account.
Author: Vladimir Labov
Yesterday, 09.07.2015, we gathered our first volunteers in the cozy atmosphere of Murphy's pub. Over 20 enthusiasts discussed how to help build our community further by being involved in the following activities:
- Higher Education data project - project manager Anton Nenov
- Website update project - project manager Sergi Sergiev
- Data visualisation tools project - project manager Kaloyan Haralampiev and presented by Angel Marchev Jr.
- Education in Data Science (Master's program and DSLabs) - project manager Alexander Efremov
- Operational aspects - events, promotion and content - presented by Iliyan Oprev, with project managers Iliyan, Sergi and Vladimir Labov
After a funny ice-breaking game orchestrated by Angel Marchev, Sergi outlined the plan for the meeting and each project manager said a few words about their initiative. Then we split into groups to discuss each project in detail - to agree on a detailed plan, tasks, collaboration tools and ways of working, next steps and resources. The beer started flowing and its brainteasing virtues helped us to come up with great and original ideas :)
We agreed to meet again on 16 July at the Eleven roof from 19:00, where each team will present their progress so far. If you intended to join us but could not, or if you are contemplating participating, don't hesitate - come to our next volunteer meeting!
We invite you to attend the conference "Open Data and Intelligent Government", organized by the Sofia Tech Park and held on 14 July from 09:30 to 19:00 at the Sheraton Hotel in Sofia. The event is free and open to everyone.
The conference aims to start the discussion on the opening of data and application of open data to the intelligent government. Examples from other European countries that have opened their data will be presented. The topic will also be discussed from a scientific point of view.
Highlights of the event:
The eighth installment of the conference "Vanguard scientific instruments in management" will be held from 9 to 13 September 2015 in the UNWE Study and Recreational Center in Ravda, on the Bulgarian Black Sea coast. The conference is technically co-sponsored by the Data Science Society. With this announcement we would like to extend our warmest invitations to participate. You can read the whole text of our invitation on the website of the conference at: http://vsim-conf.info
To apply please use the form at: http://bit.ly/vsim-conf.
Last Tuesday we at Data Science Society were delighted to organize the 3rd Sofia Open Data and Linked Data meetup. In line with our tradition of trying new venues, the event was held at the Telerik Academy thanks to our hosts from Telerik, a Progress Company. The other sponsors of our event were Ontotext and the DaPaaS research project, funded by the EC. Once again we reached several milestones:
- this was our first presentation in English, and we will strive to present in English from now on;
- we gave our guests the opportunity to ask questions online from their smartphones, tablets or laptops and to vote for questions, all in real time, via http://sli.do;
- we had two speakers coming from two different organisations but sharing a single passion - to organize the wealth of data from Wikipedia into a searchable database.
Our first speaker was Dimitris Kontokostas - one of the leading researchers in the area of Linked Data and knowledge graphs, CTO and member of the executive team at DBpedia Association and part of the Agile Knowledge Engineering and Semantic Web (AKSW) Research Group in Germany. He is currently finishing his PhD at the University of Leipzig.
Dimitris introduced the ideas behind DBpedia - to transform the unstructured knowledge in Wikipedia articles into a structured database. Information in Wikipedia consists of text, images and links, but every article can be synthesized into an infobox. Unfortunately, infoboxes do not have a common format, and one of the challenges for the DBpedia project, started in 2006, is to extract the information from this heterogeneous source and map it to a knowledge database. The database is organised as an RDF graph and Linked Data, and is queryable via the RDF query language, SPARQL.
You can find more information in Dimitris' presentation here.
Our next speaker was Vladimir Alexiev, PhD, PMP - leading expert at Ontotext in the area of ontology engineering and Linked Open Data, with over 15 years as head of R&D teams developing cutting-edge software technologies. Vladimir is currently leading projects related to the use of Semantic Technology in the cultural heritage and digital libraries domain, with organisations such as the British Museum, Europeana and Getty.
Vladimir made the audience aware of the unique challenges that Wikipedia data poses. For example, a misplaced decimal mark put unknown villages in Bulgaria at the top of the ranking of the largest residential places by surface area. Some important information is not available in the templates but only in the main text of the articles. Information about musicians was also hard to categorize - most musicians were tagged as bands. The Bulgarian DBpedia team succeeded in improving the templates in order to extract the correct information. Vladimir also gave us a practical tour of mapping data from an unstructured article into the database of bg.dbpedia.org.
The lecture was followed by networking drinks and snacks generously supplied by our friends at Ontotext. Don't miss our great upcoming events and projects - stay tuned by visiting our website, following our Facebook page, LinkedIn page or our Twitter account.
Author: Vladimir Labov
During the last 7 months, Data Science Society has progressed a lot. We organized more than 10 meetups with an average attendance of 80 people, we’ve got over 160 registered users on our website, and more than 400 different people have attended our events. During the meetups our experts covered a large variety of topics - from computer vision to semantics and NLP, analytics, machine learning and others.
We received tons of positive and constructive feedback, and we now have a clearer roadmap for the months to come. It is time for Data Science Society to take the next step and expand into new activities besides meetups, in order to promote closer collaboration between experts. We’ve got an action plan to organize a number of hackathons and workshops, to participate in conferences, and to set up project-based teams for shorter- and longer-term data science initiatives.
Currently there is a diverse team engaged in the DSS activities with senior expertise in data analysis, development, research, marketing and design. Now we want to involve a growing number of people in upcoming projects so we can stimulate and facilitate better collaboration, knowledge sharing and learning within the data science sector.
To make this happen, we are putting out a call for followers, supporters and volunteers - everybody who is willing to share and learn is welcome in our ranks! We are looking both for technical people and for people who can coordinate and support projects. To become part of Data Science Society you just need to fill in our supporter survey and tell us how you can help!
Our event marked several "firsts" - for the first time on a Monday, for the first time at a new venue, and for the first time with a speaker who has worked for Google and Facebook. We were pleased to welcome our guests at Betahaus in Sofia as part of their skill sharing initiative. And it was an even greater pleasure to have Ivan Vergiliev as our speaker. Ivan has a stunning track record at some of the hottest tech ventures around the globe. After earning a bronze medal at the International Olympiad in Informatics, he studied Computer Science at Sofia University and secured two internships at Google and one at Facebook. His professional experience includes Musala Soft, Chaos Group and more recently SoundCloud, where he created recommendation engines. Lately, Ivan has been working with Leanplum on mobile A/B testing.
Ivan shared some of the recent developments in applying deep learning to NLP problems. While neural networks have been around for a long time, recent innovations such as employing GPUs for faster computation and new ideas like convolutional networks have further advanced the field. The idea behind convolutional networks is to connect neurons only locally, as opposed to forming connections with every neuron in the neighbouring layers, which is inefficient when training the network.
Ivan demonstrated how the meaning of words can be represented as vectors, where every word is a point in a multi-dimensional space. This idea is applied when building the so-called "skip-gram" model, in which every word is a node in the input layer and the weights of the hidden layer correspond to the vector coordinates. Ivan showed how the vector offset between "man" and "woman" is very close to the offset between "king" and "queen". Even more interesting was the query engine that you can find on his blog. Playing with it can produce results that are sometimes amusing and sometimes insightful - start by entering "мъж". To see how 'adding and subtracting words' can be turned into a phrase that actually makes sense, start with a relationship like "жена" - "мъж" = "кралица" - "крал". Then if you move the word "крал" to the left-hand side and enter the expression "жена" - "мъж" + "крал", the result will be close to "кралица".
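As a minimal sketch of the 'adding and subtracting words' idea, here is the analogy query with made-up three-dimensional vectors (real word2vec embeddings have hundreds of dimensions; the numbers below are purely illustrative):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two word vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional embeddings, made up for illustration.
vec = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([1.0, 0.0, 0.9]),
    "queen": np.array([1.0, 1.0, 0.9]),
}

# "woman" - "man" + "king" should land near "queen".
target = vec["woman"] - vec["man"] + vec["king"]
best = max(vec, key=lambda w: cosine(vec[w], target))
print(best)  # queen
```

With real embeddings the offset captures the gender relation, so the nearest word to the combined vector is the analogical answer.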
Formally, several neural language models exist. Ivan gave examples with the feedforward neural network-based language model and the recurrent neural network, which is particularly good at capturing global context. Finally, he delved into new developments such as paragraph vectors and learning directly from raw data instead of using words as building blocks. The latter concept relies on convolutional networks to represent the positions of the letters.
A very interesting idea that Ivan presented is putting two languages in the same vector space in order to see the similarities in them. It turns out words with similar meanings have similar coordinates and stack closely when represented in a common space.
Ivan also revealed how deep learning can be utilized not only for text recognition but also for image recognition and tagging. Go to this page to witness how an algorithm generates sentences describing the contents of pictures - some sentences are remarkably successful, some are on the funny side.
Finally, Ivan honestly discussed how neural networks can be broken - for example in image recognition, adding a small amount of carefully chosen noise to a picture completely changes the prediction - from a panda to a gibbon.
Take a look at the presentation link below and the full video record from the event if you want to learn more.
The lecture was followed by the traditional networking drinks, this time conveniently in the cozy Betahaus bar. Don't miss our great upcoming events and projects - stay tuned by visiting our website, following our Facebook page, LinkedIn page or our Twitter account.
Author: Vladimir Labov
A series of new open data releases is starting in Bulgaria; the datasets are published on a portal: https://opendata.government.bg/.
At the moment there are 33 datasets from 20 institutions, and new uploads arrive almost every day. Among them you can find lists of tour operators, travel agents, entertainment and food venues, data on EU-funded projects and many others.
These data sources can be used by IT specialists, businesses and NGOs for analyses, infographics, visualisations and mobile applications.
This great initiative provides an opportunity for better public awareness and control, and also stimulates new and different kinds of businesses.
Good luck to the project, and stay tuned :)
The Institute of Information and Communication Technologies (IICT) at the Bulgarian Academy of Sciences is opening its doors on 17 and 18 April to present the AKOMIN project at the IICT buildings (Acad. G. Bonchev St., blocks 2 and 25A).
Topics ranging from nanoelectronics simulation and 3D printing to cultural heritage objects will be covered.
For more details you can refer to the attached program.
Last Wednesday, Data Science Society and its friends delved deeper into the field of information retrieval. The topic this time was how to extract information from music, while the Faculty of Mathematics and Informatics at Sofia University continued the tradition of hosting our machine-learning oriented events. The venue bears significance to our speaker as well - Petko Nikolov graduated in Informatics from the same faculty. His interest in machine learning developed during his MSc studies in AI at the University of Edinburgh. He had the chance to apply what he learned in practice at SoundCloud and later at Leanplum and HyperScience.
Petko shared with us his experience with music information retrieval (MIR), acquired at SoundCloud. Similar algorithms are employed at Spotify, Pandora, Shazam and SoundHound. The most basic form of music retrieval extracts musical notes from the audio signal. The digital audio signal itself is a sequence of numbers sampled from the electrical voltage representing sound waves. Petko explained that the first step in MIR is to segment the signal into even intervals (frames), typically lasting 52 ms each. Between each pair of non-overlapping frames, an overlapping frame is inserted covering half of the previous and half of the subsequent frame. Once this information is obtained, it is converted from information over time to information about frequencies. This is achieved by a classic numerical algorithm, the Discrete Fourier Transform (DFT). With the data in this convenient form, one would like to capture musical characteristics such as timbre, tempo, rhythm and acoustics at a local level. A way to approximate them is by extracting each frame's statistical properties - for example, the centre of mass of the spectrum, the slope coefficient of a linear regression, or the spectral correlation between the frequencies of two consecutive frames, which is useful for distinguishing between slow music like classical and fast music like rock. These features (variables) are the cornerstone of the models applied in music information retrieval.
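The framing and spectral steps just described can be sketched in a few lines. This is a minimal illustration using NumPy: the 52 ms frame length and half-frame overlap follow the description above, while the 8 kHz sample rate and the pure sine-wave input are assumptions made for the example:

```python
import numpy as np

def frames(signal, frame_len, hop):
    """Split a signal into frames; hop = frame_len // 2 gives the
    half-overlapping frames described above."""
    n = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

def spectral_centroid(frame, sample_rate):
    """'Centre of mass' of the magnitude spectrum of one frame."""
    spectrum = np.abs(np.fft.rfft(frame))             # DFT magnitudes
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sample_rate)
    return (freqs * spectrum).sum() / spectrum.sum()

# One second of a 1 kHz sine sampled at 8 kHz; 52 ms frames as in the talk.
sr = 8000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 1000 * t)
frame_len = int(0.052 * sr)                           # 416 samples
f = frames(signal, frame_len, frame_len // 2)
print(round(spectral_centroid(f[0], sr)))             # prints 1000
```

For a pure tone the spectral centroid sits at the tone's frequency; for real music it summarises where the energy of the spectrum is concentrated, which is one of the frame statistics feeding the classifiers.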
From this point on, you can apply most machine learning algorithms for classification - for example neural networks, support vector machines or random forests, the last of which is applied at SoundCloud. Petko introduced a not-so-popular concept - the Gaussian Mixture Model (GMM). A mixture model represents the presence of subpopulations within an overall population - in our context, styles of music. The algorithm builds a representation of a track by maximizing the likelihood of its frames being generated from the model's distribution. Each sub-population has its own probability density and, as you can guess from the name, in a GMM the densities are modelled as Gaussians. In the MIR field the model is also known as the Universal Background Model.
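A minimal sketch of this track-representation idea, assuming scikit-learn's `GaussianMixture` and made-up two-dimensional frame features standing in for the real spectral ones:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical 2-D frame features (e.g. spectral centroid and slope)
# drawn from two sub-populations, standing in for two styles of music.
slow = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(300, 2))
fast = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(300, 2))
all_frames = np.vstack([slow, fast])

# Fit a two-component GMM to the pooled frames.
gmm = GaussianMixture(n_components=2, random_state=0).fit(all_frames)

# A track can then be scored by the average log-likelihood of its
# frames under the fitted model - frames resembling the training
# population score high, unfamiliar ones score low.
track = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(50, 2))
print(gmm.score(track))
```

Tracks whose frames are likely under the mixture get high scores, which is the sense in which the GMM "represents" a collection of frames.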
Petko introduced another cutting-edge methodology - deep learning. It builds on neural networks but has many more hidden layers, and the input is kept as raw as possible. The standard neural network approach using backpropagation doesn't work so well in such architectures because the gradient fades quickly. The first step is to derive the so-called mel-spectrum by aggregating the frequencies onto a logarithmic scale that corresponds to the human perception of sound. Then a deep belief network is employed - a form of unsupervised learning that tries to find those values of the neurons' weights that most closely approximate the input data (the mel-spectrum). A variant of the deep belief network is the deep autoencoder, where the output is not the music style but the mel-spectrum itself. Deep autoencoders are used for denoising in situations where the signal-to-noise ratio is too low.
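The mel-spectrum step relies on a frequency-to-mel mapping; one common convention (the HTK formula - other variants exist) looks like this:

```python
import numpy as np

def hz_to_mel(f):
    # The HTK convention for the mel scale: 1000 Hz maps to ~1000 mel.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse mapping, used to place filter-bank edges back in Hz.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Ten equal-width bands on the mel scale between 0 and 8000 Hz:
edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 11)
edges_hz = mel_to_hz(edges_mel)
print(np.round(edges_hz).astype(int))
```

Equal steps on the mel scale correspond to ever wider steps in Hz, which is exactly the logarithmic aggregation matching human pitch perception.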
During the lively Q&A session after the presentation, Petko discussed how overfitting is tackled in deep learning - by keeping the weights low with regularisation, by inserting random noise and by testing the models on a validation set. Our speaker also shared that feature extraction is the most computationally intensive part of music information retrieval - it takes 10 seconds per 4-minute track, and a typical database contains 100 million tracks! His colleagues at SoundCloud employ C++ for feature extraction and Python for classification purposes.
After so much information in so little space, you certainly have a lot of questions. Some of the answers may be found by taking a look at the presentation and the audio record from the event. A link to the video will appear as soon as the video is uploaded by our collaborators.
The lecture was followed by the traditional networking drinks. Don't miss our great upcoming events - stay tuned by visiting our website, following our Facebook page, LinkedIn page or our Twitter account.
Author: Vladimir Labov
Last Wednesday, the data science aficionados in Sofia enjoyed another exciting event brought to them by Data Science Society. This time the presenter believes so much in what he presented that he is building his startup on it. Christian Mladenov introduced the software and programming language R in the cozy atmosphere of the Eleven roof - the co-working space for start-ups in Sofia. Like most of our speakers, Christian has a diverse background - he earned his BSc in Business Administration at Hogeschool INHOLLAND and his MSc at RSM in the Netherlands. He gained experience as a software developer at Fredhopper, a marketing expert at Agilent Technologies, a product manager and business analyst at HP in the Netherlands and in Bulgaria, and a compliance intern at UBS in London. Currently he is running his own business as the co-founder of Intuitics.
At Intuitics, Christian and his team are developing tools for building intuitive web applications for data analysis. At the core of this effort stands R - an open-source statistical environment with its own programming language, which supports meta-programming. Christian revealed that R started as an academically oriented project but has lately been gaining popularity in business circles. This is due to the plethora of extra packages developed by enthusiasts, the widely available community support, the quality of its charts and the opportunity to code every step of the analysis process in its fully customizable programming language. Its capabilities have spawned a rich ecosystem of graphical interface applications, commercial applications, packages dedicated to data analysis and so on. Christian compared R to some of its competitors and, in his view, only Python comes close. He mentioned that companies like Google, Facebook, Amazon, Microsoft, Dell and HP use R, sometimes for prototyping solutions before implementing them in Java or Python. Unfortunately, serious drawbacks for big data are the in-memory limit on dataset size and the lack of multithreading support.
After this introduction, our speaker demonstrated the main features of the programming language. As in other languages, R has objects that have classes, and functions that manipulate the objects. Unlike most languages, R lets you assign a value to a function call. Among the most useful objects for data analysis are vectors and matrices. A very interesting concept in R is the list object - a collection of other objects, possibly of different classes, in the same list. Most data analysis is conducted using data frames - a concept similar to tables in SQL and Excel.
Christian gave a practical example of how to use R for analyzing a dataset of wines. The purpose of the exercise was to determine how the chemical properties of the wines affect their quality. He started by loading the datasets into R, and demonstrated how to manipulate them by adding new variables, merging with other datasets, and filtering out rows and columns. The graphical capabilities of R are its centrepiece, and Christian gave us examples of how to utilize them by preparing the data for plotting and building the plots. He also acquainted the audience with packages for neat summary tables and correlation tables.
In trying to infer a wine's rating from its properties, our speaker showed us two approaches in R - linear regression and decision trees. For the latter, he grouped the ratings into three groups and explained that classification models might work better for datasets where the target variable is clustered around a few values, as is the case here.
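The talk itself used R; purely as a language-agnostic sketch of the same two approaches, here is how they might look with scikit-learn on a made-up stand-in for the wine data (all variables and numbers below are illustrative, not the actual dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Made-up data: two "chemical properties" driving a quality rating.
X = rng.normal(size=(200, 2))
quality = 5.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.3, size=200)

# Approach 1: linear regression on the raw rating.
reg = LinearRegression().fit(X, quality)
print(reg.coef_)  # recovers roughly [1.5, -0.8]

# Approach 2: bucket the ratings into three groups and fit a
# decision tree classifier instead.
groups = np.digitize(quality, bins=np.quantile(quality, [1 / 3, 2 / 3]))
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, groups)
print(clf.score(X, groups))  # training accuracy
```

The grouping step mirrors the talk's point: when the target clusters around a few values, a classifier over the groups can be a better fit than a regression on the raw rating.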
The lecture was followed by the traditional networking drinks at the bar of Eleven. Don't miss our great upcoming events - stay tuned by visiting our website, following our Facebook page, LinkedIn page or our Twitter account.
Author: Vladimir Labov
Last Wednesday, we at Data Science Society were happy to organize yet another successful event. This time we were honored to have Maria Mateva as our speaker in her alma mater, the Faculty of Mathematics and Informatics at Sofia University. There she obtained her BSc degree in Computer Science and her MSc degree in Artificial Intelligence. Her professional path is much more diverse - Maria has worked as a software developer at VMware, Comverse, Ontotext and Experian.
What Maria shared with us stems primarily from her experience at Ontotext and her position as a teaching assistant in Information Retrieval at FMI. As Maria explained, information retrieval is mostly about finding text documents within large collections by predetermined criteria - exactly the task that Google Search performs. In order to find documents efficiently, you need to preprocess them. Using NLP preprocessors, explained in detail in a previous lecture by Yasen Kiprov, all the key words (terms) are extracted from each document and form a dataset (matrix) where the terms are the observations (rows) and the documents are the variables (columns). From there, the simplest way to satisfy a query on the documents is to just return all the documents containing the searched term. This is what the Boolean retrieval model is all about - assigning 1 to the columns containing the term, and 0 to those that don't.
As appealing as this approach may be, it has a critical drawback - it doesn't rank the documents. As Maria pointed out, to overcome this we need a metric of how specific each term is to each document. This can be achieved by assigning weights to each term-document pair, which is what the term frequency - inverse document frequency (TF-IDF) metric is for. The term frequency promotes terms that occur often in the specific document, while the inverse document frequency factor punishes terms that occur in most of the documents, tending to zero in that case. The resulting TF-IDF score (weight) is higher for documents that are more relevant to the query.
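A toy illustration of this weighting (the three-document corpus below is made up; a real system would use the preprocessed terms described above):

```python
import math

# A tiny corpus of already-tokenized documents.
docs = [
    "health care reform".split(),
    "health insurance market".split(),
    "wine market report".split(),
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)        # frequency within the document
    df = sum(term in d for d in docs)      # documents containing the term
    idf = math.log(len(docs) / df)         # tends to 0 for ubiquitous terms
    return tf * idf

# "care" appears in only one document -> high weight there;
# "market" appears in two of three -> lower weight.
print(tf_idf("care", docs[0], docs))
print(tf_idf("market", docs[1], docs))
```

Ranking the documents by this score for each query term is what lifts the Boolean model's flat 0/1 answers into an ordered result list.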
As often happens in data science, there are several ways to solve a problem, and this is no exception - Maria demonstrated a way to compare documents by representing them as vectors in order to obtain their cosine similarity in the vector space.
Cosine similarity proves useful in facing another challenge - automatically recommending items to users, for example friends to follow, products to buy online (on Amazon), videos to watch (on YouTube), or articles to read. The first approach that Maria showed us is collaborative filtering. The idea is to recommend items that similar users preferred, because users with similar ratings most probably have similar taste and will rate items in a common fashion. The first step in the approach is to standardize each user's ratings by subtracting the user's average rating from each of their ratings. From that point, you can apply two methods - user-to-user or item-to-item filtering - but both can make use of cosine similarity. In user-to-user filtering, for each user we need to find the most similar users and predict the user's taste based on those users' ratings. In item-to-item filtering, we find items similar to each other, based on the ratings. Maria explained that the latter approach does a better job, because items have more constant behaviour than humans.
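The mean-centering and similarity steps can be sketched as follows (the user-item rating matrix is hypothetical, with 0 marking unrated items):

```python
import numpy as np

# Rows: users, columns: items; 0 means "not rated".
R = np.array([
    [5.0, 4.0, 1.0, 0.0],
    [4.0, 5.0, 2.0, 1.0],
    [1.0, 2.0, 5.0, 4.0],
])

# Step 1: centre each user's ratings around their own average,
# computed over rated items only.
mask = R > 0
means = (R * mask).sum(axis=1) / mask.sum(axis=1)
C = np.where(mask, R - means[:, None], 0.0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Step 2 (user-to-user): user 0's taste agrees with user 1's and
# opposes user 2's, so user 1's ratings drive predictions for user 0.
sims = [cosine(C[0], C[u]) for u in (1, 2)]
print(sims)
```

Item-to-item filtering uses the same machinery, just with the columns of the centred matrix instead of the rows.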
Unfortunately, collaborative filtering has two problems jointly known as the "cold start" problem - no information about the preferences of new users, and no ratings for new items. The latter can be resolved by the second major recommendation approach - the content-based one. The idea is to build a user profile based on the content of rated items. This profile can be represented by a vector of weights in the content representation space. Then the user's profile can be examined for proximity to items in this space. As Maria pointed out, this is where the vector space model for comparing documents comes into play, with a twist - the user profile can be viewed as a dynamically updated document.
But every solution brings new problems and challenges to solve - with millions of users and items, the size of the matrix utilized leaves us with a big data problem. Maria revealed the answer to this problem - latent semantic indexing, which finds relations between the objects thanks to the Singular Value Decomposition (SVD) algorithm. It alleviates the memory consumption problem by substituting the original high-dimensional matrix with three much smaller auxiliary matrices. The model also helps cleanse noise from the original vector space.
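A tiny illustration of the idea, assuming NumPy: the SVD factors the term-document matrix, and keeping only the top singular values gives the low-rank latent space that replaces the original matrix:

```python
import numpy as np

# Tiny term-document matrix (rows: terms, columns: documents).
A = np.array([
    [1.0, 1.0, 0.0, 0.0],   # "health"
    [1.0, 0.0, 0.0, 0.0],   # "care"
    [0.0, 1.0, 1.0, 0.0],   # "market"
    [0.0, 0.0, 1.0, 1.0],   # "wine"
])

# Factor A = U * diag(s) * Vt; truncating to the top-k singular
# values leaves three small factors that stand in for A.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The rank-2 approximation keeps most of the structure of A.
print(np.round(A_k, 2))
```

For a real corpus with millions of rows and columns, storing the truncated U, s and Vt is vastly cheaper than storing the full matrix, which is exactly the memory saving described above.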
After so much information in so little space, you certainly have a lot of questions. Some of the answers may be found by taking a look at the presentation and the full video from the event (speech in Bulgarian).
The lecture was followed by the traditional networking drinks. Don't miss our great upcoming events - stay tuned by visiting our website, following our Facebook page, LinkedIn page or our Twitter account.
Written by: Vladimir Labov
In our second event for the year, held on 11.02.2014, the presenter was Vladimir Labov, who is coincidentally also the writer of these pieces. Writing about yourself in the third person is something reserved for megalomaniacs, so I will humbly switch to a first-person narrative from this point on. My path to the lecture room of the Technical University was a bumpy one, a path of twists and turns. After obtaining my double degree in Finance and Accounting from the University of Economics Varna, I was set firmly on the path to become a typical financial professional. But during my studies for an MSc in Financial Economics at Erasmus University Rotterdam, I was captivated by the world of risk management. Out of nowhere came the job offer for a Quantitative Analyst at First Investment Bank, a stint that taught me a lot about risk modeling, among other things. The nostalgia for Western Europe still had a firm grip on me, so I first decided to do a PhD in Finance. Due to the lack of time and will to come up with a proposal for a top UK university, I decided to give an inexpensive Master's program in Computing Science at Imperial College London a go, but declined the offer after being invited for a job in a debt collection company, Cabot Financial. The job turned out to be a mundane one, and six months, a second of distraction, one car hitting me and one broken leg after I started it, I went back to my roots at FiBank, covering pretty much all areas of financial risk management, for which my FRM qualification stands as a testament.
But enough about yours truly, let's get back to business. In the Wednesday presentation, I revealed solutions, not found in any popular textbook, that I have applied to problems faced while developing credit risk models. Credit risk models predict whether a borrower will pay their loan back, and people like me in effect decide whether you will be approved for a credit card or a mortgage loan. Statistical classification algorithms work best for such problems with a binary outcome. In particular, logistic regression is preferred in practice because it quantifies a probability of default between 0 and 1.
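A minimal sketch of that idea, with made-up data (a single borrower feature and a default flag) and a hand-rolled gradient descent fit rather than any real scorecard pipeline:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up data: one borrower feature (say, a debt-to-income ratio) and a
# binary default flag. Higher values of x make default more likely.
rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = (x + rng.normal(scale=0.7, size=500) > 0.5).astype(float)

# Fit a univariate logistic regression by plain gradient descent.
w, b = 0.0, 0.0
for _ in range(2000):
    p = sigmoid(w * x + b)
    w -= 0.5 * np.mean((p - y) * x)
    b -= 0.5 * np.mean(p - y)

# The fitted model outputs a probability of default strictly between 0 and 1.
pd_estimates = sigmoid(w * x + b)
```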
As in any data analysis, the inevitable data-related problems show their ugly face. The most prevalent of these is missing values. Instead of discarding the affected observations, I have found that you can assign them to the group with the closest default rate, or to the most logical group. If all else fails, you can still keep the observations by transforming the raw values into weights of evidence. This approach is a solution to several other common problems as well.
Outliers have long plagued the work of statisticians. In a credit risk environment, you can avoid trimming these observations by applying weights of evidence, which place all extreme values in the marginal bins, where they carry the same weight as the other, not-so-extreme values in those bins.
Working with the weights of evidence transformation has other side benefits as well. First, it deals elegantly with categorical variables. Otherwise, you are left wondering what to do with 3 significant dummy variables and 2 insignificant ones created from a variable with 5 categories. Second, weights of evidence take care of wrong regression signs for you - if everything is in line with economic logic, then all coefficient signs should be negative for WoE-transformed variables.
If you are thinking all this is too good to be true, indeed the WoE approach comes at a cost - you need to split the numerical variables into bins. My personal solution is to split every numerical variable into 10 deciles, check whether the default rates in the bins change monotonically and/or in a logical way, combine bins that don't differ much in terms of default rates, and play with the cut-offs of bins that spoil the perfect picture of monotonically changing default rates.
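The decile-binning step can be sketched as follows. The data, the +0.5 smoothing constant and the helper name `weight_of_evidence` are illustrative assumptions, not the exact procedure used in the models:

```python
import numpy as np

def weight_of_evidence(values, defaults, n_bins=10):
    """Split a numeric variable into quantile bins and compute the weight
    of evidence ln(%good / %bad) per bin. A sketch of the decile approach,
    not a production scorecard tool; the +0.5 smoothing is an assumption
    added so that empty cells never produce log(0)."""
    edges = np.unique(np.quantile(values, np.linspace(0, 1, n_bins + 1)))
    bins = np.digitize(values, edges[1:-1])  # bin index 0 .. len(edges) - 2
    n_good = float((defaults == 0).sum())
    n_bad = float((defaults == 1).sum())
    woe = {}
    for b in range(len(edges) - 1):
        good = ((defaults == 0) & (bins == b)).sum()
        bad = ((defaults == 1) & (bins == b)).sum()
        woe[b] = np.log(((good + 0.5) / n_good) / ((bad + 0.5) / n_bad))
    return bins, woe

# Illustrative data: lower incomes default more often.
rng = np.random.default_rng(1)
income = rng.lognormal(mean=7, sigma=0.5, size=1000)
defaults = (rng.random(1000) <
            0.05 + 0.30 * (income < np.median(income))).astype(int)
bins, woe = weight_of_evidence(income, defaults)
```

Inspecting the resulting per-bin WoE values is exactly the step where you decide which adjacent bins to merge and which cut-offs to move.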
Apart from the weight-of-evidence all-around solution, I also revealed how to tackle the persistent problem of multicollinearity. Instead of dropping the weaker variables, I combine the correlated variables into a new one. A solution similar in spirit is applied for taking into account both the income and indebtedness of a loan applicant - you end up working with the disposable income by subtracting the debt payments.
Next in my lineup for the evening were problems typical of Bulgarian scorecard data. You can request a declaration from the employer for applicants with an unverifiable salary higher than the officially declared one. And a clever way to detect the current residence of applicants is to take the branch where they submitted the application.
Finally, I shared a secret discovery with my audience. Take a look at the presentation to find out what it is!
The lecture was followed by the traditional networking drinks. Don't miss our great upcoming events - stay tuned by visiting our website, following our Facebook page or following our Twitter account.
Written by: Vladimir Labov
In our first event for 2015, held this Wednesday, 21.01.2015, Yasen Kiprov expanded our frontier even further by introducing the field of Natural Language Processing (NLP). Yasen has climbed the academic ladder at Sofia University - from a Bachelor's in Informatics through a Master's to PhD studies in Artificial Intelligence - and has gained valuable experience as a software developer at Telerik, Melon, Ontotext, CNsys and BakeForce before landing his current job at Sentiment Metrics.
Yasen started by explaining what NLP is all about - enabling computers to derive meaning from text. This is at the heart of Google Translate, Google Search and smartphone assistants such as Siri. A more specific application is classifying text into different categories - this is how e-mail services such as Gmail and Yahoo detect spam, and it lies at the core of sentiment analysis.
Sentiment analysis is used to determine the writer's attitude and the emotions contained in a text. One simple approach to automatically deducing something about a text is to collect all criteria in the form of IF-ELSE statements, often designed by experts in the field, and assign the text to a category if it complies with them. This idea faces hurdles when applied outside narrow tasks.
This is where Yasen's explanation took a turn toward supervised machine learning and its two main representatives - linear and logistic regression. When applying them to the text classification problem, the target variable is the sentiment of a given document, determined by human experts who have read it. The independent variable is a vector over the vocabulary (a bag-of-words representation) in which each word or useful word combination is represented by its count in the document analyzed.
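A toy version of that document representation, using only the standard library; the four labelled snippets are invented for illustration:

```python
from collections import Counter

# Tiny labelled corpus - in practice the sentiment labels (1 = positive,
# 0 = negative) come from human annotators.
docs = [
    ("great product , works great", 1),
    ("terrible quality , broke fast", 0),
    ("love it , great value", 1),
    ("awful , terrible support", 0),
]

# Vocabulary over the whole corpus (punctuation tokens dropped).
vocab = sorted({w for text, _ in docs for w in text.split() if w.isalpha()})
index = {w: i for i, w in enumerate(vocab)}

def to_counts(text):
    """Represent a document as a vector of word counts over the vocabulary."""
    counts = Counter(w for w in text.split() if w.isalpha())
    return [counts.get(w, 0) for w in vocab]

# Feature matrix and target vector, ready for a linear/logistic regression.
X = [to_counts(text) for text, _ in docs]
y = [label for _, label in docs]
```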
Here Yasen gave a typical illustration of the most common problem in data science - data quality. The text sentiment (the target variable) in the training sample is determined by human annotators, who often don't agree on what the overall sentiment of a text is. On top of that, subtle concepts like meaning in context and irony are notoriously difficult for an algorithm to catch.
But not all hope is lost, and Yasen enlightened the audience at the end of his presentation by introducing the concepts of word representations and artificial neural networks in the field. The first uses analogies to detect word pairs similar to already established pairs and thus overcomes some shortcomings of the simple count-vector representation. The neural network approach can render some impressive results as usual, but at the cost of transparency - it's difficult to explain to others, and to yourself, what really happens inside this black-box mechanism.
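The analogy trick can be illustrated with toy vectors. The 2-D coordinates below are made up for the sketch - real word representations are dense vectors with hundreds of dimensions learned from large corpora:

```python
import numpy as np

# Made-up 2-D word vectors, chosen so that the "gender" and "royalty"
# directions are consistent across the pairs.
vec = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
}

def closest(target, exclude):
    """Vocabulary word whose vector lies nearest to the target point."""
    return min((w for w in vec if w not in exclude),
               key=lambda w: np.linalg.norm(vec[w] - target))

# The classic analogy: king - man + woman lands near queen.
analogy = closest(vec["king"] - vec["man"] + vec["woman"],
                  exclude={"king", "man", "woman"})
```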
The lecture was followed by the traditional networking drinks, for the first time held conveniently at the bar in the presentation room. Our hosts from Questers were even kind enough to provide free drinks! If you do not want to miss such opportunities in the future, join us for our upcoming event - stay tuned by visiting our website, following our Facebook page or following our Twitter account. For those of you who want to delve even deeper into the subject, you can find the presentation slides here.
Written by: Vladimir Labov
Our last meetup before Christmas was a venture into a new venue, a new audience and a new, challenging field, all at the same time, thanks to Angel Marchev, Jr. and his father Angel Marchev. The presenter, Angel Marchev Jr., is an assistant professor at the University for National and World Economy, where he also graduated with distinction with a BSc in Finance and defended his PhD. In between, he managed to obtain an MSc degree at the Burgas Free University and gain experience as a bank analyst. Marchev Jr. is a fourth-generation lecturer – his co-author and father, Prof. Angel Marchev, provided valuable insights during the discussion.
The event itself was held at the alma mater of Marchev, Jr. – UNWE, which attracted visitors new to our meetings. They witnessed an ambitious project determined to leave a trail in an overcrowded field – Portfolio Management Theory. The goal is to bring the tools and techniques of the science of Cybernetics into the portfolio management process. To do this, the authors reformulated the investment portfolio problem as a cybernetic system in which the investor is the controlling system and the portfolio is the controlled system. Another building block of the approach is the introduction of models of investors – simulations of the behavior of an imaginary investor following a certain, initially defined investment strategy. The strategies range from investing in a single security, through investing an equal amount into each available security, to the classical Markowitz optimization and the model invented by Prof. Angel Marchev, the Multi-stage selection procedure. The investor models based on these strategies are all tested on the same historical data – all instruments traded on the Bulgarian stock exchange between 1997 and 2011.
The nature of the selected data poses a great challenge typical of data science – missing data due to poor trading activity. The authors went through a process of meticulous data cleaning in order to obtain usable data – most typically by trimming outliers and imputing the missing quotes using the last-value-carried-forward approach. As usual, a single best approach to this problem does not exist, as Marchev Jr. illustrated with the distorted results of the Markowitz optimization strategy caused by spikes in the returns of certain securities. These spikes were responsible for crashing the model and were remedied only after applying linear interpolation to impute the missing data.
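The two imputation approaches mentioned can be sketched in a few lines of plain Python; the five-day price series below is a made-up example with `None` marking days without trades:

```python
def locf(series):
    """Last value carried forward: fill each missing quote (None) with
    the most recent observed one."""
    filled, last = [], None
    for v in series:
        last = v if v is not None else last
        filled.append(last)
    return filled

def linear_interpolate(series):
    """Fill internal gaps with a straight line drawn between the
    observations on either side of the gap."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            filled[i] = series[a] + t * (series[b] - series[a])
    return filled

# Made-up price series with missing quotes on thinly traded days.
prices = [10.0, None, None, 16.0, 17.0]
```

LOCF produces a flat segment (10.0, 10.0, 10.0) and then a jump to 16.0, while interpolation smooths the gap (12.0, 14.0) - exactly the difference that decided which strategy results were distorted by spikes.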
The models of investors were compared based on the risk-weighted return that the strategy yielded for every investment horizon. Overall, the best-performing strategy is the Multi-stage selection procedure of Marchev Sr., which is also the most computationally intensive.
On top of all the information condensed into two hours of presentation, the audience also learned an interesting finding – all the popular investment strategies can be broken down into building blocks: predictors, a solution generator and a solution selector. This makes it possible to recombine these elements and obtain totally new strategies – a process the authors call “heuristic restructuring”.
For the readers that already bitterly regret missing this great presentation – not all is lost. Thanks to our innovative and tech-oriented team, a complete video recording is available on YouTube and the slides are uploaded on our website and SlideShare. Stay tuned for updates, visit our website, follow us on Twitter and Facebook!
Written by: Vladimir Labov
This Wednesday, 26.11.2014, Georgi Botev immersed us in the fascinating field of computer vision - an area that combines diverse disciplines like Machine Learning, Artificial Intelligence and Neurobiology. Not surprisingly, Georgi also comes from a diverse background - he earned his Bachelor's degree in Economics and his Master's in Statistics at Sofia University, then delved into the world of credit risk prediction models at Experian before joining StatSoft to further pursue his ambitions in data science.
Georgi guided his audience in a classroom at the Faculty of Mathematics and Informatics through the path leading to a better algorithm for image recognition by computers. As often happens with important discoveries, it is a path of trial and error, of gradual improvement and, as Georgi reminded us at the end of the lecture, of contributions made by outsiders to the field. His first attempt was at facial recognition by detecting the coordinates of facial features such as the eyes, the nose, the mouth. The algorithms employ the coordinates for the images in the training set to recognize the faces in the out-of-sample images. Unfortunately, this approach does not cope well with images that differ a lot in resolution from those in the training set.
Then Georgi took a shot at the problem from a different angle, by measuring the intensity of each pixel. The idea is that some parts of the face are darker than others; for example, the pupil and iris are darker than the rest of the eye. Using the intensity level of each pixel (between 0 and 255 for greyscale images), you can create templates for each facial part from the difference between the sum of pixel intensities within the part and the sum over an adjacent area. Most face finders, including Facebook's, employ similar algorithms because they are fast enough.
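This difference-of-sums idea resembles the Haar-like features popularized by the Viola-Jones face detector. A minimal sketch on a made-up 4x4 greyscale patch (the function name and the numbers are illustrative):

```python
import numpy as np

def two_rect_feature(img, top, left, h, w):
    """Difference between the pixel sum of an upper rectangle and that of
    the rectangle directly below it. Dark-over-light patterns (an eye
    above a cheek) produce strongly negative values."""
    upper = img[top:top + h, left:left + w].sum()
    lower = img[top + h:top + 2 * h, left:left + w].sum()
    return float(upper - lower)

# Toy greyscale patch (0 = black, 255 = white): dark band over a light one.
patch = np.array([
    [ 20,  20,  20,  20],
    [ 30,  30,  30,  30],
    [200, 200, 200, 200],
    [210, 210, 210, 210],
])
score = two_rect_feature(patch, top=0, left=0, h=2, w=4)
```

Such rectangle sums can be computed in constant time from an integral image, which is what makes detectors built on them fast enough for production face finders.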
Another advance in the area that Georgi described in detail is the use of neural networks. These machine learning algorithms approximate the way the neurons in the human brain work. Just as human vision is enabled by the photoreceptor cells in the retina that catch the light, so computer vision relies on sensors. Artificial neural networks are organized in layers similarly to the neurons in the brain, and each neuron or set of neurons is responsible for recognizing a different feature of the object. The newest generation of neural networks simulates the sparse activation behaviour of brain neurons - only 3-5% of them are active at any time.
At the moment, neural network models are good at recognizing the particular object they were trained for, but the real challenge lies in designing Artificial Intelligence that emulates the way humans are able to recognize a new object. Such a self-training system might be able to detect the important objects in a film – those that appear, say, over 1000 times. Georgi gave an example with developments in the area of optical character recognition (OCR), where ANN algorithms beat humans by a small margin in identifying handwritten characters.
The lecture was followed by the traditional networking drinks, this time over a piece of pizza in a pizzeria nearby. Join us for our upcoming event - stay tuned by visiting our website, following our Facebook page or following our Twitter account.
Written by: Vladimir Labov
This Wednesday we were delighted to host the presentation of Peter Manolov, PhD. After graduating in Mathematics at Sofia University, Peter got his PhD degree at the University of Illinois at Chicago. He continued his career with stints at the Institute for Mathematics and Informatics at the Bulgarian Academy of Science and at Experian, where he worked on credit risk modelling. The highlight of the evening, however, was his experience as a quant in a hedge fund. There he was engaged in software support, data processing and, most importantly, modelling the financial markets.
In his presentation, Peter revealed how important news and its interpretation are for making money from trading on the stock exchange. Traders try to buy low after any market overreaction to bad news and sell high when the price moves back to its pre-news level. The amount of information released every day about companies, the economy and the markets themselves is enormous and difficult for a trader to process and interpret. This is why some hedge funds try to predict stock prices by employing models. Peter gave an example of a model for predicting stock volume. His model predicts the volume of individual stocks by selecting a candidate variable set with a sample size of 10 trading days and estimating regressions on it. He used a moving-window technique, moving the sample window one day ahead in order to gather enough forecasts (typically 1000) for the daily volume of each company. He further aggregated the forecasts for all the companies on the market and compared them to the observed values via the mean squared error and similar statistics. This error is then compared to the error of another model with a larger sample size (e.g. 11 trading days), and so on, until the model with the smallest prediction error for this variable set is found. The algorithm is repeated with different variable sets as well, until the ultimate money-making model is found. But as the markets are constantly changing, so are the models - they are constantly re-estimated to keep track.
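The moving-window idea can be sketched with a simple rolling linear trend. Everything here is an assumption for illustration - the real model's candidate variables and regressions were not disclosed in this detail:

```python
import numpy as np

def rolling_forecasts(volume, window=10):
    """One-step-ahead forecasts from a rolling linear trend: for each day,
    fit y = a + b*t on the previous `window` days and extrapolate one
    day ahead."""
    t = np.arange(window)
    preds = []
    for end in range(window, len(volume)):
        y = volume[end - window:end]
        b, a = np.polyfit(t, y, 1)        # slope, intercept
        preds.append(a + b * window)      # extrapolate to the next day
    return np.array(preds)

# Made-up daily volume series with a gentle upward trend plus noise.
rng = np.random.default_rng(7)
volume = 1000.0 + 5.0 * np.arange(60) + rng.normal(scale=20.0, size=60)

preds = rolling_forecasts(volume, window=10)
mse = np.mean((preds - volume[10:]) ** 2)
```

Repeating the loop for window sizes of 11, 12, and so on, and comparing the resulting mean squared errors, reproduces the model-selection step described above.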
Modeling the financial markets is data-intensive, which is why Peter shared his experience in tackling big data problems in the second part of his presentation. As an example, he pointed out that his hedge fund received 100 GB of zipped stock exchange data daily. Before employing it for modelling, he had to clean it. Typical data consistency problems, like the presence of outliers, arise because stock brokers and traders do not report their trades right after execution but at the end of the day. A sudden jump and then drop in a price series is typically due to this. Outlier trimming that relies on quantile estimation, and other data analysis techniques, can be painfully slow with terabytes of data - for example, one time series estimation took Matlab 14 days. This is why Peter developed his own Price Server solution, which stored and compressed the same data so efficiently that it took only 6 hours to complete the 14-day Matlab challenge.
To end a memorable discussion in a memorable and original way, Peter sang, to his own guitar accompaniment, about how he feels about an industry driven by people obsessed with making money. A picture is worth a thousand words, so a video should be worth a lot more - see it on YouTube.
In what is becoming a tradition, we ended the evening with social drinks in a bar nearby.
If you want to be among the exclusive audience of such events in the future, do not miss our next presentation on 26.10.2014. Stay tuned for updates on our website, Twitter and Facebook page!
Written by: Vladimir Labov