Datathon cases


To make the case understandable to students from a wide range of backgrounds, it is framed as financial time-series prediction and made more engaging by the hot topic of cryptocurrencies. The case integrates knowledge from several fields – cryptocurrencies, quantitative finance and machine learning. The case is also stratified: teams can complete as many levels as they are able to solve.


Authors: Angel Marchev, Jr., Alexander Efremov, Petar Nikolov, Pavel Nikolov, Boryana Pelova

(Please refer to the datathon instructions regarding the competition itself)

    1. Level 1: Make a prediction model of the major cryptocurrencies’ prices
    2. Level 2: Make an autonomous A.I. decision-maker for trading/investing.
    1. The concept of the competition
    2. General Assumptions for the case
    3. Level 1
    4. Prediction scoring
    5. Level 2
    6. Additional assumptions for level 2
    7. Investment scoring
    1. Data transcript
    2. Selected cryptocurrencies
    3. Possible data challenges and solutions
    1. What should the teams prepare and upload
    2. Technical requirements for the uploaded code and data
    3. Questions
    4. Article instructions
    1. Glossary
    2. Computation considerations on portfolio significant variables




The goal is to build a successful investing/trading model for the cryptocurrency markets. The data consists of time series of prices (in 5-minute steps) and 24-hour volumes for various cryptocurrencies. The teams may also enrich the data during the datathon.

Cryptocurrencies’ portals: Link 1… ; Link 2… ; Link 3… ; Link 4…

Crypto guides & glossaries: Link 1… ; Link 2… ; Link 3… ; Link 4…

Cryptocurrencies data: Link 1… ; Link 2… ; Link 3… ; Link 4… ; Link 5…

The case has two levels of difficulty – to start solving the next level, a team first needs to complete the previous one:

  • LEVEL 1: Make a prediction model of the major cryptocurrencies’ prices

The process of investment could be analyzed in several consecutive phases, which are processes of transforming information. First are the input processes – from the environment, from the current state of the investment and from the investors’ goals. Using these inputs, the next phase is a predictor for forecasting/estimating the expected values for the significant variables and the external factors. Statistical analysis of past portfolio structure is also necessary. The prediction model should be able to forecast the prices of the major cryptocurrencies throughout the most volatile periods with some acceptable accuracy, utilizing all available information (including additional enrichment of data) for one period ahead.

  • LEVEL 2: Make an autonomous A.I. decision-maker for trading/investing.

Autonomous investment is a concept which implies that the process of investment management is conducted and self-maintained by a non-human entity. It responds to an investment environment that changes ever faster, accelerated by the rapid development of the communication and computational abilities of modern computers. In an effort to keep up with the technological advancements in the field and to keep the playing field level, researchers now tend to propose autonomous solutions for trading (which, of course, makes it harder and harder for humans to participate). The general approach advocated by the authors of the current case is based on machine learning. The A.I. system should be able to use the available data (including the predictions from Level 1) to form and simulate a viable investment strategy, working on its own for the whole period of the supplied data.


This section describes the problem from a data science point of view. It is meant to help the teams decide which general directions to take in their solutions.

The concept of the competition

In general, the case requires a series of historical simulation experiments over the cryptocurrency market, using the supplied pricing data. Every experiment uses historical (ex-post) data to predict (ex-ante) several points into the future. These predictions are then compared with the real data for the same points, which concludes one experiment. For the next experiment, the historical data window slides forward and the process starts over. All of the “models of investors” are to be back-tested on the unified competition data track, which consists of the complete time series of all possible assets. The back-testing is done for every data point (historical trading moment) with all possibilities (all assets available for trading at that moment). The direct result of such a systematic approach is a ranking list of the most successful (according to given criteria) A.I. investors.

Every A.I. investor is to be tested on each data point with the full variety of possible modifications. The minimal incremental step is one observation. The historical data window used for estimation, i.e. the period of previous data available to the model, is at most 2016 observations (7 days of 5-minute steps). With each advance to the next data point, the window “slides” forward as well. The horizon of prediction (minimum one observation) also slides forward. For the sake of equivalence among the teams, we will assume that the earliest period for prediction (Level 1) or investment decision (Level 2) is the first data point after 00:00 on 25.01.2018.

The prediction/investment horizon is 1 time step ahead, but the teams are free to implement methods working for more steps ahead. The sliding step of the prediction window is 1 time step, but again the teams may implement systems which re-evaluate at longer intervals. The last simulation experiment should finish with the last data point of the supplied data. During the historical simulation all necessary computations are carried out according to the business logic or the algorithms of the tested model of investor, based on all the data points in the data set. As a result, a new data array is formed with the selected solutions for each observation. This new data array has k columns and m-1-D rows, where k is the number of all assets included in the data set (some, most or all of the values could be equal to 0), m is the number of all observations, and D is the size of the pre-history data needed by the given model of investor. The A.I. investor reconsiders its investment portfolio every L periods, where L is the investment horizon.
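The sliding-window procedure described above can be sketched as a walk-forward loop. This is only a minimal illustration, not the organisers' evaluation code: the names `walk_forward` and `naive_last_value` are hypothetical, and the naive "last value" predictor merely stands in for a real model. The window cap of 2016 observations and the one-step horizon are taken from the text.

```python
from typing import Callable, List, Sequence

MAX_WINDOW = 2016  # maximum history length allowed by the case


def naive_last_value(history: Sequence[float]) -> float:
    """Placeholder model (assumption): the next price equals the last observed one."""
    return history[-1]


def walk_forward(prices: Sequence[float],
                 predict: Callable[[Sequence[float]], float],
                 start: int,
                 window: int = MAX_WINDOW) -> List[float]:
    """One-step-ahead predictions for prices[start:], using only data before each point."""
    if start < 1:
        raise ValueError("need at least one observation of history")
    predictions = []
    for t in range(start, len(prices)):
        history = prices[max(0, t - window):t]  # ex-post data strictly before t
        predictions.append(predict(history))
    return predictions


prices = [100.0, 101.0, 99.5, 102.0, 103.0]
preds = walk_forward(prices, naive_last_value, start=2, window=3)
print(preds)  # [101.0, 99.5, 102.0]
```

Each prediction sees only observations strictly before the predicted point, which is the property the back-test relies on.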

General Assumptions for the case

These assumptions are necessary for a methodologically correct run of the historical simulation:

  • the investor is an independent, private, small entity investing its own funds;
  • the transactions (frequency and amounts) do not influence the market prices;
  • all investment markets other than those presented in the data set are disallowed;
  • the minimum discretisation step is 5 minutes;
  • the moment “Now” is denoted by t=0 or t(0);
  • all necessary predictions are based on ex-post data before t(0);
  • technical limitations – once trained, the models should be able to run on a regular computer (1 CPU at 2 GHz and 4 GB RAM will be assumed).

Level 1

The main aim of time series modeling is to carefully collect and rigorously study the past observations of a time series to develop an appropriate model which describes the inherent structure of the series. This model is then used to generate future values for the series, i.e. to make forecasts. Time series forecasting thus can be termed as the act of predicting the future by understanding the past. There are many techniques which could be used to forecast the cryptocurrencies’ prices, and the choice would be left to the discretion of the participants. Some of the main classes of forecasting models are: Regression analysis; Autoregression & moving averages; Support vector machines, Artificial neural networks, etc.

Introduction to Time series forecasting: Link 1… ; Link 2… ; Link 3… ; Link 4…

Examples of Time series forecasting: Link 1… ; Link 2… ; Link 3… ; Link 4…

Have a look at the article, written especially for this case by one of our mentors…

Prediction scoring

The prediction models should be tested on the prices of the chosen 20 cryptocurrencies (see below), and every metric will be averaged over all currencies in the final score. To score the accuracy of the prediction models, the jury will choose several different testing periods out of the supplied data, each consisting of 288 observations. These periods will be unknown to the participants and will be chosen specifically for their contradictory movements and volatility. The prediction models should be able to perform within certain limits and will compete against each other on the following metrics:

1. The mean absolute percentage error (MAPE) is a measure of the prediction accuracy of a forecasting method. It measures how close the prediction is to the actual data over the whole period. It expresses accuracy as a percentage, with the smallest possible value being most desirable:

$$MAPE = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{y_t - f_t}{y_t} \right|$$

yt – actual data point at period t

ft – forecasted data point for period t

n – total number of periods considered for evaluation
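As a minimal sketch, MAPE can be computed from two equally long series of actual and forecasted prices. The function name `mape` is hypothetical; the code assumes all actual prices are non-zero, which holds for price data.

```python
from typing import Sequence


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    total = sum(abs((y - f) / y) for y, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Both forecasts are off by 10% of the actual value, so the result is about 10.
example_mape = mape([100.0, 200.0], [110.0, 180.0])
print(example_mape)
```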

2. The directional symmetry (DS) statistic gives the percentage of occurrences in which the sign of the change in value from one time period to the next is the same for both the actual and the predicted time series. The DS statistic is a measure of the performance of a model in predicting the direction of value changes. The case DS = 100% would indicate that a model perfectly predicts the direction of change of a time series from one time period to the next. The desirable score here is higher than 50%. In its standard form the statistic is:

$$DS = \frac{100\%}{n-1} \sum_{t=2}^{n} d_t, \qquad d_t = \begin{cases} 1, & (y_t - y_{t-1})(f_t - f_{t-1}) > 0 \\ 0, & \text{otherwise} \end{cases}$$

yt – actual data point at period t

ft – forecasted data point for period t

n – total number of periods considered for evaluation
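A short sketch of the directional symmetry statistic as described in the text: count the steps where the actual and the forecasted series move in the same direction. The function name is hypothetical, and treating a zero change in either series as a miss is an assumption; the jury's scoring may handle ties differently.

```python
from typing import Sequence


def directional_symmetry(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Percentage of steps where actual and forecast change in the same direction."""
    n = len(actual)
    if n != len(forecast) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    hits = 0
    for t in range(1, n):
        actual_change = actual[t] - actual[t - 1]
        forecast_change = forecast[t] - forecast[t - 1]
        if actual_change * forecast_change > 0:  # assumption: zero change counts as a miss
            hits += 1
    return 100.0 * hits / (n - 1)


# Actual moves: up, up, down. Forecast moves: up, down, down. Two of three agree.
example_ds = directional_symmetry([1.0, 2.0, 3.0, 2.0], [1.0, 1.5, 1.2, 1.0])
print(round(example_ds, 2))  # 66.67
```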

3. Stability over the testing periods will be calculated as an additional and necessary characteristic of the prediction model. The stability will be measured by the coefficient of variation of the mean absolute percentage error. The coefficient of variation is a relative measure of dispersion that corresponds to the standard deviation, but it is expressed in relative terms, which means it can be compared across different prediction models. The coefficient is computed over the values of MAPE for the various testing periods. The desired value is less than 1.

$$R = \frac{\sigma_M}{\bar{M}}, \qquad \bar{M} = \frac{1}{m} \sum_{i=1}^{m} M_i$$

Mi – mean absolute percentage error (MAPE) for testing set i

σM – standard deviation of Mi over the testing sets

R – generalized coefficient of variation

m – total number of testing sets
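The stability measure can be sketched as the standard deviation of the per-period MAPE values divided by their mean. The function name is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) rather than the sample one is an assumption, since the case text does not say which is used.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(mapes: Sequence[float]) -> float:
    """Standard deviation of the MAPE values divided by their mean."""
    mean = statistics.mean(mapes)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    # Assumption: population standard deviation; swap in statistics.stdev if required.
    return statistics.pstdev(mapes) / mean


example_cv = coefficient_of_variation([2.0, 4.0, 6.0])
print(round(example_cv, 4))  # 0.4082
```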

4. In order for the prediction models to have any practical value, computational efficiency will be computed as the time needed to calculate the next prediction relative to the minimal time step until the next actual observation. The model should be executable on a regular computer (1 CPU at 2 GHz and 4 GB RAM will be assumed), but there are no restrictions on the computer on which the model is trained. The desired value is the minimum possible.

$$U = \frac{1}{T_g} \sum_{c=1}^{p} T_c$$

Tc – time for calculating prediction for currency c in seconds

p – total number of predicted currencies

Tg – time interval for next actual data point in seconds

U – time for execution of all predictions relative to the time interval
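One way to measure this, sketched under assumptions: time a single prediction per currency with `time.perf_counter` and divide the total by the 300-second step. The function name and the dummy predictors are hypothetical; timings taken on a development machine will differ from those on the reference computer.

```python
import time
from typing import Callable, Iterable

STEP_SECONDS = 300.0  # one 5-minute step, as stated in the case


def relative_execution_time(predictors: Iterable[Callable[[], object]],
                            step_seconds: float = STEP_SECONDS) -> float:
    """Total time for one prediction per currency, divided by the step length."""
    total = 0.0
    for predict in predictors:
        started = time.perf_counter()
        predict()
        total += time.perf_counter() - started
    return total / step_seconds


# Dummy stand-ins for 20 per-currency prediction calls (assumption for illustration).
dummy_predictors = [lambda: sum(range(1000)) for _ in range(20)]
u = relative_execution_time(dummy_predictors)
print(u < 1.0)
```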

5. The combined score for each model will be computed as a linear combination of the metrics specified above, averaged over the studied cryptocurrencies. To mitigate the possibilities for cheating, the competition scoring may be done on a dedicated server. The finalists’ models may also be trained on a dedicated server. Note that this score is one single number for the whole team solution. The highest possible value of the score is most desirable. Please bear in mind that this score may be a negative number.

Where:

Z – combined score

D – mean directional symmetry over the different testing periods

M – mean MAPE over the different testing periods

R – generalized coefficient of variation

U – time for execution of all predictions relative to the time interval

Level 2

Over the years a significant number of portfolio models, methods, procedures and strategies (“investors”) have been proposed by theoreticians and practitioners in the field. Application of each of them should be considered as a systematic process consisting of several phases (i.e. goal setting, data collection, data structuring, statistical testing, enforcing limitations, forecasting, generating and developing feasible solutions, selection of optimal solution, realizing the investment solution, retrieving feedback of the significant outcomes, etc).

You should build an autonomous trading/investment system which presents the best chance of profiting on the cryptocurrency market. This system (A.I.) should run throughout the whole period of the data set, starting on the first data point after 00:00 on 25.01.2018 and finishing on the last time step. The A.I. should be able to decide on its own, without human interaction, when to buy and sell the cryptocurrencies. The task at hand is to simulate the results of such an A.I. over the whole data set period. The participants are free to choose which approach to use to build their autonomous trading system. Possible approaches (out of many) are trading robots (typically directed towards trading with one or two assets), statistical arbitrage (utilizing pricing inefficiencies among several assets) and autonomous portfolio management (using dynamic optimization techniques to allocate capital among many assets).

Trading robots introduction: Link 1… ; Link 2… ; Link 3…

Trading robot example: Link 1…

Trading robot builders: Link 1… ; Link 2…

Statistical arbitrage introduction: Link 1… ; Link 2… ; Link 3… ; Link 4…

Statistical arbitrage examples: Link 1… ; Link 2…

Statistical arbitrage advanced examples: Link 1… ; Link 2…

Investment portfolio introduction: Link 1… ; Link 2… ; Link 3…

Investment portfolio further reads: Link 1… ; Link 2… ; Link 3…

Investment portfolio advanced example: Link 1…

Additional assumptions for level 2

In order to define the case for Level 2, additional assumptions about the trading procedures are needed. Assumptions on simulating the investment value:

  • the initial invested amount is 10000 USD, simulating a small investor;
  • being a small investor also means that the transactions do not influence the market prices;
  • there is no capital inflow or outflow (no new funds added, no funds withdrawn) for the full period of the data;
  • the initial capital can be invested no earlier than the first data point after 00:00 on 25.01.2018;
  • if the investment value reaches 0 USD (a real return of -100%), the investor is insolvent/bankrupt (nonetheless, you should keep the results even from these tryouts, as they are usually useful experiments);


$$Mv(t) = \sum_{i=1}^{k} Nu_i(t)\, P_i(t)$$

Mv (t) – market value of investment at the moment t

Nui (t) – equivalent number of units of asset i at moment t

Pi (t) – market price of asset i at moment t

k – total number of assets

  • the value of the USD positions is unchanging (regardless of the real-world trade possibilities for USD), which means zero return – the case is about cryptocurrencies, not fiat currencies;
  • if the A.I.’s procedures produce non-computable values (the model cannot present a realizable result and there is no possible solution for the current time moment), the states of the investments in all assets are kept at their values from the previous time moment;
  • no fundamental events influence the value of the investment (e.g. do not account for effects from forks, airdrops, burns, etc.).
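The valuation assumptions above can be sketched in a few lines: the market value is the sum of units times prices plus the unchanging USD position, and a non-computable proposal falls back to the previous holdings. The function names and the dictionary layout (asset ticker to number of units) are assumptions made for illustration.

```python
import math
from typing import Dict


def market_value(units: Dict[str, float], prices: Dict[str, float], usd: float = 0.0) -> float:
    """Units of each asset times its price, plus the USD position (which earns no return)."""
    return usd + sum(units[asset] * prices[asset] for asset in units)


def keep_previous_if_invalid(previous: Dict[str, float],
                             proposed: Dict[str, float]) -> Dict[str, float]:
    """Fall back to the previous holdings when the proposal contains NaN or infinity."""
    if any(not math.isfinite(value) for value in proposed.values()):
        return dict(previous)
    return dict(proposed)


holdings = {"BTC": 0.5, "ETH": 2.0}
prices = {"BTC": 10756.0, "ETH": 960.93}
value = market_value(holdings, prices, usd=1000.0)
print(round(value, 2))  # 8299.86

kept = keep_previous_if_invalid(holdings, {"BTC": float("nan"), "ETH": 1.0})
print(kept == holdings)  # True
```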

Assumptions on simulating the market:

  • there is perfect liquidity on the market, so the investor could always make a trade at the given price;
  • all transactions in time moment t (and after moment t-1) are made at price P(t);
  • there is no short selling;
  • there is no margin trading.

Assumptions on simulating the market frictions:

  • all trades are charged 0.25% of the turnover of the trade;


Nui (t) – equivalent number of units of asset i at moment t

Pi (t) – purchase price of asset i at moment t in USD

Bf – exchange fee (0.0025)

  • the trading fee includes all blockchain transaction fees (since there are no withdrawals allowed);
  • no income taxes are owed to any fiscal entity;
  • no market spread – all trades are simulated on the given market prices;
  • the minimum trade value for orders within the same time frame and with the same cryptocurrency is 0.001 BTC (100,000 Satoshis);
  • the maximum trade value for orders within the same time frame and with the same cryptocurrency is 0.5 BTC;
  • Rationalizing the transactions – due to the various market frictions it is rational that before realizing a transaction the new desired investment structure (computed at period t) is compared to the old investment structure (computed at period t-1), so that the new market order trades only the differences.


$$MOr_i(t) = Nu_i(t) - Nu_i(t-1)$$

MOri(t) – market order in number of units of asset i at moment t

Nui (t) – equivalent number of units of asset i at moment t

  • for any other market frictions you should refer to the rules of Bittrex (here…)
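The friction rules above can be illustrated with a small sketch: orders are the differences between the new and the old holdings, and the fee is 0.25% of the traded turnover. The function names and the data layout are hypothetical, and the sketch deliberately does not enforce the minimum and maximum trade values or any Bittrex-specific rules.

```python
from typing import Dict

EXCHANGE_FEE = 0.0025  # 0.25% of turnover, as stated in the case


def market_orders(old_units: Dict[str, float], new_units: Dict[str, float]) -> Dict[str, float]:
    """Trade only the difference between the desired and the current holdings."""
    assets = sorted(set(old_units) | set(new_units))
    return {a: new_units.get(a, 0.0) - old_units.get(a, 0.0) for a in assets}


def trading_fee(orders: Dict[str, float], prices: Dict[str, float],
                fee_rate: float = EXCHANGE_FEE) -> float:
    """Fee in USD charged on the absolute turnover of all orders."""
    return sum(abs(quantity) * prices[asset] * fee_rate for asset, quantity in orders.items())


old = {"BTC": 0.5, "ETH": 2.0}
new = {"BTC": 0.25, "ETH": 2.0, "LTC": 10.0}
example_prices = {"BTC": 10000.0, "ETH": 1000.0, "LTC": 176.0}

orders = market_orders(old, new)
print(orders)  # {'BTC': -0.25, 'ETH': 0.0, 'LTC': 10.0}

fee = trading_fee(orders, example_prices)
print(round(fee, 2))  # 10.65
```

Because the unchanged ETH position produces a zero order, it adds nothing to the fee, which is the point of trading only the differences.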

Investment scoring

  1. Risk-adjusted return measure. In order to measure the performance of the investment A.I., it is essential to implement a viable indicator. It has to be a risk-adjusted return because:
  • Firstly – since all possible benefits from investing in a given asset are already reflected in the individual return of the asset, the return becomes a general measure of the benefit.
  • Secondly – the return alone is not a good measure, as the same value of return may be achieved at various values of risk. Ergo, risk-adjusted return is a better measurement for risky investments.
  • Thirdly – using risk-adjusted return effectively compresses a two-dimensional problem into a one-dimensional one and, among other advantages, relaxes the multi-criteria optimization problem to a single-criterion optimization problem.

The risk adjusted return should be computed by using an asymmetric measure modified from Sortino ratio (Frank A. Sortino and Robert van der Meer. “Downside Risk.” Journal of Portfolio Management. Vol. 17, No. 5, Spring 1991, pp. 27–31). The asymmetric measures calculate the risk far more precisely, but require at least twice as many data points for computing a robust estimation. Desired value is maximum possible.


G – Generalised asymmetric risk adjusted return measure

Rp – Return of the investment A.I. for the whole data set

Rp(t) – Compound return of the investment A.I. for time data point t

Mv(M) – Market value of the investment at data point M (endpoint)

Mv(1) – Market value at the start of the competition (10000 USD)

ν0(Rp) – Lower partial moment of return with minimum acceptable return = 0
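For orientation only, the sketch below computes the standard Sortino ratio with a minimum acceptable return of 0: the mean per-step return divided by the square root of the second-order lower partial moment. This is an assumption-laden stand-in; the case uses a modified measure G whose exact form is not reproduced here, so the numbers from this sketch should not be taken as the official score. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def period_returns(values: Sequence[float]) -> List[float]:
    """Simple per-step returns from a series of market values."""
    return [values[t] / values[t - 1] - 1.0 for t in range(1, len(values))]


def lower_partial_moment(returns: Sequence[float], mar: float = 0.0, order: int = 2) -> float:
    """Average of the shortfalls below the minimum acceptable return, raised to `order`."""
    return sum(max(mar - r, 0.0) ** order for r in returns) / len(returns)


def sortino_ratio(returns: Sequence[float], mar: float = 0.0) -> float:
    """Standard Sortino ratio; not necessarily identical to the case's measure G."""
    downside = math.sqrt(lower_partial_moment(returns, mar))
    if downside == 0.0:
        raise ValueError("no returns below the minimum acceptable return")
    mean_return = sum(returns) / len(returns)
    return (mean_return - mar) / downside


example_ratio = sortino_ratio([0.02, -0.01, 0.03, -0.02])
print(round(example_ratio, 4))  # 0.4472
```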

2. In order for the A.I. investor to have any practical value, computational efficiency will be computed as the time needed to calculate the next solution relative to the minimal time step until the next actual observation. The A.I. should be executable on a regular computer (1 CPU at 2 GHz and 4 GB RAM will be assumed), but there are no restrictions on the computer on which the model is trained. The critical value is < 1 (in other words, the time for calculation should be less than 300 seconds).

$$U(t) = \frac{T_s(t)}{T_g(t)}$$

Ts (t) – time for calculating solution t in seconds

Tg (t) – time interval for actual data point t in seconds

U (t) – time for execution of solution t relative to the time interval

4. DATA DESCRIPTION

The dataset is the same for all participants but could be enriched with any publicly available data (of course, any enrichment should be diligently described). The dataset is considered a test track for tryout, evaluation, competition and comparative analysis of the A.I. investors.

You could download the dataset here…

Data transcript

Description of each variable. There are 2 types of data structure in the dataset. First is the structure of CSVData_coin.csv which contains the information about coding of each cryptocurrency. The second data structure is in all other files (CSVData_priceData_5min*.csv), which contain the actual prices in USD, trading volumes and some derived variables on data movements. There are files for every cryptocurrency (known by the ending code of the filename) and there is also one big file with all cryptocurrencies in the same place (filename – without any code at the end).

file | variable name | max_length | precision | scale | notes
CSVData_coin | ID | 4 | 10 | 0 | Coin ID
CSVData_coin | coinName | 100 | 0 | 0 | Coin Name
CSVData_coin | descirption | 200 | 0 | 0 | Description of the coin
CSVData_coin | rowCreation | 8 | 23 | 3 | time stamp for record
CSVData_coin | Symbol | 40 | 0 | 0 | Coin Symbol
CSVData_priceData_5min | ID | 4 | 10 | 0 | record ID
CSVData_priceData_5min | refID_coin | 4 | 10 | 0 | reference Coin ID (comes from coin table)
CSVData_priceData_5min | marketCap | 8 | 19 | 0 | Total market capitalization for selected cryptocurrency
CSVData_priceData_5min | price | 17 | 38 | 15 | usd price
CSVData_priceData_5min | CirculatingSupply | 8 | 19 | 0 | Number of coin units in circulation
CSVData_priceData_5min | Volume24h | 8 | 19 | 0 | Traded volume for 24 hours in number of coin units
CSVData_priceData_5min | Movement1h | 9 | 10 | 2 | Price movement in percent for the past 1 hour in %
CSVData_priceData_5min | Movement24h | 9 | 10 | 2 | Price movement in percent for the past 24 hour in %
CSVData_priceData_5min | Movement7d | 9 | 10 | 2 | Price movement in percent for the past 7 days in %
CSVData_priceData_5min | rowCreation | 8 | 23 | 3 | time stamp

Sample of CSVData_coin.csv

ID | coinName | descirption | rowCreation | Symbol
1442 | bitcoin | Bitcoin | 2018-01-15 20:40:28 | BTC
1443 | ethereum | Ethereum | 2018-01-15 20:40:28 | ETH
1444 | ripple | Ripple | 2018-01-15 20:40:28 | XRP
1445 | bitcoin-cash | Bitcoin Cash | 2018-01-15 20:40:28 | BCH
1446 | cardano | Cardano | 2018-01-15 20:40:28 | ADA
1447 | nem | Nem | 2018-01-15 20:40:28 | XEM
1448 | litecoin | Litecoin | 2018-01-15 20:40:28 | LTC
1449 | neo | Neo | 2018-01-15 20:40:28 | NEO
1450 | stellar | Stellar | 2018-01-15 20:40:28 | XLM
1451 | iota | Iota | 2018-01-15 20:40:28 | MIOTA
1452 | eos | Eos | 2018-01-15 20:40:28 | EOS
1453 | dash | Dash | 2018-01-15 20:40:28 | DASH
1454 | monero | Monero | 2018-01-15 20:40:28 | XMR
1455 | tron | Tron | 2018-01-15 20:40:28 | TRX
1456 | bitcoin-gold | Bitcoin Gold | 2018-01-15 20:40:28 | BTG
1457 | ethereum-classic | Ethereum Classic | 2018-01-15 20:40:28 | ETC
1458 | qtum | Qtum | 2018-01-15 20:40:28 | QTUM
1459 | icon | Icon | 2018-01-15 20:40:28 | ICX
1460 | lisk | Lisk | 2018-01-15 20:40:28 | LSK
1461 | raiblocks | Raiblocks | 2018-01-15 20:40:28 | XRB

  Sample of CSVData_priceData_5min.csv

ID | refID_coin | marketCap | price | CirculatingSupply | Volume24h | Movement1h | Movement24h | Movement7d | rowCreation
1 | 1442 | 180786170372 | 10756 | 16807937 | 17884600000 | -1.42 | -11.4 | -24.18 | 2018-01-17 11:25:18
2 | 1443 | 93242345727 | 960.93 | 97033038 | 7990730000 | -2.37 | -12.48 | -28.61 | 2018-01-17 11:25:18
3 | 1444 | 43630734374 | 1.13 | 38739142811 | 6058320000 | -3.03 | -18.17 | -41.2 | 2018-01-17 11:25:18
4 | 1445 | 29504008303 | 1744.13 | 16916175 | 1544790000 | -3.2 | -12.15 | -29.28 | 2018-01-17 11:25:18
5 | 1446 | 14111015557 | 0.544258 | 25927070538 | 1511130000 | -2.56 | -14.91 | -26.02 | 2018-01-17 11:25:18
6 | 1448 | 9651298201 | 176.14 | 54793958 | 1353810000 | -1.36 | -12.67 | -27.06 | 2018-01-17 11:25:18
7 | 1447 | 7984160999 | 0.887129 | 8999999999 | 174228000 | -0.79 | -20.52 | -38.18 | 2018-01-17 11:25:18
8 | 1449 | 7791225000 | 119.86 | 65000000 | 1366290000 | -3.92 | -21.24 | 0.49 | 2018-01-17 11:25:18
9 | 1450 | 7357217335 | 0.411229 | 17890803748 | 443714000 | -1.09 | -14.33 | -22.92 | 2018-01-17 11:25:18
10 | 1451 | 6905187082 | 2.48 | 2779530283 | 245908000 | -1.79 | -16.55 | -27.3 | 2018-01-17 11:25:18
11 | 1453 | 5958391224 | 761.64 | 7823066 | 231959000 | -1.18 | -5.49 | -29.25 | 2018-01-17 11:25:18
12 | 1452 | 5815328336 | 9.54 | 609350922 | 1503360000 | -2.07 | -13.14 | 5.19 | 2018-01-17 11:25:18
13 | 1454 | 4884965242 | 312.89 | 15612355 | 223049000 | -1.64 | -11.18 | -20.44 | 2018-01-17 11:25:18
14 | 1455 | 3446487375 | 0.05242 | 65748192475 | 699245000 | -2.11 | -12.28 | -52.94 | 2018-01-17 11:25:18
15 | 1456 | 3048981646 | 181.82 | 16769324 | 394049000 | -5.69 | -16.69 | -21.53 | 2018-01-17 11:25:18
16 | 1457 | 2791091675 | 28.14 | 99178867 | 654307000 | -0.13 | -15.03 | -24.54 | 2018-01-17 11:25:18
17 | 1459 | 2599230395 | 6.84 | 380045004 | 98879000 | -2.85 | -9.99 | -38.62 | 2018-01-17 11:25:18
18 | 1458 | 2468682828 | 33.44 | 73813652 | 979452000 | -1.28 | -20.51 | -35.7 | 2018-01-17 11:25:18
19 | 1460 | 2218036636 | 18.94 | 117087568 | 107490000 | -0.74 | -16.59 | -34.1 | 2018-01-17 11:25:18
20 | 1461 | 2028585280 | 15.22 | 133248289 | 25533900 | -1.98 | -6.93 | -45.94 | 2018-01-17 11:25:18
..

Sample of CSVData_priceData_5min2496.csv

ID | refID_coin | marketCap | price | CirculatingSupply | Volume24h | Movement1h | Movement24h | Movement7d | rowCreation
1067 | 2496 | 77862 | 0.031721 | 2454578 | 0 | -1.74 | -39.91 | -28.06 | 2018-01-17 11:25:18
2509 | 2496 | 77896 | 0.031735 | 2454578 | 0 | -1.12 | -39.58 | -28.04 | 2018-01-17 11:30:22
3951 | 2496 | 78374 | 0.03193 | 2454578 | 0 | -0.51 | -39.21 | -27.61 | 2018-01-17 11:35:25
5393 | 2496 | 77722 | 0.031664 | 2454579 | 0 | -1.48 | -39.46 | -28.23 | 2018-01-17 11:45:22
6835 | 2496 | 77647 | 0.031634 | 2454579 | 0 | -1.79 | -39.38 | -28.32 | 2018-01-17 11:50:38
8276 | 2496 | 76967 | 0.031357 | 2454579 | 0 | -2.65 | -39.93 | -28.96 | 2018-01-17 11:55:42
9718 | 2496 | 75620 | 0.030808 | 2454579 | 0 | -4.15 | -40.96 | -30.21 | 2018-01-17 12:00:46
11160 | 2496 | 75046 | 0.030574 | 2454581 | 0 | -4.81 | -41.36 | -30.76 | 2018-01-17 12:05:49
12602 | 2496 | 76056 | 0.030985 | 2454581 | 0 | -2.99 | -40.56 | -29.85 | 2018-01-17 12:10:53
14044 | 2496 | 75909 | 0.030926 | 2454581 | 0 | -2.87 | -40.67 | -30.03 | 2018-01-17 12:15:59
15486 | 2496 | 74738 | 0.030448 | 2454581 | 0 | -4.38 | -41.63 | -31.11 | 2018-01-17 12:21:01
16928 | 2496 | 74243 | 0.030247 | 2454584 | 0 | -4.9 | -42.02 | -31.59 | 2018-01-17 12:26:03
18371 | 2496 | 73033 | 0.029754 | 2454591 | 0 | -3.9 | -43.21 | -32.93 | 2018-01-17 13:03:24
19813 | 2496 | 72570 | 0.029565 | 2454591 | 0 | -3.9 | -43.57 | -33.38 | 2018-01-17 13:08:28
21255 | 2496 | 72689 | 0.029613 | 2454591 | 0 | -3.36 | -43.49 | -33.28 | 2018-01-17 13:13:31
22696 | 2496 | 74940 | 0.03053 | 2454591 | 0 | -0.1 | -41.74 | -31.24 | 2018-01-17 13:18:34
24138 | 2496 | 75967 | 0.030949 | 2454594 | 0 | 1.86 | -40.97 | -30.32 | 2018-01-17 13:23:37
25580 | 2496 | 76337 | 0.031099 | 2454594 | 0 | 2.44 | -40.77 | -30.01 | 2018-01-17 13:28:40
27022 | 2496 | 75879 | 0.030913 | 2454594 | 0 | 2.19 | -41.2 | -30.45 | 2018-01-17 13:33:45
28464 | 2496 | 74952 | 0.030536 | 2454594 | 0 | 0.89 | -41.98 | -31.32 | 2018-01-17 13:38:48
29906 | 2496 | 75565 | 0.030785 | 2454597 | 0 | 1.92 | -41.57 | -30.79 | 2018-01-17 13:43:53
31348 | 2496 | 75312 | 0.030682 | 2454597 | 0 | 1.95 | -41.84 | -31.04 | 2018-01-17 13:48:55
32790 | 2496 | 74611 | 0.030396 | 2454597 | 0 | 1.28 | -42.43 | -31.71 | 2018-01-17 13:53:57
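A minimal sketch of reading the price files into per-coin series, using only the Python standard library. It assumes the files are comma-separated with a header row that matches the variable names listed above, and that timestamps have the second-level format shown in the samples; both are assumptions that should be checked against the actual files. The function name `load_prices` is hypothetical, and the inline sample stands in for a real file handle.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Two rows copied from the sample above, written as comma-separated text (assumed layout).
SAMPLE = """ID,refID_coin,marketCap,price,CirculatingSupply,Volume24h,Movement1h,Movement24h,Movement7d,rowCreation
1,1442,180786170372,10756,16807937,17884600000,-1.42,-11.4,-24.18,2018-01-17 11:25:18
2,1443,93242345727,960.93,97033038,7990730000,-2.37,-12.48,-28.61,2018-01-17 11:25:18
"""


def load_prices(handle):
    """Return {coin ID: [(timestamp, price), ...]} sorted by time."""
    series = defaultdict(list)
    for row in csv.DictReader(handle):
        stamp = datetime.strptime(row["rowCreation"], "%Y-%m-%d %H:%M:%S")
        series[int(row["refID_coin"])].append((stamp, float(row["price"])))
    for points in series.values():
        points.sort()
    return dict(series)


loaded = load_prices(io.StringIO(SAMPLE))
print(loaded[1442][0][1])  # 10756.0
```

Note that the sample timestamps are not perfectly regular (for example, no row near 11:40 and a gap between 12:26 and 13:03), so the series may need to be aligned to a 5-minute grid before modeling.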

Selected cryptocurrencies

List of the currencies to be used in level 1 (prediction modeling):

Currency | Ticker | Coin ID
Bitcoin | BTC | 1442
Bitcoin Cash | BCH | 1445
Bitcoin Gold | BTG | 1456
Cardano | ADA | 1446
Dash | DASH | 1453
Dogecoin | DOGE | 1477
Eos | EOS | 1452
Ethereum | ETH | 1443
Ethereum Classic | ETC | 1457
Iota | MIOTA | 1451
Lisk | LSK | 1460
Litecoin | LTC | 1448
Monero | XMR | 1454
NEMcoin | XEM | 1447
Neo | NEO | 1449
Ripple | XRP | 1444
Stellar | XLM | 1450
Tether | USDT | 1474
Tron | TRX | 1455
Zcash | ZEC | 1465

Possible data challenges and solutions

  • Data challenge: there are price time series which do not have enough data points to conduct reasonable analysis or to apply a certain method. Solution: making a well-grounded selection of the assets in the research database, but still leaving as many assets as possible in the data set.
  • Data challenge: in the initial trading sessions of some assets there are relatively large periods of trading inactivity or strange, uncharacteristic behavior. Solution: eliminating the boundary effects by removing a certain number of initial observations.
  • Data challenge: in price time series there is missing data. Solution: data imputation in the time series. There are many approaches to data imputation of financial time series. According to Lazarov, D., “Missing data evaluation in financial time series”, these are the most promising ones:

LVCF – Last value carried forward imputation

LINT – Linear interpolation

R1 – Multivariate regression imputation

NP100 – Non-parametric multivariate regression imputation using a moving window of length 100

MARX1 – Multivariate autoregression imputation of lag 1

AR5X – Univariate autoregression imputation of lag 1 to lag 5

MLP – Univariate multi-layer perceptron imputation

  • Data challenge: not enough data for a given A.I. trader. Solution: collect additional data and enrich the data set. This may include economic/business data, data from various markets, etc. Some popular data sources for cryptocurrencies are the following (but any data source is acceptable, as long as it is publicly accessible): Link1…, Link2…, Link3…, Link4…, Link5…, Link6…, Link7…
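The two simplest imputation methods named above, LVCF and LINT, can be sketched on a list where missing observations are marked with `None`. The function names and the `None` convention are assumptions for illustration; leading gaps (for LVCF) and gaps at either end (for LINT) are left unfilled because there is nothing to carry forward or interpolate between.

```python
from typing import List, Optional, Sequence


def lvcf(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Last value carried forward: repeat the most recent known observation."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def linear_interpolation(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Fill gaps between two known observations along a straight line."""
    filled = list(values)
    known = [i for i, value in enumerate(values) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


gappy = [1.0, None, None, 4.0, None]
print(lvcf(gappy))                  # [1.0, 1.0, 1.0, 4.0, 4.0]
print(linear_interpolation(gappy))  # [1.0, 2.0, 3.0, 4.0, None]
```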

5. EXPECTED OUTPUTS

At the delivery deadline the teams will submit their solutions and the competition phase will start. The competition will be held in two stages – semi-finals and finals. The resulting solutions will be evaluated and ranked by the Jury members, but their decisions will also be supported by objective metrics for prediction accuracy, risk-weighted return, real return and computational time. The teams will have to prepare and upload the following:

  • a co-authored article (see below for more instructions). We provide Jupyter Notebook embedding functionality on our website, so you could make your article in a notebook, and import it directly.
  • Source code, including all necessary libraries, environments or workflows, so that the code works standalone. The models that you upload should already be trained so that they can be evaluated on our servers. Even if we cannot execute the code on our servers, we are still going to check it afterwards.
  • Prediction datasets for the solution of Level 1.
  • Calculated metrics, as required by the case for Level 1 and Level 2

Technical requirements for the uploaded code and data:

At the strict deadline you should publish your article, including your source code. You can upload a Python Jupyter Notebook directly or, if you use something else, zip the source code and upload it as an attachment media file. YOU CANNOT CHANGE THE CODE OR THE ARTICLE AFTER THAT.

  • Level 1:

After the deadline you will receive the test time periods and you will have exactly 15 minutes to upload your predictions for the stated time periods. Note that we expect you to have calculated predictions in advance for the whole period from 25.01.2018 00:00 until the end of the data set, so it is only a matter of cutting out the required data sets and saving them into 100 csv files (see below). We believe it takes only about 5 minutes to do these simple operations, so 15 minutes is plenty of time. You also have to calculate and upload the metrics as they are shown in the case; you should do this no later than 30 after the deadline for the article. The predictions should be formed into 100 csv files (20 currencies, 5 days), each consisting of 288 data points. Each csv file with predictions should be named <coinID>_<test period>.csv (for example: 1477_1.csv). These csv files should consist of a single column of numbers, where the top row represents the prediction for the earliest time point within the test period and the last row represents the latest time point.

prediction (T)
prediction (T+1)
prediction (T+n)
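Writing one such file can be sketched as follows. The function name `write_prediction_file` is hypothetical; the file naming pattern, the single column without a header and the length of 288 rows follow the description above, while the constant dummy predictions are only placeholders.

```python
import csv
import os
import tempfile
from typing import Sequence

PERIOD_LENGTH = 288  # data points per test period, as stated in the case


def write_prediction_file(directory: str, coin_id: int, test_period: int,
                          predictions: Sequence[float]) -> str:
    """Write a single-column csv named <coinID>_<test period>.csv, earliest point first."""
    if len(predictions) != PERIOD_LENGTH:
        raise ValueError(f"expected {PERIOD_LENGTH} predictions, got {len(predictions)}")
    path = os.path.join(directory, f"{coin_id}_{test_period}.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for value in predictions:
            writer.writerow([value])
    return path


with tempfile.TemporaryDirectory() as directory:
    dummy_predictions = [0.0065] * PERIOD_LENGTH  # placeholder values only
    written = write_prediction_file(directory, 1477, 1, dummy_predictions)
    written_name = os.path.basename(written)
    with open(written) as handle:
        written_rows = sum(1 for _ in handle)

print(written_name, written_rows)  # 1477_1.csv 288
```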

The following metrics should be calculated, written in the article and uploaded:

  • M (MAPE) for every currency for every test period: csv file – 21 rows (including a header);
  • D (Directional Symmetry) for every currency for every test period;
  • R (coefficient of variation of MAPE) for every currency;
  • U (computational efficiency): one number for the whole solution;
  • Z (prediction score): one number for the whole solution.

Download template files to fill in with the metrics for level 1 from here…

  • Level 2:

You should upload one matrix containing the relative weights of the investment back-simulation as a csv file named portfolio_sim.csv. The matrix should contain 15817 lines (1 header row and 15816 observations from 25.01.2018 00:00 until the end of the dataset) and 1679 columns (1678 cryptocurrencies and 1 column for the USD position). Every cell holds a value between 0 and 1: the share of the total capital invested in that currency (column) at that time point (row). Every data row sums to 1.00 (or 100%), which is guaranteed by the USD position. If your trader AI doesn't want to invest in any cryptocurrency, it may put 100% of the capital in USD (do not forget that by rule USD doesn't bring any return).

Time point | 1 | 2 | 3 | … | 1678 | USD
00:00 25.01.2018 (1) | | | | | |
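The row constraints described above (every cell between 0 and 1, every row summing to 1.00 through the USD position) can be checked before upload. The following is only a sketch under the assumption that each row is held as a plain list of weights with USD as the last element; the helper names are hypothetical.

```python
import math


def complete_with_usd(crypto_weights):
    """Append the USD weight so that the row sums to 1.00.

    crypto_weights -- list of weights for the crypto columns only
    """
    if any(w < 0.0 or w > 1.0 for w in crypto_weights):
        raise ValueError("every weight must lie between 0 and 1")
    invested = math.fsum(crypto_weights)
    if invested > 1.0 + 1e-9:
        raise ValueError("crypto weights exceed 100% of the capital")
    usd = max(0.0, 1.0 - invested)
    return crypto_weights + [usd]


def is_valid_row(row, tolerance=1e-9):
    """Check one data row: all cells in [0, 1] and the row sums to 1.00."""
    in_range = all(0.0 <= w <= 1.0 for w in row)
    return in_range and abs(math.fsum(row) - 1.0) <= tolerance


# Toy rows with 3 crypto columns instead of 1678.
row_mixed = complete_with_usd([0.25, 0.0, 0.5])
row_cash = complete_with_usd([0.0, 0.0, 0.0])
print(row_mixed, is_valid_row(row_mixed))
print(row_cash, is_valid_row(row_cash))
```

Deriving the USD weight as the remainder, rather than storing it independently, keeps the row sum correct by construction.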

The following metrics should be calculated, written in the article and uploaded:

  • G (Generalised asymmetric risk adjusted return measure) for the whole data set
  • U (computational efficiency) for each time point (it should be < 300 seconds)

Download template files to fill in with the metrics for level 2 from here…

Questions:

  1. Level 1: Prediction modelling

What necessary data preparation did you need? Which is the most suitable method for forecasting cryptocurrencies? Which cryptocurrencies were hardest to predict? What anomalies in behavior of the cryptocurrencies did you detect?

  2. Level 2: A.I. Trading

How many assets did you include in your A.I. Trading? How did you make the selection? What is the best risk-adjusted return you have achieved? With which model? Would this be the winning model if we ranked them on return only? What is the optimal time for reevaluating the investment (optimal investment horizon)? What is the optimal size of the training data set? (Bonus) If you use the calendar for fundamental events such as airdrops, burns, forks (here…), how would you change your A.I. Trader to be more profitable?

Article instructions

The main focal point for presenting each team's results from the Datathon is the written article. It will be considered by the jury and will show how well the team has done the job. Considering the short amount of time and resources in the world of Big Data Analysis, it is essential to follow a time-tested and many-project-tested methodology: CRISP-DM. You could read more at …

The organizing team has tried to do most of the work on phases “1. Business Understanding” and “2. Data Understanding”, while the teams are expected to focus on phases 3, 4 and 5 (“Data Preparation”, “Modeling” and “Evaluation”), so that the best solutions have the best results in phase “5. Evaluation”. Phase “6. Deployment” mostly stays in the hands of the companies providing the case studies, as we aim to continue the process after the event. So stay tuned and follow the updates on the website of the event.

1. Business Understanding

This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard, can be used.

2. Data Understanding

The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

3. Data Preparation

The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.

4. Modeling

In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.

5. Evaluation

At this stage in the project you have built a model (or models) that appears to have high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to evaluate the model more thoroughly and review the steps executed to construct it, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

6. Deployment

Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g. segment allocation) or data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. Even if the analyst deploys the model, it is important for the customer to understand up front the actions which will need to be carried out in order to actually make use of the created models.


  1. Glossary


Investing

  • The investment process is a chain of considerations and actions for an individual, from thinking about investing to placing buy/sell orders for investment assets. (Investments, Bodie, Kane, Marcus, pp. 858)
  • The sacrifice of certain present value for (possibly uncertain) future value. (Investments, Sharpe, Alexander, Bailey, pp. 1013)
  • Investing is a process of consciously sacrificing one's own resources in the pursuit of a future reward or goal. An important notion is that the uncertainty of future events may clash with some current actions. None of the future outcomes is guaranteed to offset the undertaken restriction of the investor’s degrees of freedom (Alexander et al., 1993, pp. 840).
  • Although it is generally accepted that “investing” differs from “trading” based on the term of the strategy (longer terms for investing), for the current needs trading will be assumed to be a special case of investing, as it fits perfectly well with the above definitions (presuming “future” to be both short and long term).


Investor

  • The investor is an entity (physical or legal), purposefully using financial (and other) resources for investment and pursuing future rewards. It is assumed that such entity acts rationally as a real Homo Economicus (Mill, 1836).

Investment asset/asset

  • An asset is an instrument that can be traded freely and easily on a well-developed market. (Investment Science, Luenberger, pp. 40)
  • An asset (a.k.a. investment instrument, investment asset, position), according to Marchev (2012), is an investment opportunity traded freely on a transparent market which publicly transmits enough relevant information.

Organized market

  • A central location where trading of assets is done under a set of rules and regulations.
  • A mechanism designed to facilitate the exchange of financial assets by bringing buyers and sellers of assets together. (Investments, Sharpe, Alexander, Bailey, pp. 1010 -1017)
  • Organized exchanges are centralized auction-type markets, while the over-the-counter market is an intricate network of asset dealers that take positions in various assets and buy and sell from their own portfolios. (Modern Investment Theory, Haugen, pp. 26)

Market price

  • A market price is the last price at which an asset traded, meaning the most recent price on which a buyer and seller have agreed and at which some amount of the asset was transacted.
  • Closing price is the final price at which an asset was traded for a given trading period. The closing price represents the most current valuation of an asset until the end of the next trading period. Note that the trading period could be virtual (for the sake of research) and not officially regulated.


Return

  • A rate of return expresses the percentage change in your wealth from one period to another. (Modern Investment Theory, Haugen, pp. 389)
  • Returns from investing are crucial to investors; an assessment of return is the only rational way (after allowing for risk) for investors to compare alternative investments that differ in what they promise. The measurement of actual (historical) returns is necessary for investors to assess how well they have done or how well investment managers have done on their behalf. (Investments Analysis and Management, Jones, pp. 114)


Risk

  • Risk means uncertainty about future rates of return. (Investments, Bodie, Kane, Marcus, pp. 123)
  • The chance that the actual outcome from an investment will differ from the expected outcome. (Investments Analysis and Management, Jones, pp. 11)
  • A measure of uncertainty about the outcome from a given event. The greater the variability of possible outcomes, on both the high side and the low side, the greater the risk. (Foundations of Financial Management, Block, Hirt)
  • Risk is identified with the dispersion of returns, i.e. with possible deviations (both positive and negative) from the expected return. (Capital investment & financial decisions, Levy, Sarnat, pp. 237)
  • The standard deviation around the expected return. (Security Analysis and Portfolio Management, Fischer, pp. 560)


Liquidity

  • Liquidity is the ease (speed) at which an asset can be sold and still fetch a fair price. (Investments, Bodie, Kane, Marcus, pp. 864)
  • Liquidity refers to the ease of converting an asset into money quickly, conveniently, and at little exchange cost. (Security Analysis and Portfolio Management, Fischer, pp. 6)
  • The ability to sell an asset quickly without having to make a substantial price concession. (Investments, Sharpe, Alexander, Bailey, pp. 1014)


Arbitrage

  • Arbitrage is the simultaneous purchase and sale of an asset to profit from a difference in the price. It is a trade that profits by exploiting the price differences of identical or similar financial instruments on different markets or in different forms. Arbitrage exists as a result of market inefficiencies and would therefore not exist if all markets were perfectly efficient.
  • Statistical arbitrage comprises a set of quantitatively driven trading strategies, looking to exploit the relative price movements across thousands of financial instruments by analyzing the price patterns and the price differences between financial instruments.
  • Possible scenarios for exploiting arbitrage conditions:
    • Market Neutral Arbitrage – involves taking a long position in an undervalued asset and shorting an overvalued asset simultaneously. The assets should be selected to have similar volatilities, so that an increase in the market will cause the long position to appreciate in value and the short position to depreciate by roughly the same amount.
    • Cross Market Arbitrage – exploiting the price discrepancy of the same asset across markets, buying the asset in the lower-valuing market and selling it in the higher-valuing market simultaneously.
    • Cross Asset Arbitrage – using the price discrepancy between a financial derivative and its underlying asset.
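To make the cross-market scenario concrete, the sketch below computes the net result of buying on the cheaper market and selling on the dearer one. The proportional `fee_rate` charged on each leg is a hypothetical parameter introduced for illustration; real exchanges may charge fees differently, and transfer costs and delays are ignored.

```python
def cross_market_profit(buy_price, sell_price, quantity, fee_rate=0.0):
    """Net profit of buying on the cheaper market and selling on the dearer one.

    fee_rate is a hypothetical proportional fee charged on each leg.
    """
    cost = buy_price * quantity * (1.0 + fee_rate)
    proceeds = sell_price * quantity * (1.0 - fee_rate)
    return proceeds - cost


# A 1% price gap is profitable without fees...
profit_no_fees = cross_market_profit(100.0, 101.0, 10.0)
# ...but disappears once a 1% fee is paid on both legs.
profit_with_fees = cross_market_profit(100.0, 101.0, 10.0, fee_rate=0.01)
print(profit_no_fees, profit_with_fees)
```

The second call illustrates why a visible price difference is not automatically an arbitrage opportunity: the gap has to exceed the total cost of both transactions.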

Investment portfolio

  • Collection of assets. (Investments, Sharpe, Alexander, Bailey, pp. 167)
  • The assets held by an investor taken as a group. (Investments Analysis and Management, Jones, pp. 7)
  • Assets that have return and risk characteristics of their own, in combination, make up a portfolio. (Security Analysis and Portfolio Management, Fischer, pp. 2)
  • Portfolio management is the process of combining assets into a portfolio tailored to the investor’s preferences and needs, monitoring that portfolio, and evaluating its performance. (pp. 2) Portfolio theory deals with the big picture – the risk and return attributes of the investor’s overall portfolio of assets. (pp. 810) (Investments, Bodie, Kane, Marcus)
  • The purpose of using a portfolio approach is to improve the conditions of the investment process by obtaining such properties (values of significant variables) of the combined assets that are not obtainable by any single asset. The most often (but not the only) considered significant variables are risk and return. A certain configuration of risk and return is only possible within a given combination of assets. Improving risk and return conditions through portfolio management is diversification (Marchev, 2012a).


Diversification

  • Another means to control portfolio risk is diversification, by which we mean that investments are made in a wide variety of assets so that the exposure to the risk of any particular asset is limited. (Investments, Bodie, Kane, Marcus, pp. 810)
  • The process of adding assets to a portfolio in order to reduce the portfolio’s unique risk and, thereby, the portfolio’s total risk. (Investments, Sharpe, Alexander, Bailey, pp. 1007)
  • The variance of the return of a portfolio can be reduced by including additional assets in the portfolio, a process referred to as diversification. (Investment Science, Luenberger, pp. 151)
  • Diversification is related to the Central Limit Theorem, which states that the sum of identical and independent random variables with bounded variance is asymptotically Gaussian (the notion of diversification can be extended to more general random variables by the concept of mixing). (Robust Portfolio Optimization and Management, Fabozzi, pp. 19)
  • The more traditional forms of diversification have concentrated upon holding a number of asset types (stock, bonds) across industry lines (utility, mining, manufacturing groups). (Security Analysis and Portfolio Management, Fischer, pp. 560)
  • An investor can construct a diversified portfolio and eliminate part of the total risk, the diversifiable or non-market part (unsystematic part). (Investments Analysis and Management, Jones, pp. 254)
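The effect described in these definitions can be illustrated with a standard textbook special case: an equally weighted portfolio of n assets that all have the same variance σ² and the same pairwise correlation ρ has variance σ²/n + (1 − 1/n)·ρ·σ². The sketch below evaluates this expression; the numbers (variance 0.04, correlation 0.3) are made-up illustration values, not estimates from the datathon data.

```python
def equal_weight_variance(n_assets, asset_variance, correlation):
    """Variance of an equally weighted portfolio of n identical-risk assets."""
    # Part that shrinks as more assets are added (diversifiable risk).
    own = asset_variance / n_assets
    # Part driven by the common correlation (non-diversifiable risk).
    shared = (1.0 - 1.0 / n_assets) * correlation * asset_variance
    return own + shared


for n in (1, 5, 20, 100):
    print(n, equal_weight_variance(n, 0.04, 0.3))
```

As n grows, the first term vanishes while the second approaches ρ·σ², which is the part of the total risk that diversification cannot remove.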

Computation considerations on portfolio significant variables

The mathematical description of a portfolio is a vector of k+1 positions, each with its respective weight, where k is the total number of assets and the additional position is the cash position c(t). For unwanted positions the weights are set to 0. Funds that are not invested are always assumed to be in the cash position. The sum of all weights is equal to 1.00. In brief, an investment portfolio is described by a (k+1)-dimensional vector of non-negative weights summing to 1.00 (i.e. a point on the standard simplex):

$$W(t) = \big[\, w_1(t),\ w_2(t),\ \dots,\ w_k(t),\ c(t) \,\big]$$

where:

  • $W(t)$ – weight structure of the portfolio at discrete time t
  • $c(t)$ – relative weight of the cash position at discrete time t
  • $w_i(t)$ – relative weight of the i-th asset at discrete time t

subject to (no short positions):

$$\sum_{i=1}^{k} w_i(t) + c(t) = 1, \qquad w_i(t) \ge 0, \qquad c(t) \ge 0$$

To calculate the return of each asset at the discrete time t, the particular case is considered where:

  • Short selling is not a possible transaction, meaning that the discrete time s (time of selling) always follows b (time of buying).
  • No income taxes are assumed.
  • Besides the return derived from price change, there are other forms of return on an asset arising during the time of investing. These include airdrops, forks, rewards, burns, etc. All of these must be estimated as a financial inflow or outflow per one unit of currency (e.g. if there is an airdrop of amount z per unit while a position is held, this is a positive return of z).

The return $R_i(t)$ of asset i is calculated from the following quantities:

  • $P_i(s)$ – sell price of asset i at the discrete time s
  • $P_i(b)$ – buy price of asset i at the discrete time b
  • $B_i(t)$ – quantified complementary benefits of asset i for the period between discrete times b and s
  • $K(s)$ – brokerage at the discrete time s
  • t – discrete time of return

The return of an investment portfolio is calculated as a weighted average of the returns of all included assets. The weights correspond to the configuration of the portfolio, i.e. the allocated investment in each position. The sum of all weights (including the cash position) is always equal to 1. The return of the cash position is normally assumed to be 0, so the cash term drops out:

$$R_p(t) = \sum_{i=1}^{k} w_i(t)\, R_i(t)$$

where:

  • $R_p(t)$ – return of the portfolio p at the discrete time t
  • $R_i(t)$ – return of the asset i at the discrete time t
  • $w_i(t)$ – relative weight of position i at the discrete time t

There are several approaches to calculating asset risk. The dominant concept is to use variance and/or standard deviation and/or volatility as a measure of risk. A good case could also be built for using information entropy as a risk measure of a portfolio. So the risk of an individual asset may be formulated as a function of the historical data of the asset's return:

$$V_i(t) = F\big(R_i(t),\ R_i(t-1),\ \dots,\ R_i(t-d+1)\big)$$

where:

  • $V_i(t)$ – risk of the asset i at the discrete time t
  • F – function for measuring the risk of asset i
  • d – number (depth) of historical data points considered for the calculation of risk

No matter what measure is used for portfolio risk, there is strong agreement among authors that “the risk of a portfolio is not a weighted average of the risks of all included assets” (Jones, 1994, p. 573). The risk of a portfolio depends not only on the risks of every included asset, but also on the mutual dependence (interdependence) between and among the assets. The cash position is assumed to have a risk of 0.
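The quantities above can be tied together in a short sketch. The asset return is written here in one plausible form, (P(s) − P(b) + B − K(s)) / P(b), with benefits and brokerage expressed in currency per unit; this exact form is an assumption and the case's own formula may differ. Standard deviation is used as just one possible choice for the risk function F, and all function names are hypothetical.

```python
from statistics import pstdev


def asset_return(sell_price, buy_price, benefits=0.0, brokerage=0.0):
    """Return of one unit of an asset bought at time b and sold at time s.

    ASSUMED form: (P(s) - P(b) + B - K(s)) / P(b), with benefits and
    brokerage expressed in currency per unit. The case's own formula may differ.
    """
    return (sell_price - buy_price + benefits - brokerage) / buy_price


def portfolio_return(weights, returns):
    """Weighted average of asset returns; the cash position contributes 0."""
    return sum(w * r for w, r in zip(weights, returns))


def asset_risk(return_history, depth):
    """Risk as the standard deviation of the last `depth` returns.

    Standard deviation is only one possible choice for the function F.
    """
    return pstdev(return_history[-depth:])


r1 = asset_return(sell_price=110.0, buy_price=100.0, benefits=2.0, brokerage=1.0)
r2 = asset_return(sell_price=95.0, buy_price=100.0)
# 50% in asset 1, 30% in asset 2, the remaining 20% in cash.
rp = portfolio_return([0.5, 0.3], [r1, r2])
v1 = asset_risk([0.01, -0.02, 0.03, 0.00, 0.02], depth=4)
print(r1, r2, rp, v1)
```

Because the cash return is 0, the cash weight does not need to be passed to `portfolio_return`; it only reduces the share of capital exposed to the assets.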
An example approach to measuring portfolio risk uses two additive terms: one for the weighted average of the risks of the included assets and the other for the pairwise interdependence of the assets. The symbols used are:

  • $V_p(t)$ – risk of the portfolio p at the discrete time t
  • $V_i(t)$ – risk of the asset i at the discrete time t
  • $\rho(V_i(t), V_j(t))$ – measure of interdependence of the assets i and j at the discrete time t
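As an illustration of a two-term risk measure, the sketch below uses the classical mean-variance (Markowitz) portfolio variance: an own-risk term plus a pairwise term weighted by correlations. This is a swapped-in, well-known formula chosen for the example, with standard deviation as the asset risk and correlation as the interdependence measure; it is not necessarily the exact expression intended by the case authors.

```python
def portfolio_variance(weights, std_devs, correlations):
    """Classical two-term portfolio variance.

    weights      -- asset weights (cash is left out: its risk is assumed 0)
    std_devs     -- standard deviation of each asset's return
    correlations -- matrix (list of lists) of pairwise correlations
    """
    n = len(weights)
    # First term: each asset's own risk, scaled by its squared weight.
    own = sum((weights[i] * std_devs[i]) ** 2 for i in range(n))
    # Second term: pair-by-pair interdependence of different assets.
    cross = sum(
        weights[i] * weights[j] * std_devs[i] * std_devs[j] * correlations[i][j]
        for i in range(n)
        for j in range(n)
        if i != j
    )
    return own + cross


weights = [0.5, 0.5]
std_devs = [0.2, 0.2]
var_perfect = portfolio_variance(weights, std_devs, [[1.0, 1.0], [1.0, 1.0]])
var_half = portfolio_variance(weights, std_devs, [[1.0, 0.5], [0.5, 1.0]])
var_zero = portfolio_variance(weights, std_devs, [[1.0, 0.0], [0.0, 1.0]])
print(var_perfect, var_half, var_zero)
```

Only with perfect correlation does the portfolio risk equal the weighted average of the asset risks; for any lower correlation it is smaller, which matches the quoted statement that portfolio risk is not a weighted average of the individual risks.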
