- Ana Popova, @anie
- Izabella Taskova, @izabellataskova
- Kamelia Kosekova, @kameliak
- Kameliya Lokmadzhieva, @kameliyalokmadzhieva
- Nikolay Bojurin, @nikolay
Mentors: @boryana @alex-efremov @pepe
Team name: DAB PANDA
NB! Our notebooks are available here: DAB PANDA Rmds
Data Understanding and Preparation
You may see our code with results and brief comments if you dab here
Cryptocurrencies… are they as cryptic as the name suggests? Perhaps we’ll know at the end of this journey. Let’s start dabbing!
As a start we need to take a look at what we have. And we have a loooot of files.
For level 1 we need to predict the prices of 20 cryptocurrencies, so we focus on the price series data. That information is available either in the separate files for the different currencies or in price_data.csv. We opted for the latter. What we discovered was, to say the least, interesting…
In the data preparation stage we discovered a discrepancy. Originally, we have 15 267 observations. However, we know that each full day should have 288 observations (one every 5 minutes). The period under consideration covers ‘2018-01-17 11:25:00’ – ‘2018-03-23 14:00:00’, i.e. 66 days: 64 full days and 2 incomplete ones.
Let’s figure out how many observations in total we should have by breaking down that period:
– for day 1 (2018-01-17, from 11:25): 151 observations
– for the 64 full days: 18 432 observations
– for day 66 (2018-03-23, up to 14:00): 169 observations
Woooow! There is a big difference between 15 267 and 18 752. To find out what we are missing, we create a sequence of all date-times within the period with a step of 5 minutes (you may see this in code form in our notebook – dab).
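Such a grid is a one-liner with `seq()` on POSIXct timestamps – a minimal sketch (the timezone is assumed to be UTC here; the original data may use another zone):

```r
# Build the complete 5-minute timestamp grid for the whole period
full_dates <- seq(
  from = as.POSIXct("2018-01-17 11:25:00", tz = "UTC"),
  to   = as.POSIXct("2018-03-23 14:00:00", tz = "UTC"),
  by   = "5 min"
)
length(full_dates)  # 18752 timestamps, matching the count above
```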
Next, we merge the coin data with the full list of dates and find that we get 1 extra observation, which is weird. So, we check for duplicates and discover one! Then we get rid of the imposter row!
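A toy sketch of the merge-and-deduplicate step (the column names are illustrative, not the actual ones from price_data.csv):

```r
# Toy version of the left join onto the full grid plus duplicate removal
grid <- data.frame(datetime = seq(
  as.POSIXct("2018-01-17 11:25:00", tz = "UTC"), by = "5 min", length.out = 6
))
prices <- data.frame(
  datetime = grid$datetime[c(1, 2, 2, 4)],  # one duplicated timestamp, some gaps
  price    = c(100, 101, 101, 103)
)
merged <- merge(grid, prices, by = "datetime", all.x = TRUE)  # left join
nrow(merged)                                  # 7: one extra row from the duplicate
merged <- merged[!duplicated(merged$datetime), ]
nrow(merged)                                  # 6: NAs now mark the missing timestamps
```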
We learn that for each coin we have 3 578 missing values.
To tackle the missing values, we decide to look at the log-differenced prices. On that basis we interpolate the missing values by simulating white noise (one series for each of the 20 coins). You may see our pretty plots before and after the interpolation in the link we have provided for this stage (dab).
After that we need to retrieve prices on the original scale, so we reverse the log and diff transformations with a lovely loop that performs some reverse engineering feats!
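On one toy series, the whole round trip looks roughly like this (the seed and noise scale are illustrative; we did this per coin, and note that the naive noise fill shifts the level after the gap):

```r
# Fill gaps in the log-differenced series with white noise,
# then reverse the diff (cumsum) and the log (exp)
set.seed(42)
p  <- c(100, 102, 101, NA, NA, 105, 107)       # toy prices with a gap
d  <- diff(log(p))                             # log returns; NA around the gap
na <- is.na(d)
d[na] <- rnorm(sum(na), mean = 0, sd = sd(d, na.rm = TRUE))
p_filled <- exp(cumsum(c(log(p[1]), d)))       # reverse engineering: back to prices
anyNA(p_filled)                                # FALSE: the gap is gone
```

Only the observed values before the gap are reproduced exactly; values after it drift by however much the simulated noise sums to.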
We then create an empty dataframe and feed all of the data into it – this is our orig set!
We plot the price series – before and after the interpolation! (see our pretty graphs on RPubs by dabbing here)
We look at the correlations between the coins to see how the different series relate to one another.
We look at ACF and PACF plots for the complete log-differenced data – for all 20 coins.
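For a single coin this inspection is a pair of one-liners – sketched here on a simulated stand-in series, since the actual orig column names vary:

```r
# ACF and PACF of a log-differenced series (simulated stand-in data)
set.seed(1)
lp <- cumsum(rnorm(500))            # random-walk "log price"
ld <- diff(lp)                      # log-differenced series
a  <- acf(ld, plot = FALSE)         # plot = TRUE draws the usual correlogram
pa <- pacf(ld, plot = FALSE)
a$acf[1]                            # lag-0 autocorrelation is always 1
```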
Finally we look at the histograms for the 20 coins!
Prelude – dab here
The orig dataframe is the one we use for modelling: the initial dataset with the missing observations imputed. We transform it from a dataframe into a time-series object.
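A minimal sketch of that conversion, assuming a numeric price column and 288 observations per day (the dataframe and column names here are stand-ins):

```r
# Turn an (imputed) price column into a ts object with daily periodicity
orig_df <- data.frame(price = rnorm(576))        # stand-in for the real orig data
orig_ts <- ts(orig_df$price, frequency = 288)    # 288 five-minute points per day
frequency(orig_ts)  # 288
```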
Next, we look for models that would be appropriate for the different coin prices. We look at combinations of (p, d, q), with p and q between 0 and 7. We perform this for all 20 coins and evaluate the models by the Ljung-Box p-value, the sum of squared residuals and the Akaike information criterion.
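A condensed sketch of that search on a simulated series (we scanned p and q up to 7; here 0–2 to keep it quick):

```r
# Grid search over ARIMA(p, 1, q), scored by AIC, residual SS and Ljung-Box p-value
set.seed(2)
x <- cumsum(rnorm(300))                           # stand-in log-price series
models <- expand.grid(p = 0:2, q = 0:2)
models$aic <- models$ssr <- models$lb_p <- NA_real_
for (i in seq_len(nrow(models))) {
  fit <- tryCatch(arima(x, order = c(models$p[i], 1, models$q[i])),
                  error = function(e) NULL)
  if (is.null(fit)) next                          # skip non-converging fits
  models$aic[i]  <- AIC(fit)
  models$ssr[i]  <- sum(residuals(fit)^2)
  models$lb_p[i] <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box")$p.value
}
head(models[order(models$aic), ], 3)              # best few models by AIC
```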
We discover that some models tend to perform well across multiple coins, such as ARIMA(0,1,6) and ARIMA(6,1,0).
We also look at the residuals for all 20 coins for ARIMA(0,1,6) on log data.
Next, we provide a list of the models with the highest p-values in our second RPubs link – to see it, dab here!
We applied ARIMA with a rolling window using a loop, beginning with a historical subset of the first 7 days, i.e. 2 016 observations.
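The loop looks roughly like this (our window was 2 016 observations; it is shortened here to keep the toy example fast, and ARIMA(0,1,1) stands in for whichever order was picked per coin):

```r
# Rolling-window one-step-ahead ARIMA forecasts on a stand-in series
set.seed(7)
x <- cumsum(rnorm(130))                  # stand-in log-price series
window <- 100                            # we used 2016 (= 7 days x 288)
ff <- rep(NA_real_, length(x))           # one-step-ahead forecasts
for (t in window:(length(x) - 1)) {
  fit <- arima(x[(t - window + 1):t], order = c(0, 1, 1))
  ff[t + 1] <- predict(fit, n.ahead = 1)$pred
}
idx  <- (window + 1):length(x)
rmse <- sqrt(mean((x[idx] - ff[idx])^2)) # same RMSE-style metric as reported below
```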
We managed to obtain results for several coins, among which: Dash, Bitcoin Gold, Dogecoin, Ripple and Litecoin.
For Dash: see here
Results for Dogecoin:

```
> sqrt(mean((x[2017:length(x)] - ff[2017:length(x)])^2))
[1] 0.006681334
> mean(ff[2017:length(ff)])
[1] -5.270246
> mean(x[2017:length(x)])
[1] -5.270284
> sd(ff[2017:length(ff)])
[1] 0.2668967
> sd(x[2017:length(x)])
[1] 0.2669037
```
```
> sqrt(mean((y[2017:length(y)] - gg[2017:length(y)])^2))
[1] 0.003165124
> mean(gg[2017:length(gg)])
[1] 6.960584
> mean(y[2017:length(y)])
[1] 6.96065
> sd(gg[2017:length(gg)])
[1] 0.0230234
> sd(y[2017:length(y)])
[1] 0.02289996
```
```
> sqrt(mean((y[2017:length(y)] - gg[2017:length(y)])^2))
[1] 0.002051064
> mean(gg[2017:length(gg)])
[1] 6.748966
> mean(y[2017:length(y)])
[1] 6.748907
> sd(gg[2017:length(gg)])
[1] 0.03380054
> sd(y[2017:length(y)])
[1] 0.03372321
```