Data preparation
The first step in our analysis was to check for missing observations. Some dates were not present in the provided file at all, so we added them and treated the price for that day, hour, or minute as NA. The next step was to difference the data; at that point we had still not dealt with the NA values. Our final decision was to simulate a white-noise time series with the same mean and standard deviation as the differenced data, and to fill every NA in the differenced series with the corresponding value from the white-noise series. Later we substituted each NA in the initial data with the previous value plus the corresponding white-noise value.
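A minimal sketch of this fill step, assuming a hypothetical numeric vector prices containing NAs (the full loop over all columns appears in the code section below):

    d  <- diff(prices)                          # difference the series; NAs propagate
    wn <- rnorm(length(d),
                mean = mean(d, na.rm = TRUE),   # match the mean of the observed differences
                sd   = sd(d, na.rm = TRUE))     # match their standard deviation
    d[is.na(d)] <- wn[is.na(d)]                 # replace NA differences with white noise

The price level is then rebuilt by adding each filled difference to the previous (possibly reconstructed) price.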
Exogenous variables
We decided that we could not build a very reliable model that depends only on its own previous values, since cryptocurrency markets are strongly speculative. Therefore, we built a model using additional exogenous variables. For this task we used Google Trends as well as Yahoo Finance statistics.
From Google Trends we downloaded daily search data for a few keywords. The keywords we chose were 'cryptocurrency', 'bitcoin price', 'etherium price', and 'litecoin price' (as spelled in our actual queries).
From Yahoo Finance we downloaded the VIX index, AMD stock prices, and Nvidia stock prices, again daily, and we spread the daily values across 5-minute intervals. The Chicago Board Options Exchange Volatility Index, known by its ticker symbol VIX, is a popular measure of the stock market's expectation of volatility implied by S&P 500 index options; we expect a negative relationship between the index and the price of cryptocurrencies. As for Nvidia and AMD, they are the two biggest producers of graphics cards, which are used to mine cryptocurrencies, so we expect their stock prices to correlate positively with Bitcoin.
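The same daily-to-5-minute spreading can also be done in R by forward-filling onto a 5-minute grid; a minimal sketch, assuming a hypothetical data frame daily with POSIXct Date and value columns (our actual code, in the code section below, does this step in Python with resample('5min').pad()):

    library(zoo)
    grid   <- data.frame(Date = seq(min(daily$Date), max(daily$Date), by = "5 min"))
    merged <- merge(grid, daily, by = "Date", all.x = TRUE)  # NA wherever no daily value starts
    merged$value <- na.locf(merged$value)                    # carry each daily value forward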
Model Development
From the plots of these exogenous variables and of the Bitcoin price (on which we decided to base our model), we could see that the AMD stock prices and the VIX index values most closely follow the Bitcoin price: the shape of the AMD graph follows the shape of the Bitcoin price graph, while the VIX graph mirrors it. Therefore, we decided to use these as exogenous regressors in an ARIMAX model.
We used the Johansen test to check for cointegration between these two additional variables and the Bitcoin price, but no cointegration appeared to be present. Both series, the AMD stock prices and the VIX index values, are non-stationary. Since we had spread the daily values across 5-minute intervals, differencing the series was not an option because of the resulting runs of zero values. We therefore decided to apply exponential smoothing. We tried two approaches, one of which was the Holt-Winters algorithm; however, the resulting time series were still non-stationary.
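For reference, a minimal sketch of these checks, mirroring the calls in the code section below (bprice and amd are the merged price data frames defined there):

    library(urca)     # ca.jo: Johansen procedure
    library(tseries)  # adf.test: augmented Dickey-Fuller test
    coint  <- merge(bprice, amd, by = "Date")[, 2:3]  # two-column system: Bitcoin price and AMD
    jotest <- ca.jo(coint, type = "trace", K = 2, ecdet = "none", spec = "longrun")
    summary(jotest)        # compare trace statistics with critical values
    adf.test(amd$amd)      # unit-root check on the individual series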
We decided to work with an ARIMAX model whose exogenous regressors are the VIX index, AMD stock prices, NVIDIA stock prices, the frequency of the 'cryptocurrency' search on Google, and the frequency of the 'bitcoin price' search on Google. Our dependent variable is the change in the Bitcoin price. The model order is (1,0,2), based on the ACF and PACF graphs and on auto.arima. We then performed a rolling-sample algorithm to train the model and graphed the predicted values over the real ones (graph below). We achieved 52.19% directional symmetry accuracy. We would further proceed by creating an algorithm that decides when and whether to invest based on the change in direction of our model's prediction.
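The rolling scheme itself, as a minimal sketch (window of 2016 five-minute bars, i.e. one week, with x the differenced Bitcoin price series; the full version, which also reports RMSE, is in the code section below):

    w  <- 2016                                  # window length: one week of 5-minute bars
    ff <- rep(NA, length(x))
    for (i in 1:(length(x) - w)) {
      fit       <- arima(x[i:(i + w - 1)], order = c(1, 0, 2))
      ff[i + w] <- predict(fit, n.ahead = 1)$pred[1]  # one-step-ahead forecast of the change
    }
    hits <- sum(ff * x > 0, na.rm = TRUE)       # same sign = direction predicted correctly
    hits / sum(!is.na(ff)) * 100                # directional symmetry accuracy, in %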
CODE
#R code#
# Import data
dd1 = read.csv("price_data.csv")

# Create a subset of the relevant columns
ddd <- subset(dd1, select = c(1:17, 20, 25, 34))
dd <- ddd[c(1:15258, 15260:15266), ]   # drop row 15259

# Plot the data for Bitcoin
# install.packages('tseries')
library(tseries)
plot(ddd$time, ddd$X1442)

# Create a full matrix -> deal with missing dates and data
library(dplyr)
ts <- seq.POSIXt(as.POSIXlt("2018-01-17 11:25:00"), as.POSIXlt("2018-03-23 14:00:00"), by = "5 min")
df <- data.frame(timestamp = ts)
dd$timestamp = as.POSIXct(dd$time)
data_with_missing_times <- full_join(df, dd, by = 'timestamp')
# plot(data_with_missing_times$timestamp, data_with_missing_times$X1442, type = 'l')

# Fill NAs in the differenced series with white-noise values
library(fBasics)
data_fill = data_with_missing_times[-1, ]
for (j in 3:ncol(data_fill)) {
  data_fill[, j] = diff(data_with_missing_times[, j], differences = 1)
  stats <- t(basicStats(data_fill[, j]))
  mean = stats[, 7]     # column 7 of t(basicStats(...)) is the mean
  stdev = stats[, 14]   # column 14 is the standard deviation
  wnoise = rnorm(nrow(data_fill), mean = mean, sd = stdev)
  for (i in 1:nrow(data_fill)) {
    if (is.na(data_fill[i, j])) {
      data_fill[i, j] = wnoise[i]
    }
  }
  # plot(data_fill$timestamp, data_fill[, j], type = 'l')
}

#Python code to extract exogenous variables#
# Download data for the exogenous variables
import datetime

import pandas as pd
import pandas_datareader.data as web
from pytrends.request import TrendReq

# Query stock data from Yahoo! Finance using pandas_datareader
# Specify start and end time
start = datetime.datetime(2018, 1, 17)
end = datetime.datetime(2018, 3, 25)
vix = web.DataReader('^VIX', 'yahoo', start, end)
amd = web.DataReader('AMD', 'yahoo', start, end)
nvda = web.DataReader('NVDA', 'yahoo', start, end)

# Clean the data: keep only the 'High' column and upsample the daily values to 5-minute intervals
vix = vix.resample('5min').pad().drop(['Open', 'Low', 'Close', 'Adj Close', 'Volume'], axis='columns')
amd = amd.resample('5min').pad().drop(['Open', 'Low', 'Close', 'Adj Close', 'Volume'], axis='columns')
nvda = nvda.resample('5min').pad().drop(['Open', 'Low', 'Close', 'Adj Close', 'Volume'], axis='columns')
amd.columns = ['amd']
nvda.columns = ['nvda']
vix.columns = ['vix']

# Get data from Google Trends using the pytrends API
pytrend = TrendReq()
from_date = '2018-01-17'
end_date = '2018-03-24'

# keyword = 'cryptocurrency', category = 16 (news); the short timeframe makes Google Trends return daily data
pytrend.build_payload(kw_list=['cryptocurrency'], cat=16, timeframe=from_date + ' ' + end_date)
ggtrends_1 = pytrend.interest_over_time()
ggtrends_1 = ggtrends_1.resample('5min').pad().drop(['isPartial'], axis='columns')  # upsample daily to 5-minute
ggtrends_1.columns = ['gg_crypto']

# keyword = 'bitcoin price', category = 0 (all)
pytrend.build_payload(kw_list=['bitcoin price'], cat=0, timeframe=from_date + ' ' + end_date)
ggtrends_2 = pytrend.interest_over_time()
ggtrends_2 = ggtrends_2.resample('5min').pad().drop(['isPartial'], axis='columns')
ggtrends_2.columns = ['gg_bitcoin_p']

# keyword = 'etherium price', category = 0 (all)
pytrend.build_payload(kw_list=['etherium price'], cat=0, timeframe=from_date + ' ' + end_date)
ggtrends_3 = pytrend.interest_over_time()
ggtrends_3 = ggtrends_3.resample('5min').pad().drop(['isPartial'], axis='columns')
ggtrends_3.columns = ['gg_etherium_p']

# keyword = 'litecoin price', category = 0 (all)
pytrend.build_payload(kw_list=['litecoin price'], cat=0, timeframe=from_date + ' ' + end_date)
ggtrends_4 = pytrend.interest_over_time()
ggtrends_4 = ggtrends_4.resample('5min').pad().drop(['isPartial'], axis='columns')
ggtrends_4.columns = ['gg_litecoin_p']

#R code#
# External regressors
ccur = read.csv("cryptocurrency.csv")
bsearch = read.csv("bitcoin.csv")
esearch = read.csv("etherium.csv")
lsearch = read.csv("litecoin.csv")
nvda = read.csv("nvda.csv")
amd = read.csv("amd.csv")
vix = read.csv("vix.csv")
# Plot the series
plot(data_with_missing_times$timestamp, data_with_missing_times$X1442, type = 'l')
plot(ccur$date, ccur$gg_crypto, type = 'l')
plot(bsearch$date, bsearch$gg_bitcoin_p)
plot(nvda$Date, nvda$nvda)

windows()
layout(matrix(c(1:8), nrow = 4, ncol = 2))
plot(nvda$Date, nvda$nvda, type = 'l', main = 'NVDA')
plot(amd$Date, amd$amd, type = 'l', main = 'AMD')
plot(bsearch$date, bsearch$gg_bitcoin_p, type = 'l', main = 'B Search')
plot(esearch$date, esearch$gg_etherium_p, type = 'l', main = 'E Search')
plot(lsearch$date, lsearch$gg_litecoin_p, type = 'l', main = 'L Search')
plot(ccur$date, ccur$gg_crypto, type = 'l', main = 'CryptoCurr Search')
plot(vix$Date, vix$vix, type = 'l', main = 'VIX')
plot(data_with_missing_times$X1442, type = 'l', main = 'Bitcoin Price')

# Rename the timestamp column
library(gdata)
data_with_missing_times <- rename.vars(data_with_missing_times, from = "timestamp", to = "Date")

# Bitcoin price
bprice = data_with_missing_times[, c(1, 3)]

# Fill NAs in the initial data: previous value plus the corresponding white-noise increment
for (i in 1:nrow(bprice)) {
  if (is.na(bprice[i, 2])) {
    bprice[i, 2] = bprice[i - 1, 2] + data_fill[i, 3]
  }
}

amd$Date = as.POSIXct(amd$Date)
bprice$Date = as.POSIXct(bprice$Date)
vix$Date = as.POSIXct(vix$Date)

# Check for cointegration (Johansen trace test)
library(urca)   # provides ca.jo
coint = merge(bprice, amd, by = "Date")
coint_nodate = coint[, 2:3]
jotest = ca.jo(coint_nodate, type = "trace", K = 2, ecdet = "none", spec = "longrun")
summary(jotest)

coint = merge(bprice, vix, by = "Date")
coint_nodate = coint[, 2:3]
jotest = ca.jo(coint_nodate, type = "trace", K = 2, ecdet = "none", spec = "longrun")
summary(jotest)

# Plot
windows()
layout(matrix(c(1:3), nrow = 3, ncol = 1))
plot(bprice$Date, bprice$X1442, type = 'l', main = 'Bprice')
plot(vix$Date, vix$vix, type = 'l', main = 'VIX')
plot(amd$Date, amd$amd, type = 'l', main = 'AMD')

## Holt-Winters
vix_h = HoltWinters(vix$vix, alpha = 0.15, beta = 0.92, gamma = FALSE)
# plot(vix_h$fitted)
level_vix = vix_h$fitted[, 2]
# plot(level_vix)
amd_h = HoltWinters(amd$amd, alpha = 0.15, beta = 0.92, gamma = FALSE)
# plot(amd_h$fitted)
level_amd = amd_h$fitted[, 2]
# plot(level_amd)
# adf.test(level_amd)
# adf.test(level_vix)
# layout(matrix(c(1:2), nrow = 2, ncol = 1))
# acf(data_fill$X1442)
# pacf(data_fill$X1442)

# Exponential smoothing
library(smooth)   # provides es()
es_vix = es(vix$vix)
vixtest = es_vix$fitted[2:18752, ]
# plot(es_vix$fitted)
# adf.test(es_vix$fitted)
es_amd = es(amd$amd)
amdtest = es_amd$fitted[2:18752, ]

# Create the xreg matrix
# (nvdatest, ccurtest, bstest and estest are smoothed in the same way as vixtest and amdtest)
xreg = matrix(ncol = 5, nrow = 18751)
xreg[, 1] = vixtest
xreg[, 2] = amdtest
xreg[, 3] = nvdatest
xreg[, 4] = ccurtest
xreg[, 5] = bstest

# ARIMAX models with a single exogenous regressor each
# VIX
fit11 = arima(data_fill$X1442, order = c(1, 0, 2), xreg = vixtest)
predict(fit11, newxreg = 1)
# AMD
fit21 = arima(data_fill$X1442, order = c(1, 0, 2), xreg = amdtest)
predict(fit21, newxreg = 1)
# BSearch
fit31 = arima(data_fill$X1442, order = c(1, 0, 2), xreg = bstest)
predict(fit31, newxreg = 1)
# ESearch
fit41 = arima(data_fill$X1442, order = c(1, 0, 2), xreg = estest)
predict(fit41, newxreg = 1)
# NVDA
fit51 = arima(data_fill$X1442, order = c(1, 0, 2), xreg = nvdatest)
predict(fit51, newxreg = 1)
# ccur
fit61 = arima(data_fill$X1442, order = c(1, 0, 2), xreg = ccurtest)
predict(fit61, newxreg = 1)

# Auto arima
library(forecast)   # provides auto.arima
afit = auto.arima(data_fill$X1442, xreg = xreg)

# Predict with ARIMA(1,0,2), using the last observed regressor values
newxreg1 = matrix(NA, nrow = 1, ncol = 5)
for (i in 1:5) {
  newxreg1[, i] = xreg[dim(xreg)[1], i]
}
predict(afit, newxreg = newxreg1, n.ahead = 1)

# Rolling sample: refit on a sliding one-week window (2016 five-minute bars)
x = data_fill$X1442
mm = list()
for (i in 1:(length(x) - 2016 + 1)) {
  mm[[i]] = arima(x[i:(i + 2015)], order = c(1, 0, 2))
}
f = list()
ff = rep(0, length(x))
for (i in 1:(length(x) - 2016 + 1)) {
  f[[i]] = predict(mm[[i]], n.ahead = 1)
  ff[2016 + i] = f[[i]]$pred[1]
}
windows()
plot(x, col = "blue", type = "l", lwd = 0.75)
lines(ff, col = "red", lwd = 2)
sqrt(mean((x[2017:length(x)] - ff[2017:length(x)])^2))   # out-of-sample RMSE

# Check for directional symmetry
ff_dir = 0
for (i in 2017:(length(ff) - 1)) {   # parentheses matter: 2017:length(ff)-1 would shift the whole range
  if (ff[i] * x[i] > 0) {            # same sign: direction of the change predicted correctly
    ff_dir = ff_dir + 1
  }
}
# Directional symmetry accuracy percentage = 52.19%
symacc = (ff_dir / (length(ff) - 1)) * 100
4 thoughts on “Datathron”
I really like that the whole modelling process is backed by sound business logic. Your approach of incorporating exogenous features such as data on relevant stocks and Google Trends adds a significant degree of originality to the research methodology.
Data prep is conducted in compliance with the core theoretical requirements.
The implementation of the rolling sample is correct.
What I would advise on the model is to consider statistical significance, especially for the estimates associated with the exogenous explanatory variables. Also, looking at the plot of actual vs. predicted one-step-ahead values, the model captures the series' volatility really tightly for the first 12,000 observations. To tune the model better, you might investigate the reason behind the deteriorated performance afterwards. Once again, my suggestion is to inspect how the statistical significance of the delivered estimates changes over time.
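For example, rough z-statistics can be pulled straight out of an arima fit (a sketch against, say, fit11 from the code above):

    z <- fit11$coef / sqrt(diag(fit11$var.coef))   # estimate / standard error
    2 * pnorm(-abs(z))                             # approximate two-sided p-values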
Congrats on reporting the directional symmetry figure!
Also, I really like the way your workflow is organized, taking advantage of both R and Python and using whichever is best suited to the research task at hand.
Great job, guys!
You started promisingly but then switched to plain Python source; sorry, but this is not a human-readable article.
It seems you did some work, but it is not readable: no syntax highlighting on the code, no plots, etc. Please update your article; the website has functionality for uploading .ipynb files directly, or, if you prefer, you can upload it as HTML.
As a jury member, I am not able to give you a good score.
After you update it, the mentors will be able to give you feedback and recommend you some more approaches.
We would appreciate it if, for the next Datathon, this site also supported R notebooks (Rmd); that would be convenient for people who predominantly use R rather than Python (as we have done above).