Prediction systems

Datathon Telenor Solution – WILDLINGS ANALYSIS ON TELENOR – GAME OF THRONES

Cell phones have become a necessity for many people throughout the world. Keeping in touch with family and business associates and having access to email are only a few of the reasons for their increasing importance. Today’s technically advanced cell phones can not only place and receive calls but also store data, take pictures, and even serve as walkie-talkies, to name just a few of the available options.
The dataset, from The Telenor Case – What do Game of Thrones and Telecoms Have in Common?, contains data on delays in networks (RAVENS). The delays cover the period 26/07/2018 – 05/08/2018. Each RAVEN_NAME represents a tower. There are 7847 unique RAVEN_NAMES across the 2G, 3G and 4G networks, and there are 5 unique families.
To provide an optimum solution to the business problems, we solve the problem in two steps: (i) data analysis and coding in PYTHON and (ii) time series model building in R Studio.
In the data analysis we counted the delays (failures) of the RAVENS, found the top 10 RAVENS with and without fails, and identified the family names and member names with the most and the least fails.
Prediction and forecasting are done by building time series models. As the name suggests, this means working with time-based data (years, days, hours, minutes) to derive hidden insights that support informed decisions. Time series models are very useful when the data is serially correlated. To predict the four days, we divided the mobile data into train and test sets. We did the time series analysis using ARIMA, the simple exponential method and Recurrent Neural Networks (RNN).
Finally, by comparing the root mean square error of these algorithms, we found RNN (Recurrent Neural Networks) to be the best algorithm for predicting the coming days. Based on the RNN algorithm, the delays for the next four days were predicted. We plotted graphs for all the time series models.


The Telenor Case – What do Game of Thrones and Telecoms Have in Common?
The dataset we have chosen is The Telenor Case. It records mobile data failures over the period 26/07/2018 – 05/08/2018. It contains information about RAVENS communication and the network (2G/3G/4G) over which the information was shared, as well as who initiated the communication, given as family and member name (example: Lannister, Tyrion). It also contains different types of failures, that is, different types of delays.
1. BUSINESS UNDERSTANDING:
This initial phase focuses on understanding the dataset objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. A decision model, specifically a time series model, was to be built on the dataset.

2. DATA UNDERSTANDING:

Data understanding starts with an initial analysis of the data in order to get familiar with its content, to identify data quality problems, to discover first insights, and to detect interesting subsets that reveal hidden information and support the required predictions.

The dataset is nearly 4 GB and contains 30091754 rows and 16 columns. Because of its size it was very difficult to import into PYTHON and R, but in the end we successfully imported it into PYTHON for exploratory data analysis.
CODE FOR IMPORTING THE DATA INTO PYTHON:
import pandas as pd
# the file is semicolon-delimited
telenor = pd.read_csv("data.csv", delimiter=';')
telenor
telenor.info()

OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30091754 entries, 0 to 30091753
Data columns (total 16 columns):
DATETIME                         object
RAVEN_NAME                       object
FAMILY_NAME                      object
MEMBER_NAME                      object
NETWORK                          object
FIRST_GET_RESPONSE_SUCCESS_D     int64
PAGE_BROWSING_DELAY              int64
TCP_SETUP_TOTAL_DELAY            int64
PAGE_CONTENT_DOWNLOAD_TOTAL_D    int64
FIRST_DNS_RESPONSE_SUCCESS_D     int64
DNS_RESPONSE_SUCCESS_DELAY       int64
FIRST_TCP_RESPONSE_SUCCESS_D     int64
PAGE_SR_DELAYS                   int64
SYN_SYN_DELAY                    int64
TCP_CONNECT_DELAY                int64
PAGE_BROWSING_DELAYS             int64
dtypes: int64(11), object(5)
memory usage: 3.6+ GB

TO LIST THE COLUMNS THAT THE DATASET CONTAINS:
telenor.columns
Out[4]:
Index(['DATETIME', 'RAVEN_NAME', 'FAMILY_NAME', 'MEMBER_NAME', 'NETWORK',
       'FIRST_GET_RESPONSE_SUCCESS_D', 'PAGE_BROWSING_DELAY',
       'TCP_SETUP_TOTAL_DELAY', 'PAGE_CONTENT_DOWNLOAD_TOTAL_D',
       'FIRST_DNS_RESPONSE_SUCCESS_D', 'DNS_RESPONSE_SUCCESS_DELAY',
       'FIRST_TCP_RESPONSE_SUCCESS_D', 'PAGE_SR_DELAYS', 'SYN_SYN_DELAY',
       'TCP_CONNECT_DELAY', 'PAGE_BROWSING_DELAYS'],
      dtype='object')
3. DATA ANALYSIS:
From this we learned that a RAVEN is a TOWER, the channel for voice calls and data usage. To share information from one person or point to another, we need a channel or network to carry it. In this dataset we found 7847 unique RAVEN NAMES. The networks are 2G, 3G and 4G.
The 5 unique family names are Targerian, Greyjoy, Stark, Lannister and Baelish, and there are 28 unique member names. The DATETIME ranges from 26/07/2018 to 05/08/2018.
With this understanding of the data we started to identify data quality problems and to discover first insights into the data.
4. EVALUATION:
PROBLEM – (i) Top 10 ravens with fails :
SOLUTION :
# count the rows of telenor_most_delays for each raven
top_10_raven_fails = telenor_most_delays[['RAVEN_NAME','DATETIME']].groupby('RAVEN_NAME').count()
top_10_raven_fails = top_10_raven_fails.rename(columns={'DATETIME':'FAILURES'})
# sort in descending order and keep the ten largest counts
top_10_raven_fails.sort_values('FAILURES',ascending=False).head(10)
RAVEN_NAME                       FAILURES
Brass raven Birdy                218372
Brown raven Ruby                 210211
Yellow raven Rio                 209263
Blue raven Axel                  190226
Razzle Dazzle Rose raven Cleo    186966
Cadmium Red raven Bubba          184785
Vain And Lazy raven Polly        177113
Fearful Carrion raven Gizmo      175197
Blast Off Bronze raven Zazu      169995
Loving raven Maxwell             169584
EXPLANATION : For this problem we found the top 10 RAVENS with failures and without failures by dividing the data into two categories, telenor_most_delays and telenor_least_delays.
If any delay column of a row contains ‘1’, we treat the delay as TRUE, meaning there is a delay in the network, and assign the row to telenor_most_delays. If all delay columns of a row contain ‘0’, we treat the delay as FALSE, meaning there is no delay in the network, and assign the row to telenor_least_delays.
To find the top 10 ravens with fails, we grouped the telenor_most_delays category by RAVEN_NAME, counted the FAILURES for each RAVEN_NAME, sorted the counts in descending order and took the top 10.
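The split and the count described in this explanation can be sketched in plain Python on a few made-up rows. This is only an illustration under assumptions: the sample rows, the two delay columns chosen and the helper names `has_delay` and `top_ravens` are hypothetical and do not come from our notebook, where the same steps were done with pandas on the full data.

```python
from collections import Counter

# Hypothetical sample rows; the real dataset has 11 delay columns,
# only two of them are used here to keep the sketch short.
FLAG_COLUMNS = ["PAGE_BROWSING_DELAY", "TCP_CONNECT_DELAY"]

rows = [
    {"RAVEN_NAME": "Brass raven Birdy", "PAGE_BROWSING_DELAY": 1, "TCP_CONNECT_DELAY": 0},
    {"RAVEN_NAME": "Brass raven Birdy", "PAGE_BROWSING_DELAY": 0, "TCP_CONNECT_DELAY": 1},
    {"RAVEN_NAME": "Brown raven Ruby", "PAGE_BROWSING_DELAY": 1, "TCP_CONNECT_DELAY": 1},
    {"RAVEN_NAME": "Brown raven Ruby", "PAGE_BROWSING_DELAY": 0, "TCP_CONNECT_DELAY": 0},
    {"RAVEN_NAME": "Weak raven Buddy", "PAGE_BROWSING_DELAY": 0, "TCP_CONNECT_DELAY": 0},
]


def has_delay(row):
    """True when at least one delay column of the row holds 1."""
    return any(row[col] == 1 for col in FLAG_COLUMNS)


def top_ravens(subset, n=10):
    """Count the rows per RAVEN_NAME and return the n largest counts."""
    return Counter(row["RAVEN_NAME"] for row in subset).most_common(n)


# Rows with any delay go to one category, rows without any delay to the other.
most_delays = [row for row in rows if has_delay(row)]
least_delays = [row for row in rows if not has_delay(row)]

top_fails = top_ravens(most_delays)
top_no_fails = top_ravens(least_delays)
print(top_fails)
print(top_no_fails)
```

Every row lands in exactly one of the two categories, so the two lists together always add up to the full dataset.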
PROBLEM – (ii) Top 10 ravens without fails :
SOLUTION :
# count the rows of telenor_least_delays for each raven
top_10_raven_no_fails = telenor_least_delays[['RAVEN_NAME','DATETIME']].groupby('RAVEN_NAME').count()
# the column keeps the name FAILURES, but here it counts rows without any delay
top_10_raven_no_fails = top_10_raven_no_fails.rename(columns={'DATETIME':'FAILURES'})
top_10_raven_no_fails.sort_values('FAILURES',ascending=False).head(10)
RAVEN_NAME                              FAILURES
Metallic Sunburst raven Polly           297
Green Sheen raven Azul                  211
Less Combative raven Zazu               191
Weak raven Buddy                        188
Copper raven Tweety                     179
Spectral Yellow raven Zazu              148
Mythical raven Tiki                     116
Cyber Grape raven Faith                 104
Mysterious And Venerable raven Bubba    98
Shadow Blue raven Sammy                 95

 

EXPLANATION : To find the top 10 ravens without fails, we grouped the telenor_least_delays category by RAVEN_NAME, counted the rows for each RAVEN_NAME (shown in the FAILURES column), sorted the counts in descending order and took the top 10.
PROBLEM – (iii) The family with most fails :
SOLUTION:
top_10_family_most_fails = telenor_most_delays[['FAMILY_NAME','DATETIME']].groupby('FAMILY_NAME').count()
top_10_family_most_fails = top_10_family_most_fails.rename(columns={'DATETIME':'FAILURES'})
# descending order, so the first row is the family with the most fails
top_10_family_most_fails.sort_values('FAILURES',ascending=False).head(1)
FAMILY_NAME FAILURES
Targerian 8154000

EXPLANATION: To find the family with the most fails, we grouped the telenor_most_delays category by FAMILY_NAME, counted the FAILURES for each FAMILY_NAME, sorted the counts in descending order and took the first family.

PROBLEM – (iv) The family with least fails:
SOLUTION:
top_10_family_least_fails = telenor_most_delays[['FAMILY_NAME','DATETIME']].groupby('FAMILY_NAME').count()
top_10_family_least_fails = top_10_family_least_fails.rename(columns={'DATETIME':'FAILURES'})
# ascending order, so the first row is the family with the least fails
top_10_family_least_fails.sort_values('FAILURES',ascending=True).head(1)
FAMILY_NAME FAILURES
Petyr Baelish 2744406
EXPLANATION : To find the family with the least fails, we grouped the telenor_most_delays category by FAMILY_NAME, counted the FAILURES for each FAMILY_NAME, sorted the counts in ascending order and took the first family.
PROBLEM – (v) The family member with most fails :
SOLUTION:
top_10_members_most_fails = telenor_most_delays[['MEMBER_NAME','DATETIME']].groupby('MEMBER_NAME').count()
top_10_members_most_fails = top_10_members_most_fails.rename(columns={'DATETIME':'FAILURES'})
# descending order, so the first row is the member with the most fails
top_10_members_most_fails.sort_values('FAILURES',ascending=False).head(1)
MEMBER_NAME FAILURES
Petyr Baelish 2744406
EXPLANATION : To find the family member with the most fails, we grouped the telenor_most_delays category by MEMBER_NAME, counted the FAILURES for each MEMBER_NAME, sorted the counts in descending order and took the first member.
PROBLEM – (vi) The family member with least fails :
SOLUTION:
top_10_members_least_fails = telenor_most_delays[['MEMBER_NAME','DATETIME']].groupby('MEMBER_NAME').count()
top_10_members_least_fails = top_10_members_least_fails.rename(columns={'DATETIME':'FAILURES'})
# ascending order, so the first row is the member with the least fails
top_10_members_least_fails.sort_values('FAILURES',ascending=True).head(1)
MEMBER_NAME FAILURES
Euron 491454
EXPLANATION : To find the family member with the least fails, we grouped the telenor_most_delays category by MEMBER_NAME, counted the FAILURES for each MEMBER_NAME, sorted the counts in ascending order and took the first member.
5. MODELLING – TIME SERIES ANALYSIS:
We took the date and the number of rows from the dataset, created a new dataset from them, exported it, and did the time series analysis in an R Markdown file. We did the time series analysis using ARIMA, the simple exponential method and Recurrent Neural Networks (RNN).
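One way to build such a date-level series without loading the whole file into memory is to stream the rows and count per date. The sketch below is an illustration only: the sample text, the DATETIME layout (date, a space, then time), the single flag column and the helper name `daily_failures` are all assumptions, not the code we used for the export.

```python
import csv
import io
from collections import Counter

# Hypothetical semicolon-delimited sample in the layout of the dataset;
# only one delay column is kept to keep the sketch short.
SAMPLE = """DATETIME;RAVEN_NAME;PAGE_BROWSING_DELAY
26/07/2018 00:15;Brass raven Birdy;1
26/07/2018 09:30;Brown raven Ruby;0
27/07/2018 11:45;Brass raven Birdy;1
27/07/2018 12:00;Yellow raven Rio;1
"""


def daily_failures(handle, flag_column="PAGE_BROWSING_DELAY"):
    """Stream the rows once and count the flagged ones per calendar date."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter=";"):
        if row[flag_column] == "1":
            # assumes the DATETIME text is "date time", so the date comes first
            date = row["DATETIME"].split(" ")[0]
            counts[date] += 1
    return counts


series = daily_failures(io.StringIO(SAMPLE))
for date, total in series.items():
    print(date, total)
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle; the reader still sees one row at a time.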
TIME SERIES ANALYSIS USING ARIMA:
Introduction to ARIMA:
ARIMA stands for Autoregressive Integrated Moving Average. Univariate (single vector) ARIMA is a forecasting technique that projects the future values of a series based entirely on its own inertia. Its main application is short-term forecasting, and it requires at least 40 historical data points. It works best when the data exhibits a stable or consistent pattern over time with a minimum of outliers. The first step in applying the ARIMA methodology is to check for stationarity. “Stationarity” implies that the series remains at a fairly constant level over time. If a trend exists, as in most economic or business applications, then the data is NOT stationary. The data should also show a constant variance in its fluctuations over time. A violation of this is easily seen in a series that is heavily seasonal and growing at a faster rate.
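A common way to move a trending series toward stationarity is differencing, which is the "Integrated" part of ARIMA: each value is replaced by its change from the previous value. As a small sketch, the four observed values that appear in the error calculation of section 6 are differenced below; the helper name `difference` is hypothetical, and four points are of course far too few for a real stationarity check.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Observed values taken from the error calculation in section 6.
actual = [975227, 970960, 958250, 946177]

diffed = difference(actual)
print(diffed)  # [-4267, -12710, -12073]
```

The original values fall steadily, while the differenced values no longer carry that level. Passing a larger `lag` (for example the season length) gives a seasonal difference in the same way.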
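SOLUTION:
```{r}
############################### FITTING THE AUTOMATED FORECASTING ARIMA MODEL ###########################
library(forecast)  # provides auto.arima(), forecast() and accuracy()
fit1 <- auto.arima(data_new$total_failures)
forecast(fit1, 4)  # forecast the next four values
summary(fit1)
plot(forecast(fit1))
```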
TIME SERIES ANALYSIS USING SIMPLE EXPONENTIAL METHOD:
INTRODUCTION TO SIMPLE EXPONENTIAL METHOD:
Exponential forecasting is another smoothing method and has been around since the 1950s. Where naive forecasting places 100% weight on the most recent observation and moving averages place equal weight on k values, exponential smoothing allows for weighted averages where greater weight is placed on recent observations and lesser weight on older observations. Exponential smoothing methods are intuitive, computationally efficient, and generally applicable to a wide range of time series.
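The weighting idea can be written as a short recursion: the new level is alpha times the latest observation plus (1 - alpha) times the previous level. The following is a minimal sketch of simple exponential smoothing in plain Python, not the R fit; alpha = 0.5 is an arbitrary value chosen for illustration, the function name is hypothetical, and the input is the four observed values from the error calculation in section 6.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed levels l_t = alpha * y_t + (1 - alpha) * l_(t-1)."""
    level = series[0]  # initialise the level with the first observation
    levels = [level]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
        levels.append(level)
    return levels


actual = [975227, 970960, 958250, 946177]
levels = simple_exponential_smoothing(actual, alpha=0.5)

# Simple exponential smoothing forecasts every future step with the last level.
forecast = levels[-1]
print(levels)
print(forecast)
```

An alpha close to 1 follows the latest observation closely, which is the naive forecast in the limit, while an alpha close to 0 changes the level only slowly.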
SOLUTION:
```{r}
################################### FITTING THE SIMPLE EXPONENTIAL MODEL ##################################
# holt() fits Holt's linear trend method (exponential smoothing with a trend);
# ses() is the forecast-package function for simple exponential smoothing
fit2 <- holt(data_new$total_failures)
accuracy(fit2)
```
              ME        RMSE      MAE       MPE         MAPE      MASE       ACF1
Training set  503.3918  15844.26  12202.75  0.02569473  1.258304  0.8891327  0.3502734
PREDICTION OF FUTURE FOUR VALUES BY USING TIME SERIES MODEL:
```{r}
################################  FITTING THE SIMPLE EXPONENTIAL MODEL ############################
forecast(fit2, 4)
plot(forecast(fit2, 4))
```
TIME SERIES ANALYSIS USING RECURRENT NEURAL NETWORK:
INTRODUCTION TO RECURRENT NEURAL NETWORK:
A powerful type of neural network designed to handle sequence dependence is the recurrent neural network (RNN). The Long Short-Term Memory (LSTM) network is a type of recurrent neural network used in deep learning because very large architectures can be successfully trained.
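What "sequence dependence" means can be seen in the recurrence itself: the hidden state at each step is computed from the current input and the previous hidden state. The sketch below runs a one-unit recurrent cell in plain Python. It is a toy illustration only: the weights are made-up, untrained numbers, the function name `rnn_forward` is hypothetical, and it is unrelated to the model fitted in R.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Run a one-unit recurrent cell: h_t = tanh(w_x * x_t + w_h * h_(t-1) + bias)."""
    hidden = 0.0  # initial hidden state
    states = []
    for x in inputs:
        hidden = math.tanh(w_x * x + w_h * hidden + bias)
        states.append(hidden)
    return states


# The same three inputs in two different orders.
forward = rnn_forward([0.1, 0.5, 0.9])
backward = rnn_forward([0.9, 0.5, 0.1])
print(forward[-1], backward[-1])
```

The two final states differ although the inputs are the same numbers, because each state carries information about everything that came before it.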
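SOLUTION:
```{r}
######################################### FITTING THE NEURAL NETWORK #####################################
# nnetar() from the forecast package fits a feed-forward neural network
# autoregression that uses lagged values of the series as inputs
fit5 <- nnetar(data_new$total_failures)
plot(forecast(fit5,h=4))
```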
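6. DEPLOYMENT : PREDICTING THE BEST TIME SERIES MODEL :
```{r}
##################################### ROOT MEAN SQUARE ERROR FOR ARIMA ##########################
# each error is the observed value minus the forecast value
a1 = 975227-974967
a2 = 970960-971685
a3 = 958250-971685
a4 = 946177-971685
arima = (a1**2+a2**2+a3**2+a4**2)/4   # mean of the squared errors
arima
sqrt(arima)                           # root mean square error
```
OUTPUT VALUE  : 207937629 (mean squared error); its square root, the RMSE, is about 14420
```{r}
################################# ROOT MEAN SQUARE ERROR FOR THE RNN ################################
r1 = 975227-972724
r2 = 970960-966955
r3 = 958250-965626
r4 = 946177-965540
rnn = (r1**2+r2**2+r3**2+r4**2)/4     # mean of the squared errors
rnn
sqrt(rnn)                             # root mean square error
```
OUTPUT VALUE  : 112909045 (mean squared error); RMSE about 10626

```{r}
######################## ROOT MEAN SQUARE ERROR FOR THE SIMPLE EXPONENTIAL ############################

e1 = 975227-973776
e2 = 970960-973725
e3 = 958250-973674
e4 = 946177-973623
exponential = (e1**2+e2**2+e3**2+e4**2)/4   # the fourth term must be e4, not e3 again
exponential
sqrt(exponential)                           # root mean square error
```

OUTPUT VALUE  : 250233330 (mean squared error); RMSE about 15819. Squaring e3 twice in place of e4 gives 121387545.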

In order to predict the next four values we fitted three different models. We derived a dataset named fail_data, which contains DATE, TIME and FAILURES. For prediction we derived predict_data, which consists of DATE and FAILURES, from the TELENOR data and applied all three models to it. From predict_data and fail_data we calculated the RMSE values and the next four values, for the dates 06/08/2018, 07/08/2018, 08/08/2018 and 09/08/2018.

(i) The next four values of the ARIMA model are 974967, 971685, 971685, 971685.

(ii) The next four values of the SIMPLE EXPONENTIAL model are 973776, 973725, 973674, 973623.

(iii) The next four values of the RECURRENT NEURAL NETWORK model are 972724, 966955, 965626, 965540.

After predicting the next four values we calculated the error of each model and also plotted the graphs for each model. The calculation in the code averages the squared errors, which gives the mean squared error; the root mean square error (RMSE) is the square root of that value. The values for each model are as follows:

(i) For the ARIMA model the mean squared error is 207937629, so the RMSE is about 14420.

(ii) For the SIMPLE EXPONENTIAL model the mean squared error is 250233330, so the RMSE is about 15819. The value 121387545 results when e3 is squared twice and e4 is left out of the sum.

(iii) For the RECURRENT NEURAL NETWORK model the mean squared error is 112909045, so the RMSE is about 10626.
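The error comparison can be cross-checked in a few lines of Python. The sketch below uses the four observed values from the R chunks in section 6 and the forecasts listed above; the function names `mse` and `rmse` are our own for this illustration. It computes the mean squared error and then its square root for each model.

```python
import math

# Observed values and forecasts as listed in this section.
actual = [975227, 970960, 958250, 946177]
forecasts = {
    "ARIMA": [974967, 971685, 971685, 971685],
    "SIMPLE EXPONENTIAL": [973776, 973725, 973674, 973623],
    "RNN": [972724, 966955, 965626, 965540],
}


def mse(observed, predicted):
    """Mean of the squared forecast errors."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Square root of the mean squared error."""
    return math.sqrt(mse(observed, predicted))


scores = {name: rmse(actual, values) for name, values in forecasts.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: MSE={mse(actual, forecasts[name]):.1f} RMSE={score:.1f}")

best = min(scores, key=scores.get)
print("lowest RMSE:", best)
```

Taking the square root does not change the ranking of the models, but it puts the error back on the same scale as the failure counts themselves.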

7. CONCLUSION: Finally, comparing the root mean square error of these algorithms, we conclude that RNN (Recurrent Neural Networks) is the best algorithm for predicting the coming days, because it has the lowest root mean square error of the three models. So, based on the RNN algorithm, we predicted the delays for the next four days from the given TELENOR dataset.

FINAL SUBMISSION – Includes the PYTHON and R MARKDOWN code for problem solving, analysis and time series prediction.


3 thoughts on “Datathon Telenor Solution – WILDLINGS ANALYSIS ON TELENOR – GAME OF THRONES”

  1. Great work, guys!
    I like how you approached it using different methods. What approaches would you recommend to remove seasonality from time-series?

  2. Thank you for your appreciation, jury members. The approach we would recommend for removing seasonality from a time series is seasonal ARIMA (Seasonal Autoregressive Integrated Moving Average, SARIMA). Seasonal differencing is a crude form of additive seasonal adjustment: the “index” which is subtracted from each value of the time series is simply the value that was observed in the same season one year earlier. SARIMA models can satisfactorily describe time series that exhibit non-stationary behaviors both within and across seasons.
