Cell phones have become a necessity for many people throughout the world. Keeping in touch with family and business associates and having access to email are only a few of the reasons for their increasing importance. Today’s technically advanced cell phones can not only place and receive calls but also store data, take pictures, and even serve as walkie-talkies, to name just a few of the available options.
The dataset, The Telenor Case – What do Game of Thrones and Telecoms Have in Common?, contains data on delays in networks (RAVENS). The delays were recorded from 26/07/2018 to 05/08/2018. Each RAVEN_NAME represents a tower. There are 7847 unique RAVEN_NAMES across the 2G, 3G, and 4G networks, and 5 unique families.
To provide the best solution to the business problems, we solved them in two steps: (i) data analysis and coding in PYTHON and (ii) time series model building in R Studio.
In the data analysis we counted the delays (failures) of the RAVENS, found the top 10 RAVENS with and without fails, and identified the family names and member names with the most and the fewest fails.
Prediction and forecasting were done by building time series models. As the name suggests, time series analysis works on data indexed by time (years, days, hours, minutes) to derive hidden insights that support informed decisions. Time series models are most useful when the data is serially correlated. To predict four days of the mobile data, we divided the data into train and test sets. We carried out the time series analysis using ARIMA, simple exponential smoothing, and Recurrent Neural Networks (RNN).
Finally, comparing the root mean square error of these algorithms, we found RNN (Recurrent Neural Networks) to be the best algorithm for predicting the coming days. The delays for the next four days were predicted with the RNN algorithm. We plotted graphs of the time series forecasts for all the algorithms.
The Telenor Case – What do Game of Thrones and Telecoms Have in Common?
The dataset we chose is The Telenor Case. It records mobile data failures over one month. It contains information about RAVENS communication and the network (2G/3G/4G) over which the information was shared, as well as who initiated the communication, given as family and member name (example: Lannister, Tyrion). It also contains different types of failures, that is, different types of delays.
1. BUSINESS UNDERSTANDING:
This initial phase focuses on understanding the objectives and requirements of the dataset from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. A decision model, specifically a time series model, was to be built on the dataset.
2. DATA UNDERSTANDING:
Data understanding starts with an initial analysis of the data in order to become familiar with its content, identify data quality problems, discover first insights, detect interesting subsets that may reveal hidden information, and predict the information that is required.
The dataset is nearly 4 GB and contains 30091754 rows and 16 columns. Its size made it difficult to import into PYTHON and R, but we eventually imported it into PYTHON for exploratory data analysis.
CODE FOR IMPORTING THE DATA INTO PYTHON:
import pandas as pd
telenor = pd.read_csv("data.csv", delimiter=';')
telenor
telenor.info()
OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30091754 entries, 0 to 30091753
Data columns (total 16 columns):
DATETIME object
RAVEN_NAME object
FAMILY_NAME object
MEMBER_NAME object
NETWORK object
FIRST_GET_RESPONSE_SUCCESS_D int64
PAGE_BROWSING_DELAY int64
TCP_SETUP_TOTAL_DELAY int64
PAGE_CONTENT_DOWNLOAD_TOTAL_D int64
FIRST_DNS_RESPONSE_SUCCESS_D int64
DNS_RESPONSE_SUCCESS_DELAY int64
FIRST_TCP_RESPONSE_SUCCESS_D int64
PAGE_SR_DELAYS int64
SYN_SYN_DELAY int64
TCP_CONNECT_DELAY int64
PAGE_BROWSING_DELAYS int64
dtypes: int64(11), object(5)
memory usage: 3.6+ GB
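With a file of this size, one option is to read it in chunks and to store columns in compact dtypes. The sketch below is only an illustration and not the code used for this submission: it reads a tiny in-memory sample instead of data.csv, and the chunk size, the `int8` dtype (which assumes the delay column holds only 0/1 flags), and the `category` conversion are all assumptions.

```python
import io
import pandas as pd

# Tiny stand-in for data.csv (same ';' delimiter, only a few of the 16 columns).
sample_csv = io.StringIO(
    "DATETIME;RAVEN_NAME;NETWORK;PAGE_BROWSING_DELAY\n"
    "26/07/2018 00:00;Brass raven Birdy;4G;1\n"
    "26/07/2018 00:05;Blue raven Axel;3G;0\n"
    "27/07/2018 00:00;Brass raven Birdy;4G;1\n"
)

# Read a fixed number of rows at a time instead of loading everything at once.
chunks = pd.read_csv(
    sample_csv,
    delimiter=";",
    dtype={"PAGE_BROWSING_DELAY": "int8"},  # assumes the column holds 0/1 flags
    chunksize=2,  # assumed value; something like 1_000_000 suits the real file
)

telenor_sample = pd.concat(chunks, ignore_index=True)

# Low-cardinality text columns usually take much less memory as categoricals.
telenor_sample["NETWORK"] = telenor_sample["NETWORK"].astype("category")
telenor_sample["RAVEN_NAME"] = telenor_sample["RAVEN_NAME"].astype("category")

print(telenor_sample.dtypes)
print(len(telenor_sample))
```

On the real file, each chunk could also be filtered or aggregated before concatenation, so that the full table never has to sit in memory at once.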
TO LIST THE COLUMNS THAT THE DATASET CONTAINS:
telenor.columns
OUTPUT:
Index(['DATETIME', 'RAVEN_NAME', 'FAMILY_NAME', 'MEMBER_NAME', 'NETWORK',
'FIRST_GET_RESPONSE_SUCCESS_D', 'PAGE_BROWSING_DELAY',
'TCP_SETUP_TOTAL_DELAY', 'PAGE_CONTENT_DOWNLOAD_TOTAL_D',
'FIRST_DNS_RESPONSE_SUCCESS_D', 'DNS_RESPONSE_SUCCESS_DELAY',
'FIRST_TCP_RESPONSE_SUCCESS_D', 'PAGE_SR_DELAYS', 'SYN_SYN_DELAY',
'TCP_CONNECT_DELAY', 'PAGE_BROWSING_DELAYS'],
dtype='object')
3. DATA ANALYSIS:
From this we learned that a RAVEN is a TOWER: the channel that carries voice calls and data usage. Sharing information from one person or point to another requires a channel or network to carry it. The dataset contains 7847 unique RAVEN names, and the networks are 2G, 3G, and 4G.
There are 5 unique family names (Targerian, Greyjoy, Stark, Lannister, Baelish) and 28 unique member names. The DATETIME ranges from 26/07/2018 to 05/08/2018.
With this understanding of the data, we began to identify data quality problems and to draw first insights from the data.
4. EVALUATION:
PROBLEM – (i) Top 10 ravens with fails :
SOLUTION :
top_10_raven_fails = telenor_most_delays[['RAVEN_NAME','DATETIME']].groupby('RAVEN_NAME').count()
top_10_raven_fails = top_10_raven_fails.sort_values('DATETIME', ascending=False).head(10)
| RAVEN_NAME | FAILURES |
|---|---|
| Brass raven Birdy | 218372 |
| Brown raven Ruby | 210211 |
| Yellow raven Rio | 209263 |
| Blue raven Axel | 190226 |
| Razzle Dazzle Rose raven Cleo | 186966 |
| Cadmium Red raven Bubba | 184785 |
| Vain And Lazy raven Polly | 177113 |
| Fearful Carrion raven Gizmo | 175197 |
| Blast Off Bronze raven Zazu | 169995 |
| Loving raven Maxwell | 169584 |
EXPLANATION : This problem asks for the top 10 RAVENS with failures, and the next one for the top 10 without failures. We solved both by dividing the data into two categories, telenor_most_delays and telenor_least_delays.
If any delay column of a row contains ‘1’, we treat the delay as TRUE, meaning there is a delay in the network, and assign the row to telenor_most_delays. If all delay columns contain ‘0’, we treat the delay as FALSE, meaning there is no delay in the network, and assign the row to telenor_least_delays.
To find the top 10 ravens with fails, we grouped the telenor_most_delays category by RAVEN_NAME, counted the FAILURES for each RAVEN_NAME, sorted the counts, and took the top 10.
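The split into rows with and without delays, followed by the top 10 count, might look like the sketch below. It is not the submission's code: the miniature table, the choice of two delay columns, and the shortened raven names are invented for illustration, and `has_delay` and `delay_columns` are hypothetical names.

```python
import pandas as pd

# Hypothetical miniature of the Telenor table with two of the eleven delay columns.
telenor = pd.DataFrame({
    "DATETIME": ["26/07/2018", "26/07/2018", "27/07/2018", "27/07/2018", "28/07/2018"],
    "RAVEN_NAME": ["Birdy", "Ruby", "Birdy", "Rio", "Ruby"],
    "PAGE_BROWSING_DELAY": [1, 0, 1, 0, 0],
    "TCP_CONNECT_DELAY": [0, 0, 1, 1, 0],
})
delay_columns = ["PAGE_BROWSING_DELAY", "TCP_CONNECT_DELAY"]

# A row counts as a failure when at least one delay column is 1.
has_delay = telenor[delay_columns].eq(1).any(axis=1)
telenor_most_delays = telenor[has_delay]
telenor_least_delays = telenor[~has_delay]

# Count failure rows per raven, sort descending, keep at most ten.
top_10_raven_fails = (
    telenor_most_delays.groupby("RAVEN_NAME").size()
    .sort_values(ascending=False)
    .head(10)
    .rename("FAILURES")
)
print(top_10_raven_fails)
```

The same grouping applied to `telenor_least_delays` gives the counts for ravens without fails.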
PROBLEM – (ii) Top 10 ravens without fails :
SOLUTION :
| RAVEN_NAME | FAILURES |
|---|---|
| Metallic Sunburst raven Polly | 297 |
| Green Sheen raven Azul | 211 |
| Less Combative raven Zazu | 191 |
| Weak raven Buddy | 188 |
| Copper raven Tweety | 179 |
| Spectral Yellow raven Zazu | 148 |
| Mythical raven Tiki | 116 |
| Cyber Grape raven Faith | 104 |
| Mysterious And Venerable raven Bubba | 98 |
| Shadow Blue raven Sammy | 95 |
EXPLANATION : To find the top 10 ravens without fails, we grouped the telenor_least_delays category by RAVEN_NAME, computed the FAILURES count for each RAVEN_NAME, sorted the counts, and took the top 10.
PROBLEM – (iii) The family with most fails :
SOLUTION:
| FAMILY_NAME | FAILURES |
|---|---|
| Targerian | 8154000 |
EXPLANATION : To find the family with the most fails, we grouped the telenor_most_delays category by FAMILY_NAME, computed the FAILURES count for each FAMILY_NAME, sorted the counts, and took the family with the highest count.
PROBLEM – (iv) The family with least fails :
SOLUTION:
top_10_family_least_fails = telenor_most_delays[['FAMILY_NAME','DATETIME']].groupby('FAMILY_NAME').count()
| FAMILY_NAME | FAILURES |
|---|---|
| Petyr Baelish | 2744406 |
EXPLANATION : To find the family with the least fails, we grouped the telenor_most_delays category by FAMILY_NAME, computed the FAILURES count for each FAMILY_NAME, sorted the counts, and took the family with the lowest count.
PROBLEM – (v) The family member with most fails :
SOLUTION:
| MEMBER_NAME | FAILURES |
|---|---|
| Petyr Baelish | 2744406 |
EXPLANATION : To find the family member with the most fails, we grouped the telenor_most_delays category by MEMBER_NAME, computed the FAILURES count for each MEMBER_NAME, sorted the counts, and took the member with the highest count.
PROBLEM – (vi) The family member with least fails :
SOLUTION:
| MEMBER_NAME | FAILURES |
|---|---|
| Euron | 491454 |
EXPLANATION : To find the family member with the least fails, we grouped the telenor_most_delays category by MEMBER_NAME, computed the FAILURES count for each MEMBER_NAME, sorted the counts, and took the member with the lowest count.
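Problems (iii) to (vi) all follow one pattern: count the failure rows per label, then pick the largest or smallest count. A minimal sketch of that pattern is shown below, assuming a table that has already been filtered to failure rows. The rows are invented for illustration (Arya is a placeholder member name), and `most_and_least` is a hypothetical helper, not part of the submission.

```python
import pandas as pd

# Hypothetical failure rows (already filtered to rows with at least one delay).
telenor_most_delays = pd.DataFrame({
    "FAMILY_NAME": ["Stark", "Stark", "Stark", "Lannister", "Lannister", "Greyjoy"],
    "MEMBER_NAME": ["Arya", "Arya", "Arya", "Tyrion", "Tyrion", "Euron"],
})

def most_and_least(frame, column):
    """Return (label with most rows, label with fewest rows) for one column."""
    counts = frame.groupby(column).size()
    return counts.idxmax(), counts.idxmin()

family_most, family_least = most_and_least(telenor_most_delays, "FAMILY_NAME")
member_most, member_least = most_and_least(telenor_most_delays, "MEMBER_NAME")
print(family_most, family_least)
print(member_most, member_least)
```

If several labels share the same count, `idxmax` and `idxmin` return only the first of them, so ties would need separate handling.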
5. MODELLING – TIME SERIES ANALYSIS:
We took the date and the number of rows in the dataset, created a new dataset from them, exported it, and carried out the time series analysis in an R Markdown file. We used ARIMA, simple exponential smoothing, and Recurrent Neural Networks (RNN).
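The daily series could be derived along the lines of the sketch below. This is an assumption about the shape of that step rather than the submission's code: the timestamps are invented, the `%d/%m/%Y %H:%M` format is a guess at how DATETIME is stored, and the `date` column name is hypothetical. The names `data_new` and `total_failures` follow the R code in this write-up.

```python
import pandas as pd

# Hypothetical failure rows; the real DATETIME format may differ.
telenor_most_delays = pd.DataFrame({
    "DATETIME": [
        "26/07/2018 08:00", "26/07/2018 09:30", "27/07/2018 10:00",
        "28/07/2018 11:00", "28/07/2018 12:00", "28/07/2018 13:00",
    ],
})

# Parse the timestamps, then count failure rows per calendar day.
timestamps = pd.to_datetime(telenor_most_delays["DATETIME"], format="%d/%m/%Y %H:%M")
daily_counts = telenor_most_delays.groupby(timestamps.dt.date).size()
data_new = daily_counts.rename_axis("date").reset_index(name="total_failures")

# Serialise as CSV text; passing a file path to to_csv() would write the export file.
csv_text = data_new.to_csv(index=False)
print(csv_text)
```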
TIME SERIES ANALYSIS USING ARIMA:
Introduction to ARIMA:
ARIMA stands for Autoregressive Integrated Moving Average. Univariate (single vector) ARIMA is a forecasting technique that projects the future values of a series based entirely on its own inertia. Its main application is short-term forecasting, and it requires at least 40 historical data points. It works best when the data exhibits a stable or consistent pattern over time with a minimum of outliers. The first step in applying the ARIMA methodology is to check for stationarity. “Stationarity” implies that the series remains at a fairly constant level over time. If a trend exists, as in most economic or business applications, then the data is NOT stationary. The data should also show a constant variance in its fluctuations over time. A violation of this is easily seen in a series that is heavily seasonal and growing at a faster rate.
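When a series has a trend, differencing (the "Integrated" part of ARIMA) is the usual way to bring it back to a roughly constant level. The sketch below shows first-order differencing on invented numbers, not on the Telenor totals; `difference` is a hypothetical helper.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical daily totals with a steady upward trend (not the Telenor data).
trending = [100, 110, 121, 130, 141, 150]
differenced = difference(trending)
print(differenced)  # [10, 11, 9, 11, 9]
```

The original values climb from 100 to 150, while the differenced values stay close to 10, which is the kind of constant level that ARIMA expects.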
SOLUTION:
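```{r}
############################### FITTING THE AUTOMATED FORECASTING ARIMA MODEL ###########################
library(forecast)  # provides auto.arima(), forecast(), holt(), nnetar(), accuracy()
fit1 <- auto.arima(data_new$total_failures)  # choose the ARIMA orders automatically
forecast(fit1, 4)                            # forecast the next four values
summary(fit1)
plot(forecast(fit1))
```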
TIME SERIES ANALYSIS USING SIMPLE EXPONENTIAL METHOD:
INTRODUCTION TO SIMPLE EXPONENTIAL METHOD:
Exponential forecasting is another smoothing method and has been around since the 1950s. Where naive forecasting places 100% weight on the most recent observation and moving averages place equal weight on the last k values, exponential smoothing allows for weighted averages in which greater weight is placed on recent observations and lesser weight on older ones. Exponential smoothing methods are intuitive, computationally efficient, and applicable to a wide range of time series.
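The weighting idea can be made concrete with a few lines of code. The following is a minimal sketch of simple exponential smoothing on invented numbers; the smoothing weight `alpha = 0.5` is an arbitrary choice and the function is a hypothetical helper, not the routine used in the R code.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    level_t = alpha * y_t + (1 - alpha) * level_{t-1}, started at the first value.
    """
    level = series[0]
    levels = [level]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
        levels.append(level)
    return levels

# Hypothetical values, not the Telenor totals; alpha = 0.5 is an arbitrary choice.
observations = [10.0, 20.0, 30.0]
levels = simple_exponential_smoothing(observations, alpha=0.5)
print(levels)      # [10.0, 15.0, 22.5]
print(levels[-1])  # the flat forecast for every future step: 22.5
```

Setting `alpha` to 1 puts all the weight on the latest observation and reproduces the naive forecast, while smaller values spread the weight over older observations.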
SOLUTION:
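```{r}
################################### FITTING THE SIMPLE EXPONENTIAL MODEL ##################################
# holt() fits exponential smoothing with a linear trend (Holt's method)
fit2 <- holt(data_new$total_failures)
accuracy(fit2)  # training-set error measures
```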
ME RMSE MAE MPE MAPE MASE ACF1
Training set 503.3918 15844.26 12202.75 0.02569473 1.258304 0.8891327 0.3502734
PREDICTION OF FUTURE FOUR VALUES BY USING TIME SERIES MODEL:
```{r}
################################ FITTING THE SIMPLE EXPONENTIAL MODEL ############################
forecast(fit2, 4)
plot(forecast(fit2, 4))
```
TIME SERIES ANALYSIS USING RECURRENT NEURAL NETWORK:
INTRODUCTION TO RECURRENT NEURAL NETWORK:
A recurrent neural network (RNN) is a powerful type of neural network designed to handle sequence dependence. The Long Short-Term Memory (LSTM) network is a type of recurrent neural network used in deep learning because very large architectures can be trained successfully.
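Sequence dependence comes from a hidden state that is carried from one time step to the next. The sketch below runs a single recurrent unit over a short sequence to show that mechanism. It is purely illustrative: the weights are arbitrary and untrained, `rnn_forward` is a hypothetical helper, and this is not the model fitted in the R code.

```python
import math

def rnn_forward(inputs, w_input, w_hidden, bias):
    """Run a one-unit recurrent cell over a sequence and return all hidden states.

    h_t = tanh(w_input * x_t + w_hidden * h_{t-1} + bias), with h_0 = 0.
    """
    hidden = 0.0
    states = []
    for x in inputs:
        hidden = math.tanh(w_input * x + w_hidden * hidden + bias)
        states.append(hidden)
    return states

# Arbitrary, untrained weights chosen only to show the recurrence.
states = rnn_forward([1.0, 0.0, 0.0], w_input=0.5, w_hidden=0.8, bias=0.0)
print(states)
```

The second and third inputs are zero, yet the hidden state stays positive because it still carries information from the first input.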
SOLUTION:
```{r}
######################################### FITTING THE NEURAL NETWORK #####################################
# nnetar() fits a feed-forward neural network on lagged values of the series
fit5 <- nnetar(data_new$total_failures)
plot(forecast(fit5, h=4))
```
6. DEPLOYMENT : PREDICTING THE BEST TIME SERIES MODEL :
```{r}
##################################### ROOT MEAN SQUARE ERROR FOR ARIMA ##########################
a1 = 975227-974967
a2 = 970960-971685
a3 = 958250-971685
a4 = 946177-971685
arima = sqrt((a1**2+a2**2+a3**2+a4**2)/4)  # square root of the mean squared error
arima
```
OUTPUT VALUE : 14420.04
```{r}
################################# ROOT MEAN SQUARE ERROR FOR THE RNN ################################
r1 = 975227-972724
r2 = 970960-966955
r3 = 958250-965626
r4 = 946177-965540
rnn = sqrt((r1**2+r2**2+r3**2+r4**2)/4)
rnn
```
OUTPUT VALUE : 10625.87
```{r}
######################## ROOT MEAN SQUARE ERROR FOR THE SIMPLE EXPONENTIAL ############################
e1 = 975227-973776
e2 = 970960-973725
e3 = 958250-973674
e4 = 946177-973623
exponential = sqrt((e1**2+e2**2+e3**2+e4**2)/4)  # all four errors e1..e4 are used
exponential
```
OUTPUT VALUE : 15818.77
To predict the next four values we fitted three different models. We derived a dataset named fail_data, which contains DATE, TIME and FAILURES. For prediction we derived predict_data, which consists of DATE and FAILURES, from the TELENOR data and applied all three models to it. From predict_data and fail_data we calculated the RMSE values and the next four values, for the dates 06/08/2018, 07/08/2018, 08/08/2018 and 09/08/2018.
(i) The four forecast values of the ARIMA model are 974967, 971685, 971685, 971685.
(ii) The four forecast values of the SIMPLE EXPONENTIAL model are 973776, 973725, 973674, 973623.
(iii) The four forecast values of the RECURRENT NEURAL NETWORK model are 972724, 966955, 965626, 965540.
After predicting the next four values we calculated the root mean square error (RMSE) for each model and plotted the graphs for each model. The RMSE values are as follows:
(i) The RMSE value for the ARIMA model is 14420.04.
(ii) The RMSE value for the SIMPLE EXPONENTIAL model is 15818.77.
(iii) The RMSE value for the RECURRENT NEURAL NETWORK model is 10625.87.
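RMSE is the square root of the mean of the squared differences between reference values and forecasts. The sketch below is a compact Python cross-check of that comparison, not part of the R Markdown submission. It takes the four reference values and the three sets of forecasts from the calculations in this section, and `rmse` is a hypothetical helper.

```python
import math

def rmse(actual, predicted):
    """Root mean square error: sqrt(mean((actual - predicted)^2))."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Reference values and four-step forecasts as listed in this section.
actual = [975227, 970960, 958250, 946177]
forecasts = {
    "ARIMA": [974967, 971685, 971685, 971685],
    "SIMPLE EXPONENTIAL": [973776, 973725, 973674, 973623],
    "RNN": [972724, 966955, 965626, 965540],
}

scores = {name: rmse(actual, values) for name, values in forecasts.items()}
best_model = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
print("lowest RMSE:", best_model)
```

The model with the smallest RMSE is the one whose forecasts lie closest, on average, to the reference values.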
7. CONCLUSION: Finally, comparing the root mean square error of these algorithms, we conclude that RNN (Recurrent Neural Networks) is the best algorithm for predicting the coming days, because it has the lowest root mean square error of the three models. Based on the RNN algorithm, we predicted the delays for the next four days from the given TELENOR dataset.
FINAL SUBMISSION – Includes the PYTHON and R MARKDOWN code for problem solving, analysis, and time series prediction.
3 thoughts on “Datathon Telenor Solution – WILDLINGS ANALYSIS ON TELENOR – GAME OF THRONES”
Great work, guys!
I like how you approached it using different methods. What approaches would you recommend to remove seasonality from time-series?
Thank you for the appreciation, jury members. To remove seasonality from a time series we would recommend Seasonal ARIMA (Seasonal Autoregressive Integrated Moving Average, SARIMA) models. Seasonal differencing is a crude form of additive seasonal adjustment: the “index” subtracted from each value of the time series is simply the value that was observed in the same season one year earlier. SARIMA models can satisfactorily describe time series that exhibit non-stationary behavior both within and across seasons.
Finally, we conclude that RNN is the best time series model for this dataset because it gave the most accurate result, and it can be used for deep learning analysis.