Team Members:-
- Ranajay Kr. Senapati([email protected])
- Vasuraj Bhatia([email protected])
- Marvel Jacob([email protected] )
- Rameez Raza([email protected])
- Vishal Deshwal([email protected])
1. BUSINESS UNDERSTANDING:
- Business Objective:- Mobile communication systems have revolutionized the way people communicate with each other. The evolution of technology from the first generation (1G) to the fourth generation (4G) has improved the performance and quality of voice calls and brought high-speed data. Human dependency on telecommunication has grown rapidly, and telecom companies rely on cutting-edge technology to improve customer satisfaction. We have been given a mobile data fails dataset in which we analyse failures of ravens (which here are basically network sites) and predict the number of raven failures in future days, so that they can be rectified and the communication system improved.
- Assess Situation:- By applying a prediction model and time series analysis to the mobile data fails, we can predict in advance which network site is going to fail. Telecom companies then know which site is likely to become defective and can send engineers to fix the degrading site, which saves considerable time and money.
- Determine Data Mining Goals:- By analysing real-time data from every network site, we can identify and pinpoint network failures and track the performance of each site. This allows staff to address issues before a site starts to degrade.
- Produce Project Plans:- We implemented the project using the following tools and techniques:-
- R programming
- Machine Learning Algorithm-ARIMA
- R-Studio
- MS-Excel
2.DATA UNDERSTANDING:-
a) Collect Initial Data:- The data was provided to us by the Data Science Society. It consists of the Telenor mobile data fails dataset, with timestamps at 15-minute intervals over a range of 31 days.
b) Describe Data:- Based on one month of data on flight fails, we have to perform a time series analysis, predict the future number of fails, and find out how many of the ravens sent are not going to make it. Because people depend heavily on regular communication, it is necessary to identify failures and predict further flaws in the system so that they can be corrected. Digging into this data should yield insights that help reduce the failure rate and provide a more reliable system in the near future.
c) Explore Data:- The dataset contains 16 columns and 30,091,754 rows.
CODE:
> dim(data1)
OUTPUT:
[1] 30091754 16
Assumptions: There are 11 columns that capture the type of delay. If all of these delay columns are zero for a record associated with a raven, we take that as a success; otherwise we take it as a failure.
CODE:
> # Assumes data1 is a data.table, so columns can be referenced by bare name inside [ ]
> Mobile_Data_Failure = data1[FIRST_GET_RESPONSE_SUCCESS_D != 0 | PAGE_BROWSING_DELAY != 0 | TCP_SETUP_TOTAL_DELAY != 0 |
+ PAGE_CONTENT_DOWNLOAD_TOTAL_D != 0 | FIRST_DNS_RESPONSE_SUCCESS_D != 0 | DNS_RESPONSE_SUCCESS_DELAY != 0 |
+ FIRST_TCP_RESPONSE_SUCCESS_D != 0 | PAGE_SR_DELAYS != 0 | SYN_SYN_DELAY != 0 | TCP_CONNECT_DELAY != 0 |
+ PAGE_BROWSING_DELAYS != 0]
>dim(Mobile_Data_Failure)
OUTPUT:
[1] 30080943 16
SIGNIFICANCE:
Of the 30,091,754 records, 30,080,943 (with all 16 columns) have at least one non-zero delay, so we consider them failures.
CODE:
> Mobile_Data_Without_Failure = data1[FIRST_GET_RESPONSE_SUCCESS_D == 0 & PAGE_BROWSING_DELAY == 0 & TCP_SETUP_TOTAL_DELAY == 0 &
+ PAGE_CONTENT_DOWNLOAD_TOTAL_D == 0 & FIRST_DNS_RESPONSE_SUCCESS_D == 0 & DNS_RESPONSE_SUCCESS_DELAY == 0 &
+ FIRST_TCP_RESPONSE_SUCCESS_D == 0 & PAGE_SR_DELAYS == 0 & SYN_SYN_DELAY == 0 & TCP_CONNECT_DELAY == 0 &
+ PAGE_BROWSING_DELAYS == 0]
> dim(Mobile_Data_Without_Failure)
OUTPUT:
[1] 10811 16
SIGNIFICANCE:
The remaining 10,811 records (with all 16 columns) have zero in every delay column, so we consider them free of failures.
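The success/failure rule above can also be written as a single row-wise check. The following is a minimal base R sketch on a hypothetical four-row table with only three of the eleven delay columns; the toy values and the `is_failure` column name are assumptions for illustration, not part of the original analysis. It also shows the sanity check that failures and non-failures partition the data, which holds for the real counts as well (30,080,943 + 10,811 = 30,091,754).

```r
# Toy stand-in for data1 (hypothetical values); the real data has 11 delay columns.
toy <- data.frame(
  RAVEN_NAME        = c("r1", "r2", "r3", "r4"),
  PAGE_SR_DELAYS    = c(0, 12, 0, 0),
  SYN_SYN_DELAY     = c(0, 0, 0, 7),
  TCP_CONNECT_DELAY = c(0, 3, 0, 0)
)
delay_cols <- c("PAGE_SR_DELAYS", "SYN_SYN_DELAY", "TCP_CONNECT_DELAY")

# A record is a failure if at least one delay column is non-zero.
toy$is_failure <- unname(rowSums(toy[delay_cols] != 0) > 0)

n_fail <- sum(toy$is_failure)
n_ok   <- sum(!toy$is_failure)

# Every record is either a failure or a success, so the counts must add up.
stopifnot(n_fail + n_ok == nrow(toy))
print(table(toy$is_failure))
```

Keeping the column names in one vector such as `delay_cols` avoids repeating the eleven conditions in both the failure and the non-failure filter.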
d) Verify Data Quality:- The data does not contain any null values. Also the data is clean and neatly organized.
3.Data Preparation:-
a) Data selection: The dataset we analyse is the Telenor dataset described above: one month of data on flight fails, used for time series analysis, for predicting the future number of fails, and for finding out how many of the ravens sent are not going to make it. The chosen file is a 4 GB CSV file with 30,091,754 rows and 16 columns.
b) Clean data: The data was already clean enough before analysis to apply the different analytical methods.
c) Construct data: Because the data provided to us was already clean, we started working on it directly.
d) Integrate data: The data was already well integrated, so we focused on grouping the specific columns needed to get adequate results for both the EDA and the modelling parts.
e) Format data: The data was already well formatted and grouped for further exploratory analysis in both parts.
4.Modelling:-
a) Select Modelling Technique: The machine learning model used for modelling was ARIMA (Auto-Regressive Integrated Moving Average), because our data called for a univariate time series modelling algorithm.
b) Generate Test Design: The test data was generated by grouping two columns of the provided data: the date and the recorded transmission fails.
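As a rough illustration of this grouping step, the sketch below aggregates 15-minute records into one failure count per day using base R. The column names `timestamp` and `is_failure`, the start date, and the simulated values are all hypothetical; the real dataset's column names and the authors' exact grouping code are not shown in this write-up.

```r
set.seed(1)

# Hypothetical 15-minute records covering 3 days (96 intervals per day).
ts_start <- as.POSIXct("2018-01-01 00:00:00", tz = "UTC")
records <- data.frame(
  timestamp = ts_start + seq(0, by = 15 * 60, length.out = 96 * 3)
)
# Simulated failure flag per record (assumption, for illustration only).
records$is_failure <- rbinom(nrow(records), size = 1, prob = 0.3) == 1

# Group by calendar date and count the failures on each day.
records$date <- as.Date(records$timestamp, tz = "UTC")
daily_fails <- aggregate(is_failure ~ date, data = records, FUN = sum)
names(daily_fails) <- c("date", "fails")
print(daily_fails)

# The daily counts can then be wrapped in a univariate time series object.
fails_ts <- ts(daily_fails$fails)
```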
c) Build Model Parameter Setting: Model:- ARIMA. A popular and widely used statistical method for time series forecasting is the ARIMA model. ARIMA is an acronym that stands for Auto-Regressive Integrated Moving Average. It is a class of models that captures a suite of different standard temporal structures in time series data. The acronym is descriptive, capturing the key aspects of the model itself. Briefly, they are:
- AR: Auto-regression. A model that uses the dependent relationship between an observation and some number of lagged observations.
- I: Integrated. The use of differencing of raw observations (e.g. subtracting the observation at the previous time step from the current observation) in order to make the time series stationary.
- MA: Moving Average. A model that uses the dependency between an observation and the residual errors from a moving average model applied to lagged observations.
Each of these components is explicitly specified in the model as a parameter. The standard notation is ARIMA(p,d,q), where the parameters are substituted with integer values to quickly indicate the specific ARIMA model being used. The parameters of the ARIMA model are defined as follows:
- p: The number of lag observations included in the model, also called the lag order.
- d: The number of times that the raw observations are differenced, also called the degree of differencing.
- q: The size of the moving average window, also called the order of the moving average.
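To make the (p,d,q) notation concrete, here is a minimal sketch that fits an ARIMA(1,1,0) model with the base R `arima()` function and forecasts seven steps ahead. The series is simulated with `arima.sim()` as a stand-in for the failure counts, and the order (1,1,0) is an illustrative assumption; the write-up does not state which order the authors actually used.

```r
set.seed(42)

# Simulated series standing in for the failure counts (not the real data).
sim <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.6), n = 200)
fails_ts <- ts(as.numeric(sim))

# d = 1 means the model works on first differences of the series.
first_diff <- diff(fails_ts)

# p = 1 lagged observation, d = 1 difference, q = 0 moving average terms.
# The order is an assumption for illustration only.
fit <- arima(fails_ts, order = c(1, 1, 0), method = "ML")
print(fit)

# Forecast the next 7 time steps with standard errors.
fc <- predict(fit, n.ahead = 7)
print(fc$pred)
print(fc$se)
```

In practice the order would be chosen by inspecting ACF/PACF plots or comparing information criteria across candidate models rather than fixed in advance.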
d) Assess Modelling: The model was assessed using the test data mentioned above, and the model parameters were obtained.
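One common way to assess a forecasting model is to hold out the last few observations, forecast them, and compare forecast with actual values. The sketch below does this in base R on a simulated series; the holdout length `h`, the model order, and the error metrics (RMSE and MAE) are assumptions chosen for illustration and are not taken from the authors' evaluation.

```r
set.seed(7)

# Simulated series standing in for the failure counts (not the real data).
series <- ts(as.numeric(
  arima.sim(model = list(order = c(1, 1, 0), ar = 0.5), n = 120)
))

# Hold out the last h observations as a test set.
h <- 10
n <- length(series)
train <- window(series, end = n - h)
test  <- window(series, start = n - h + 1)

# Fit on the training part only, then forecast the held-out horizon.
fit  <- arima(train, order = c(1, 1, 0), method = "ML")
pred <- predict(fit, n.ahead = h)$pred

# Forecast errors on the held-out data.
errors <- as.numeric(test) - as.numeric(pred)
rmse <- sqrt(mean(errors^2))
mae  <- mean(abs(errors))
cat("RMSE:", rmse, " MAE:", mae, "\n")
```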
5.Evaluation:-
After the model was run, the results were evaluated.
First, to assess the data and gain insights from it, we used the ARIMA machine learning algorithm. We had some difficulty building the time series day-wise, so we assessed the model monthly. The output is as follows:-
Question 1) Top 10 ravens with most fails:-
Mobile_Data_Failure = data1[FIRST_GET_RESPONSE_SUCCESS_D != 0 | PAGE_BROWSING_DELAY != 0 | TCP_SETUP_TOTAL_DELAY != 0 |
PAGE_CONTENT_DOWNLOAD_TOTAL_D != 0|FIRST_DNS_RESPONSE_SUCCESS_D!=0|DNS_RESPONSE_SUCCESS_DELAY!=0|
FIRST_TCP_RESPONSE_SUCCESS_D!=0|PAGE_SR_DELAYS!=0|SYN_SYN_DELAY!=0|TCP_CONNECT_DELAY!=0|
PAGE_BROWSING_DELAYS!=0]
View(Mobile_Data_Failure)
dim(Mobile_Data_Failure)
Top_10_Raven_Fails = Mobile_Data_Failure %>% group_by(RAVEN_NAME) %>% summarise(Count=n()) %>% arrange(-Count) %>% head(10)
View(Top_10_Raven_Fails)
Result-
Question 2)Top 10 ravens without fails:-
Mobile_Data_WithoutFailure = data1[FIRST_GET_RESPONSE_SUCCESS_D == 0 & PAGE_BROWSING_DELAY == 0 & TCP_SETUP_TOTAL_DELAY == 0 &
PAGE_CONTENT_DOWNLOAD_TOTAL_D == 0&FIRST_DNS_RESPONSE_SUCCESS_D==0&DNS_RESPONSE_SUCCESS_DELAY==0&
FIRST_TCP_RESPONSE_SUCCESS_D==0&PAGE_SR_DELAYS==0&SYN_SYN_DELAY==0&TCP_CONNECT_DELAY==0&
PAGE_BROWSING_DELAYS==0]
View(Mobile_Data_WithoutFailure)
Top_10_Raven_WithoutFails = Mobile_Data_WithoutFailure %>% group_by(RAVEN_NAME) %>% summarise(Count=n()) %>% arrange(-Count) %>% head(10)
View(Top_10_Raven_WithoutFails)
Result:-
Question 3) The family with most fails:-
Family_Most_Fails = Mobile_Data_Failure %>% group_by(FAMILY_NAME) %>% summarise(Count=n()) %>% arrange(-Count) %>% head(1)
View(Family_Most_Fails)
Result:-
Question 4) The family with least fails:-
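# Note: this counts failure-free records per family, so it returns the family with the most failure-free records
Family_Least_Fails = Mobile_Data_WithoutFailure %>% group_by(FAMILY_NAME) %>% summarise(Count=n()) %>% arrange(-Count) %>% head(1)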
View(Family_Least_Fails)
Result:-
Question 5) The family member with most fails:-
Family_Member_Most_Fails = Mobile_Data_Failure %>% group_by(MEMBER_NAME) %>% summarise(Count=n()) %>% arrange(-Count) %>% head(1)
View(Family_Member_Most_Fails)
Result:-
Question 6)The family member with least fails:-
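# Note: this counts failure-free records per member, so it returns the member with the most failure-free records
Family_Member_Least_Fails = Mobile_Data_WithoutFailure %>% group_by(MEMBER_NAME) %>% summarise(Count=n()) %>% arrange(-Count) %>% head(1)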
View(Family_Member_Least_Fails)
Result:-
6.Deployment:-
Plan Deployment: The ARIMA procedure (PROC ARIMA in SAS) provides the identification, parameter estimation, and forecasting of auto-regressive integrated moving average (Box-Jenkins) models, seasonal ARIMA models, transfer function models, and intervention models. It offers complete ARIMA (Box-Jenkins) modelling with no limits on the order of the auto-regressive or moving average processes. Estimation can be done by exact maximum likelihood, conditional least squares, or unconditional least squares. In addition, you can model intervention models, regression models with ARIMA errors, transfer function models with fully general rational transfer functions, and seasonal ARIMA models.
PROC ARIMA's model identification diagnostics include plots of the auto-correlation, partial auto-correlation, inverse auto-correlation, and cross-correlation functions. PROC ARIMA also allows tentative auto-regressive moving average (ARMA) order identification based on smallest canonical correlation, the extended sample auto-correlation function, or information criterion analysis. ARIMA model-based interpolation of missing values is permitted.
Forecasting is tied to the parameter estimation method. Finite memory forecasts are used for models estimated by maximum likelihood or exact nonlinear least squares, while infinite memory forecasts are used for models estimated by conditional least squares.
The ARIMA procedure offers a variety of model diagnostic statistics, including:
- Akaike's information criterion (AIC)
- Schwarz's Bayesian criterion (SBC or BIC)
- Ljung-Box chi-square test statistics for white noise residuals
- stationarity tests, including Augmented Dickey-Fuller (including seasonal unit root testing), Phillips-Perron, and random-walk with drift tests
The DFTEST macro performs Dickey-Fuller tests for simple unit roots or seasonal unit roots in a time series. It is useful for testing stationarity and determining the order of differencing needed for the ARIMA modelling of a time series.
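Since the project itself was built in R, a few of the diagnostics listed above have rough base R counterparts. The sketch below is an illustration on a simulated series, not the authors' code and not a reproduction of PROC ARIMA: it computes AIC and BIC, runs a Ljung-Box test on the residuals, and extracts the ACF and PACF values. The model order and the lag of 10 are assumptions. Stationarity tests such as Augmented Dickey-Fuller need an add-on package and are left out here.

```r
set.seed(123)

# Simulated series standing in for the failure counts (not the real data).
series <- ts(as.numeric(
  arima.sim(model = list(order = c(1, 1, 0), ar = 0.5), n = 200)
))
fit <- arima(series, order = c(1, 1, 0), method = "ML")

# Information criteria: lower values indicate a better fit/complexity trade-off.
aic_value <- AIC(fit)
bic_value <- BIC(fit)

# Ljung-Box test for white noise residuals.
# fitdf is set to the number of estimated ARMA coefficients.
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box",
               fitdf = length(coef(fit)))

# Auto-correlation of the residuals and partial auto-correlation of the
# differenced series, computed without drawing the plots.
acf_vals  <- acf(residuals(fit), plot = FALSE)
pacf_vals <- pacf(diff(series), plot = FALSE)

cat("AIC:", aic_value, " BIC:", bic_value, "\n")
print(lb)
```

A large Ljung-Box p-value suggests that the residuals are not distinguishable from white noise at the chosen lag, which is what a well-specified model should leave behind.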
2 thoughts on “Datathon Telenor Solution – Winner Winner-Data Dinner- The Telenor Case”
Great work, guys!
How would you tackle seasonality in your data? I saw you were using seasonal ARIMA – any other approaches you would recommend to use?
Thanks for your appreciation. We used the concept of deseasonalization or seasonal adjustment. There are ARCH and GARCH style models which we wanted to use but found this algorithm as an easy and proper approach