Analysis Of Mobile Data Connectivity Delays
Names – Aashi Agarwal, Ananya Jena, Devesh Tripathi, Prasad Shripathi
The Telenor Case – What do Game of Thrones and Telecoms Have in Common?
- Team Toolset : RStudio, MS Excel
DATA SCIENCE LIFECYCLE :
Business Understanding ::
Sending ravens is one of the most fundamental parameters in mobile communications engineering. For land-based mobile communications, variation in the received raven is primarily the result of multipath fading caused by obstacles such as buildings (or clutter) or terrain irregularities; the distance between link end points; predatory animals; and interference among multiple transmissions, for example wars. This inevitable raven variation is the cause of communication dropping, one of the most significant quality-of-service measures in operative communication. For this reason, various techniques and schemes are employed in the planning, design and optimization of raven networks to combat these propagation effects. These normally cover the physical configuration of the network, which includes all aspects of network infrastructure deployment such as locations of base nests, additional food, and sometimes guards. A typical example of these schemes and techniques is the use of models for flight prediction based on measured data.
Based on one month of data with flight fails, we have to perform a time-series analysis and predict the future number of fails.
To analyse the data set, we are required to perform these tasks through Exploratory Data Analysis :
-Top 10 ravens with fails
-Top 10 ravens without fails
-The family with most fails
-The family with least fails
-The family member with most fails
-The family member with least fails
Data Understanding ::
We are working on the mobile data fails dataset.
As we analyse the data,
- Every row covers a fifteen-minute interval of communication
- It contains information about which raven made the communication and by which network (2G/3G/4G).
- Also, who initiated the communication – family and member name
- Different types of fails – sum of delays.
The given dataset contains 22401305 rows and 16 columns in CSV file format. The columns represent :
- DateTime : The date and time at which the raven made the communication.
- Raven_name : The name of the raven which made the communication.
- Family_name : The family name of the person who initiated the communication.
- Member_name : The name of the person who initiated the communication.
- Network : The network (2G/3G/4G) through which the raven made the communication.
The remaining columns (column 6 to column 16) represent different types of delays. There are 11 such columns :
- FIRST_GET_RESPONSE_SUCCESS_D
- PAGE_BROWSING_DELAY
- TCP_SETUP_TOTAL_DELAY
- PAGE_CONTENT_DOWNLOAD_TOTAL_D
- FIRST_DNS_RESPONSE_SUCCESS_D
- DNS_RESPONSE_SUCCESS_DELAY
- FIRST_TCP_RESPONSE_SUCCESS_D
- PAGE_SR_DELAYS
- SYN_SYN_DELAY
- TCP_CONNECT_DELAY
- PAGE_BROWSING_DELAYS
In the above columns, ‘0’ signifies that the message was passed successfully without any delay.
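The rule above (a non-zero value in a delay column marks a fail) can be written once and reused instead of repeating eleven comparisons in every filter. The following is a minimal sketch on a made-up table with only three of the delay columns; `toy`, `delay_cols`, `any_fail` and `all_fail` are illustrative names that do not appear in the original analysis, and `if_any()` / `if_all()` assume dplyr 1.0.4 or later.

```r
library(dplyr)

# Toy data: three of the eleven delay columns, values are made up
toy <- data.frame(
  RAVEN_NAME            = c("A", "A", "B", "C"),
  PAGE_BROWSING_DELAY   = c(0, 5, 0, 2),
  TCP_SETUP_TOTAL_DELAY = c(0, 0, 0, 3),
  SYN_SYN_DELAY         = c(0, 1, 0, 4)
)

delay_cols <- c("PAGE_BROWSING_DELAY", "TCP_SETUP_TOTAL_DELAY", "SYN_SYN_DELAY")

flagged <- toy %>%
  mutate(
    any_fail = if_any(all_of(delay_cols), ~ .x != 0),  # at least one delay is non-zero
    all_fail = if_all(all_of(delay_cols), ~ .x != 0)   # every delay is non-zero
  )

print(flagged)
```

The two flags are not interchangeable: `any_fail` matches a filter that joins the conditions with `|`, its negation matches "all delays equal 0", and `all_fail` matches a filter that joins the conditions with `&`.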
Reading the Data and Preprocessing:
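#rcode ::
library(data.table)  # provides fread()
library(dplyr)       # provides %>%, filter(), group_by(), summarise(), arrange()
data = fread(input = "C:/Users/Administrator/Desktop/data.csv")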
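With more than 22 million rows, it can help to read only the columns a task needs. The sketch below is one possible way to do that with the `select` argument of `fread()`; it writes a tiny stand-in CSV first so that it runs without the real file. The header names and the two sample rows are assumptions made for the example, not values taken from the dataset.

```r
library(data.table)

# Write a tiny stand-in CSV so the sketch runs without the real file
# (column names and values here are assumed for illustration)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "DATETIME,RAVEN_NAME,FAMILY_NAME,MEMBER_NAME,NETWORK,PAGE_BROWSING_DELAY",
  "2018-07-06 00:00,Axel,FamilyA,MemberA,4G,0",
  "2018-07-06 00:15,Axel,FamilyA,MemberA,4G,12"
), tmp)

# Read only the columns needed for a given task
small <- fread(tmp, select = c("RAVEN_NAME", "PAGE_BROWSING_DELAY"))
print(dim(small))
```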
Task 1: Top 10 ravens with fails
#rcode ::
total_delay_with_fails = data%>%filter(FIRST_GET_RESPONSE_SUCCESS_D !=0 | PAGE_BROWSING_DELAY !=0 |
TCP_SETUP_TOTAL_DELAY !=0 | PAGE_CONTENT_DOWNLOAD_TOTAL_D !=0 |
FIRST_DNS_RESPONSE_SUCCESS_D !=0 | DNS_RESPONSE_SUCCESS_DELAY !=0 |
FIRST_TCP_RESPONSE_SUCCESS_D !=0 | PAGE_SR_DELAYS !=0 |
SYN_SYN_DELAY !=0 | TCP_CONNECT_DELAY !=0 | PAGE_BROWSING_DELAYS !=0)%>%
group_by(RAVEN_NAME)%>%
summarise(count = n())%>%
arrange(-count)%>%
head(10)
View(total_delay_with_fails)
Task 2 : Top 10 ravens without fails
#rcode ::
total_delay_without_fails = data%>%filter(FIRST_GET_RESPONSE_SUCCESS_D ==0 & PAGE_BROWSING_DELAY ==0 &
TCP_SETUP_TOTAL_DELAY ==0 & PAGE_CONTENT_DOWNLOAD_TOTAL_D ==0 &
FIRST_DNS_RESPONSE_SUCCESS_D ==0 & DNS_RESPONSE_SUCCESS_DELAY ==0 &
FIRST_TCP_RESPONSE_SUCCESS_D ==0 & PAGE_SR_DELAYS ==0 &
SYN_SYN_DELAY ==0 & TCP_CONNECT_DELAY ==0 & PAGE_BROWSING_DELAYS ==0)%>%
group_by(RAVEN_NAME)%>%
summarise(count = n())%>%
arrange(-count)
View(total_delay_without_fails)
top_10_without_fails = total_delay_without_fails%>%head(10)
View(top_10_without_fails)
Task 3 : The family with most fails
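#rcode ::
# Note: '&' keeps only rows in which every delay column is non-zero, a stricter condition than the '|' used in Task 1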
family_with_fails = data%>%filter(FIRST_GET_RESPONSE_SUCCESS_D !=0 & PAGE_BROWSING_DELAY !=0 &
TCP_SETUP_TOTAL_DELAY !=0 & PAGE_CONTENT_DOWNLOAD_TOTAL_D !=0 &
FIRST_DNS_RESPONSE_SUCCESS_D !=0 & DNS_RESPONSE_SUCCESS_DELAY !=0 &
FIRST_TCP_RESPONSE_SUCCESS_D !=0 & PAGE_SR_DELAYS !=0 &
SYN_SYN_DELAY !=0 & TCP_CONNECT_DELAY !=0 & PAGE_BROWSING_DELAYS !=0)%>%
group_by(FAMILY_NAME)%>%
summarise(count = n())%>%
arrange(-count)
View(family_with_fails)
family_with_most_fails = family_with_fails%>%head(1)
View(family_with_most_fails)
Task 4 : The family with least fails
#rcode ::
family_with_least_fails = family_with_fails%>%tail(1)
View(family_with_least_fails)
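Because `family_with_fails` is built by filtering to fail rows first, a family with no fail rows at all would be missing from the table, and `tail(1)` would then return the family with the fewest non-zero fails. Whether this matters depends on the data. The sketch below uses made-up rows and a hypothetical `is_fail` flag (not a column of the original dataset) to show the difference between the two approaches.

```r
library(dplyr)

# Made-up rows; is_fail marks rows that meet the fail condition
toy <- data.frame(
  FAMILY_NAME = c("X", "X", "Y", "Y", "Z"),
  is_fail     = c(TRUE, TRUE, TRUE, FALSE, FALSE)
)

# Filtering first drops family Z entirely, so it can never be reported as "least"
filtered_counts <- toy %>%
  filter(is_fail) %>%
  group_by(FAMILY_NAME) %>%
  summarise(count = n()) %>%
  arrange(-count)

# Summing the flag keeps every family, including those with zero fails
all_counts <- toy %>%
  group_by(FAMILY_NAME) %>%
  summarise(count = sum(is_fail)) %>%
  arrange(-count)

print(tail(filtered_counts, 1))  # family Y, 1 fail
print(tail(all_counts, 1))       # family Z, 0 fails
```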
Task 5 : The family member with most fails
#rcode ::
family_member_with_fails = data%>%filter(FIRST_GET_RESPONSE_SUCCESS_D !=0 & PAGE_BROWSING_DELAY !=0 &
TCP_SETUP_TOTAL_DELAY !=0 & PAGE_CONTENT_DOWNLOAD_TOTAL_D !=0 &
FIRST_DNS_RESPONSE_SUCCESS_D !=0 & DNS_RESPONSE_SUCCESS_DELAY !=0 &
FIRST_TCP_RESPONSE_SUCCESS_D !=0 & PAGE_SR_DELAYS !=0 &
SYN_SYN_DELAY !=0 & TCP_CONNECT_DELAY !=0 & PAGE_BROWSING_DELAYS !=0)%>%
group_by(MEMBER_NAME)%>%
summarise(count = n())%>%
arrange(-count)
View(family_member_with_fails)
family_member_with_most_fails = family_member_with_fails%>%head(1)
View(family_member_with_most_fails)
Task 6 : The family member with least fails
#rcode ::
family_member_with_least_fails = family_member_with_fails%>%tail(1)
View(family_member_with_least_fails)
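Grouping by `MEMBER_NAME` alone would merge members of different families who happen to share a name. Whether such duplicates exist in this dataset is not stated, so the following is only a hedged sketch on made-up names showing how grouping by both `FAMILY_NAME` and `MEMBER_NAME` keeps them apart.

```r
library(dplyr)

# Made-up rows: "Jon" appears in two different families
toy <- data.frame(
  FAMILY_NAME = c("X", "X", "Y", "Y", "Y"),
  MEMBER_NAME = c("Jon", "Jon", "Jon", "Ann", "Ann")
)

# Grouping by member name only: both Jons are counted together
by_member <- toy %>%
  group_by(MEMBER_NAME) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(-count)

# Grouping by family and member: each Jon is counted separately
by_family_member <- toy %>%
  group_by(FAMILY_NAME, MEMBER_NAME) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(-count)

print(by_member)
print(by_family_member)
```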
EDA – grouping and plotting :
For task 1 : Top 10 ravens with fails
plot for task1 :
library(ggplot2)
ggplot(total_delay_with_fails, aes(x=RAVEN_NAME, y=count, fill=count))+
scale_fill_gradient(low='deeppink', high = 'deeppink4')+
geom_segment(data = total_delay_with_fails, aes(x=RAVEN_NAME, xend=RAVEN_NAME, y=0, yend=count), linetype='solid', color='orchid4', size=5) +
geom_label(aes(label=count), color='white', size=4, vjust=-0.9, label.size = 0, fontface='bold' )+
coord_flip()+ theme_gray()+
theme(axis.text.y = element_text(face='bold'))
Conclusion : These are the top 10 ravens with the most fails (delays). Brass raven Birdy failed (or was delayed) most often, 162617 times, followed by Brown raven Ruby (156739), Yellow raven Rio (155702), Blue raven Axel (141519), Razzle Dazzle Rose raven Cleo (139054), Cadmium Red raven Bubba (137518), Vain And Lazy raven Polly (131633), Fearful Carrion raven Gizmo (130680), Blast Off Bronze raven Zazu (126439) and Lilac Luster raven Pegasus (126317).
For task 2 : Top 10 ravens without fails
plot for task2 :
ggplot(top_10_without_fails, aes(x=RAVEN_NAME, y=count, fill=count))+
scale_fill_gradient(low='deeppink', high = 'deeppink4')+
geom_segment(data = top_10_without_fails, aes(x=RAVEN_NAME, xend=RAVEN_NAME, y=0, yend=count), linetype='solid', color='orchid4', size=5) +
geom_label(aes(label=count), color='white', size=4, vjust=-0.9, label.size = 0, fontface='bold' )+
coord_flip()+ theme_gray()+
theme(axis.text.y = element_text(face='bold'))
Conclusion : These are the top 10 ravens with the most communications without fails (delays): Metallic Sunburst raven Polly (225), Green Sheen raven Azul (161), Less Combative raven Zazu (141), Weak raven Buddy (136), Copper raven Tweety (130), Spectral Yellow raven Zazu (118), Mythical raven Tiki (90), Mysterious And Venerable raven Bubba (78), Cyber Grape raven Faith (77) and Shadow Blue raven Sammy (77).
For task 3 & 4 : The family with most fails and with least fails
plot for task3&4 :
ggplot(family_with_fails, aes(x= reorder(FAMILY_NAME,-count), y = count))+
geom_bar(stat = 'identity', width = 0.5,aes(fill = FAMILY_NAME))+theme_bw()+
geom_text(aes(label=count), vjust=-0.3, size=3.5)+
labs(title = "FAMILY WITH MOST AND LEAST FAILS", subtitle = "delays", x = "family name", y = "total fails")+
theme(legend.position = "right")+theme(legend.key.width = unit(.5,"cm"),legend.key.height = unit(.5,"cm"))+
theme(legend.title = element_blank())+
theme(axis.text.x = element_blank(),axis.ticks.x = element_blank(),axis.title.x = element_blank())
Conclusion : The family with the most fails is Targerian with 363100 fails, and the family with the fewest fails is Lannister with 158227 fails.
For task 5 : The family member with most fails
plot for task5 :
ggplot(family_member_with_most_fails, aes(x= reorder(MEMBER_NAME,-count), y = count))+
geom_bar(stat = 'identity', width = 0.5,aes(fill = MEMBER_NAME))+theme_bw()+
geom_text(aes(label=count), vjust=-0.3, size=3.5)+
labs(title = "FAMILY MEMBER WITH MOST FAILS", subtitle = "delays", x = "member name", y = "total fails")+
theme(legend.position = "right")+theme(legend.key.width = unit(.5,"cm"),legend.key.height = unit(.5,"cm"))+
theme(legend.title = element_blank())+
theme(axis.text.x = element_blank(),axis.ticks.x = element_blank(),axis.title.x = element_blank())
Conclusion : The family member with the most fails is Petyr Baelish, with 160210 fails.
For task 6 : The family member with least fails
plot for task6 :
ggplot(family_member_with_least_fails, aes(x= reorder(MEMBER_NAME,-count), y = count))+
geom_bar(stat = 'identity', width = 0.5,aes(fill = MEMBER_NAME))+theme_bw()+
geom_text(aes(label=count), vjust=-0.3, size=3.5)+
labs(title = "FAMILY MEMBER WITH LEAST FAILS", subtitle = "delays", x = "member name", y = "total fails")+
theme(legend.position = "right")+theme(legend.key.width = unit(.5,"cm"),legend.key.height = unit(.5,"cm"))+
theme(legend.title = element_blank())+
theme(axis.text.x = element_blank(),axis.ticks.x = element_blank(),axis.title.x = element_blank())
Conclusion : The family member with the fewest fails is Euron, with 11759 fails.
Time Series Forecasting
TIME SERIES ANALYSIS : Time series analysis is a statistical technique that deals with time series data, that is, data recorded over a series of particular time periods or intervals, and with trends in such data. ARIMA models are a popular and flexible class of forecasting models that use historical information to make predictions.
Objectives :
- Plotting, analysing and preparing the series for modelling
- Testing for stationarity and applying appropriate transformations
- Deciding on the order of an ARIMA model
- Forecasting the series
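The objectives above can be sketched end to end on simulated data. This is a minimal illustration using only base R (the `stats` package), not the analysis performed on the raven data: the simulated series `y`, the candidate orders and the forecast horizon of 10 are assumptions chosen for the example, and a formal stationarity test is left out because it needs an additional package.

```r
set.seed(1)

# Simulated stand-in for a fail-count series: an ARIMA(1,1,1) process
y <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.5, ma = 0.3), n = 200)

# 1. Prepare the series: difference once to remove the stochastic trend
dy <- diff(y)

# 2. Inspect autocorrelations of the differenced series
#    (in an interactive session, drop plot = FALSE to draw the correlograms)
acf_vals  <- acf(dy, plot = FALSE)
pacf_vals <- pacf(dy, plot = FALSE)

# 3. Compare candidate (p, d, q) orders by AIC; a failed fit is recorded as NA
orders <- list(c(1, 1, 1), c(1, 1, 2), c(2, 1, 1), c(2, 1, 2))
aics <- sapply(orders, function(o) {
  tryCatch(AIC(arima(y, order = o, method = "ML")), error = function(e) NA_real_)
})
best <- orders[[which.min(aics)]]

# 4. Refit the order with the lowest AIC and forecast 10 steps ahead
fit <- arima(y, order = best, method = "ML")
fc  <- predict(fit, n.ahead = 10)

print(best)
print(fc$pred)
```

Fitting with `d = 1` on the undifferenced series means the forecasts in `fc$pred` are on the same scale as `y`, rather than on the scale of its differences.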
ARIMA model::
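Based on one month of data with browsing fails, we have to use time-series analysis to predict the future number of fails.
The following code is written in RStudio ::
library(tseries)   # provides adf.test()
library(forecast)  # provides ndiffs(), auto.arima(), forecast(), nnetar()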
total_date = data%>%filter(FIRST_GET_RESPONSE_SUCCESS_D !=0 & PAGE_BROWSING_DELAY !=0 &
TCP_SETUP_TOTAL_DELAY !=0 & PAGE_CONTENT_DOWNLOAD_TOTAL_D !=0 &
FIRST_DNS_RESPONSE_SUCCESS_D !=0 & DNS_RESPONSE_SUCCESS_DELAY !=0 &
FIRST_TCP_RESPONSE_SUCCESS_D !=0 & PAGE_SR_DELAYS !=0 &
SYN_SYN_DELAY !=0 & TCP_CONNECT_DELAY !=0)%>%
group_by(DATETIME)%>%
summarise(count = n())
View(total_date)
## Date-time series
## Using the ts function we turn the counts into a time series, i.e. stating that the number of fails depends on the date-time column.
## ts() cannot take a year-month-day start; with one observation per fifteen-minute interval a day has 96 observations,
## so the time index counts days from the first day in the data (assuming every interval is present)
date_ts = ts(total_date$count, frequency = 96)
##Preparing the difference variable
d_total_sum = diff(total_date$count)
## ADF test to check the differenced series for stationarity
adf.test(d_total_sum)
## estimating AR model
fit_diff_ar = arima(d_total_sum, order = c(1, 0, 2))
d_total_sum = diff(d_total_sum, 96)
acf(d_total_sum)
pacf(d_total_sum)
head(total_date$DATETIME)
fit_diff_ar = arima(d_total_sum, order = c(1, 0, 2))
fit_diff_ar
#(p,d,q)
#(1,0,1)
#(1,0,2)
#(2,0,1)
#(2,0,2)
AIC(arima(d_total_sum,c(1,0,1)))
AIC(arima(d_total_sum,c(1,0,2)))
AIC(arima(d_total_sum,c(2,0,1)))
AIC(arima(d_total_sum,c(2,0,2)))
d = ndiffs(d_total_sum)
auto.arima(d_total_sum)
## forecasting
fit_diff_arf = forecast(fit_diff_ar, h = 384)
plot(forecast(fit_diff_ar, h = 384))
model = nnetar(d_total_sum)
nnetar(d_total_sum)
temp = forecast(object = model,h = 384)
plot(temp, 384)
## Forecasting from ARIMA daywise
data2 = data%>%filter(FIRST_GET_RESPONSE_SUCCESS_D !=0 & PAGE_BROWSING_DELAY !=0 &
TCP_SETUP_TOTAL_DELAY !=0 & PAGE_CONTENT_DOWNLOAD_TOTAL_D !=0 &
FIRST_DNS_RESPONSE_SUCCESS_D !=0 & DNS_RESPONSE_SUCCESS_DELAY !=0 &
FIRST_TCP_RESPONSE_SUCCESS_D !=0 & PAGE_SR_DELAYS !=0 &
SYN_SYN_DELAY !=0 & TCP_CONNECT_DELAY !=0)%>%
group_by(DATETIME, MEMBER_NAME) %>%
summarise(total = n())
extract = as.Date(data2$DATETIME, "%Y-%m-%d %H:%M")
data2$Date = format(extract, "%Y-%m-%d")
View(data2)
data3 = data2%>%group_by(Date)%>%summarise(total.sum = sum(total))
sum(data3$total.sum)
View(data3)
## ts() cannot take a year-month-day start; the daily series is indexed 1, 2, ... from the first day in the data
date_ts = ts(data3$total.sum, frequency = 1)
d_total_sum = diff(data3$total.sum, 1)
fit_diff_ar = arima(d_total_sum, order = c(1, 0, 2))
auto.arima(d_total_sum)
fit_diff_ar
## forecasting
fit_diff_arf = forecast(fit_diff_ar, h = 4)
plot(forecast(fit_diff_ar, h = 4))
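The models above are fitted to the differenced series, so their forecasts are predicted changes in the number of fails rather than fail counts. One way to get back to counts is to add the predicted changes cumulatively to the last observed value. The sketch below shows this with base R only; `daily_fails` is made-up data standing in for `data3$total.sum`, and the AR(1) order is an arbitrary choice for the illustration, not the order used in the analysis.

```r
set.seed(42)

# Made-up daily fail counts for 31 days (stand-in for data3$total.sum)
daily_fails <- round(1000 + cumsum(rnorm(31, mean = 5, sd = 30)))

# Model the day-to-day changes
d_fails  <- diff(daily_fails)
fit_diff <- arima(d_fails, order = c(1, 0, 0), method = "ML")

# Forecast the next 4 changes ...
pred_diff <- as.numeric(predict(fit_diff, n.ahead = 4)$pred)

# ... and add them cumulatively to the last observed count
pred_counts <- tail(daily_fails, 1) + cumsum(pred_diff)
print(pred_counts)
```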
PLOTTING:
Here, the x axis represents date-time and the y axis represents the number of fails.