THE TELENOR CASE
MENTOR – Dr. Subhabaha Pal
GROUP NAME – Grenadiers
PARTICIPANTS – 1. Abhinav Gaharwar ([email protected]), 2. Sanjeev Biswas ([email protected]), 3. Gauranga Mallick ([email protected]), 4. Rhishikesh Padole ([email protected]), 5. Sreekar ([email protected])
WEAPONS – R, Python
LIBRARIES USED – forecast, data.table, lubridate, tseries
1. BUSINESS UNDERSTANDING
In telecommunications, the dropped-call rate is the fraction of telephone calls which, due to technical reasons, were disconnected before the communicating parties had finished their conversation. This fraction is usually measured as a percentage of all calls. Similarly, mobile data sessions also fail, and these fails widely affect the telecommunication setup.
These inevitable drops and variations are the cause of communication dropping.
Based on one month of data containing such fails, we predict the amount of fails using the given dataset. We have tried to explore some questions using an ARIMA model.
2. DATA UNDERSTANDING
Data was provided in the form of a CSV file. The first part was to convert it into a readable format and load it into RStudio. The data comprises nearly 30 million rows and 10 columns describing mobile data fails.
The data is on an hourly basis. It contains information about which raven made the communication, depicting the cell, and over which network (2G/3G). The data also records who initiated the communication (family and member name), and it accounts for the different types of fails or errors.
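Reading a file of roughly 30 million rows is feasible with data.table::fread, which is much faster than read.csv. A minimal loading sketch (the file name telenor_data.csv is a placeholder, and dplyr/ggplot2 are loaded here because the pipelines below use them):
# libraries used throughout the analysis
library(data.table)   # fast CSV reading and in-memory filtering
library(dplyr)        # group_by / summarise pipelines
library(ggplot2)      # bar charts of fails per raven
library(tseries)      # adf.test
library(forecast)     # auto.arima, forecast
library(lubridate)    # date/time handling
# read the raw file; fread autodetects the separator and column types
# NOTE: "telenor_data.csv" is a placeholder file name
data <- fread("telenor_data.csv")
dim(data)   # expect roughly 30 million rows and 10 columns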
3. MODELLING
Step 1 – Exploratory Data Analysis
We used R for the data analysis, relying on vectorised operations to find the relations and dependencies among the columns.
# 1. Top 10 ravens with fails
# filter the data to rows with at least one nonzero delay
new_data1 = data[FIRST_DNS_RESPONSE_SUCCESS_D != 0 | PAGE_BROWSING_DELAY != 0 | TCP_SETUP_TOTAL_DELAY != 0 |
                   PAGE_CONTENT_DOWNLOAD_TOTAL_D != 0 | DNS_RESPONSE_SUCCESS_DELAY != 0 |
                   FIRST_TCP_RESPONSE_SUCCESS_D != 0 | PAGE_SR_DELAYS != 0 | SYN_SYN_DELAY != 0 | TCP_CONNECT_DELAY != 0 |
                   PAGE_BROWSING_DELAYS != 0]
# group the data by raven name and count the fails (total_fail) for each raven
# keep the top 10 ravens in descending order of total_fail
# plotting graph for top 10 raven names with fails
new_data1 %>% group_by(RAVEN_NAME) %>% summarise(total_fail = n()) %>% arrange(-total_fail) %>% head(10) %>%
  ggplot(aes(x = reorder(RAVEN_NAME, -total_fail), y = total_fail)) +
  geom_bar(stat = 'identity', width = .5, aes(fill = RAVEN_NAME)) +
  theme_bw() + labs(x = "RAVEN NAMES", y = "Total Fails") +
  geom_text(aes(label = total_fail), size = 5, vjust = -1) + theme(axis.text.x = element_blank())
RAVEN NAMES (x-axis) VS NO OF FAILS (y-axis)
SUMMARY->
Among the top 10 ravens, 'Brass Raven Birdy' has the maximum number of fails at 218372, followed by 'Brown Raven Ruby' with nearly 210211 fails.
# 2. Top 10 ravens without fails
# filter the data to rows where every delay is zero seconds (deliveries without fails)
new_data2 = data[FIRST_DNS_RESPONSE_SUCCESS_D == 0 & PAGE_BROWSING_DELAY == 0 & TCP_SETUP_TOTAL_DELAY == 0 &
                   PAGE_CONTENT_DOWNLOAD_TOTAL_D == 0 & DNS_RESPONSE_SUCCESS_DELAY == 0 &
                   FIRST_TCP_RESPONSE_SUCCESS_D == 0 & PAGE_SR_DELAYS == 0 & SYN_SYN_DELAY == 0 & TCP_CONNECT_DELAY == 0 &
                   PAGE_BROWSING_DELAYS == 0]
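As a quick consistency check, the two filters above should partition the original table: every row either has at least one nonzero delay (new_data1) or has all delays equal to zero (new_data2). A one-line sketch, assuming the delay columns contain no missing values:
# the two subsets together should account for every row of the original data
nrow(new_data1) + nrow(new_data2) == nrow(data)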
# group the data by raven name and count the zero-delay deliveries (total_success) for each raven
# keep the top 10 ravens in descending order of total_success
# plotting graph for top 10 raven names without fails
new_data2 %>% group_by(RAVEN_NAME) %>% summarise(total_success = n()) %>% arrange(-total_success) %>% head(10) %>%
  ggplot(aes(x = reorder(RAVEN_NAME, -total_success), y = total_success)) +
  geom_bar(stat = 'identity', width = .5, aes(fill = RAVEN_NAME)) +
  theme_bw() + labs(x = "RAVEN NAMES", y = "Total Successful Deliveries") +
  geom_text(aes(label = total_success), size = 5, vjust = -1) + theme(axis.text.x = element_blank())
RAVEN NAMES (x-axis) vs TOTAL SUCCESSFUL DELIVERIES (y-axis)
SUMMARY->
In the case of ravens without fails, 'Metallic Sunburst raven Polly' has the highest count of deliveries without fails at 297, followed by 'Green Sheen raven Azul' with a count of 211.
# 3. The family with most fails
# group the fail records by family name and count the fails per family
# take the family name with the highest number of fails
new_data1%>% group_by(FAMILY_NAME)%>% summarise(total_fail=n())%>% arrange(-total_fail)%>%head(1)
# The family with least fails
# group the fail records by family name and count the fails per family
# take the family name with the lowest number of fails
new_data1 %>% group_by(FAMILY_NAME)%>%summarise(total_fail=n())%>% arrange(-total_fail)%>%tail(1)
SUMMARY->
Among the families, the 'Targerian' family has nearly 8154000 fails, which is the largest count, and the 'Baelish' family has nearly 2744406 fails, which is the smallest count.
# 4. The family member with most fails
# group the fail records by member name and count the fails per member
# take the member name with the highest number of fails
new_data1 %>% group_by(MEMBER_NAME)%>% summarise(total_fail=n())%>% arrange(-total_fail)%>%head(1)
# The family member with least fails
# group the fail records by member name and count the fails per member
# take the member name with the lowest number of fails
new_data1 %>% group_by(MEMBER_NAME)%>% summarise(total_fail=n()) %>% arrange(-total_fail)%>%tail(1)
SUMMARY->
Among the family members, 'Petyr Baelish' has the most fails at 2744406, while 'Euron' has the least fails with a count of 491454.
Step 2 – Time Series Forecasting
# we are extracting the date and time from datetime column
s = as.Date(new_data1$DATETIME, "%Y-%m-%d %H:%M")
new_data1$date = format(s, "%Y-%m-%d")
new_data1$time = format(strptime(new_data1$DATETIME, "%Y-%m-%d %H:%M"), "%H:%M")
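Since lubridate is among the listed libraries, the same extraction can also be written with its parsers (a sketch, assuming DATETIME is stored as "YYYY-MM-DD HH:MM" strings):
# parse the timestamp once, then derive the date and time columns from it
parsed <- ymd_hm(new_data1$DATETIME)
new_data1$date <- as.Date(parsed)
new_data1$time <- format(parsed, "%H:%M")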
# group by date and count the number of fails per date
set=new_data1%>% select(date,RAVEN_NAME)%>%group_by(date)%>%summarise(count=n())
View(set)
# convert the daily fail counts of the Telenor sample data into a time series object
test=ts(set$count,start=c(1),frequency=10)
View(test)
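A note on the design choice in ts(): the series has one observation per day, and frequency = 10 groups every 10 days into one seasonal cycle. If weekly seasonality were suspected instead, the series could be built with frequency = 7 and the rest of the pipeline rerun unchanged (a sketch, not the setting used for the results below):
# alternative: treat the daily counts as having a weekly seasonal cycle
test_weekly <- ts(set$count, start = c(1), frequency = 7)
plot(test_weekly)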
# plot the series to inspect its trend
plot(test)
# Applying Augmented Dickey-Fuller Test on test data to find p-Value
adf.test(test, alternative =”stationary”, k=12)
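The ADF test result can be inspected programmatically: if the p-value stays above 0.05 the series is treated as non-stationary and differenced before modelling, which is what the d = 1 and D = 1 terms below do. A sketch of how this decision could be checked with the forecast package:
# adf.test() returns an htest object, so the p-value can be read directly
adf_result <- adf.test(test, alternative = "stationary", k = 12)
adf_result$p.value
# ndiffs() suggests how many ordinary differences are needed for stationarity
ndiffs(test)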
# Applying auto.arima to automatically select the best-fitting ARIMA order for the series
auto.arima(test)
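auto.arima() returns a fitted model object, so its automatically selected order can be forecast from directly and used as a cross-check on the manually specified ARIMA(0,1,1)(0,1,0) fit below (a sketch):
# keep the automatically selected model and forecast from it directly
auto_fit <- auto.arima(test)
summary(auto_fit)                       # selected order, AICc and sigma^2
auto_forecast <- forecast(auto_fit, h = 4)
auto_forecast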
# Fitting ARIMA(0,1,1)(0,1,0) and inspecting the estimated coefficients and innovation variance (sigma^2)
arimatest=arima(test,order=c(0,1,1),seasonal=c(0,1,0))
arimatest
# Forecasting the number of data delays for the next 4 days
arimafuture = forecast(arimatest, h = 4)
plot(x = arimafuture, shadebars = TRUE, type = 'b', ylab = 'no of delays', xlab = 'Days', col = 'blue', fcol = 'red')
#OUTPUT->
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
4.10 956875.8 925029.9 988721.7 908171.7 1005579.9
4.20 922865.8 878790.8 966940.8 855458.9 990272.7
4.30 902749.8 849167.6 956332.0 820802.9 984696.7
4.40 926953.8 865313.7 988593.9 832683.4 1021224.2
SUMMARY->
Using the ARIMA model, we predict that on 7 August 2018 the data delay will be 956875.8, varying between 925029.9 and 988721.7 at the 80% level, or between 908171.7 and 1005579.9 at the 95% level.
On 8 August 2018, the data delay will be 922865.8, varying between 878790.8 and 966940.8 at the 80% level, or between 855458.9 and 990272.7 at the 95% level.
On 9 August 2018, the data delay will be 902749.8, varying between 849167.6 and 956332.0 at the 80% level, or between 820802.9 and 984696.7 at the 95% level.
On 10 August 2018, the data delay will be 926953.8, varying between 865313.7 and 988593.9 at the 80% level, or between 832683.4 and 1021224.2 at the 95% level.
DAYS (x-axis) VS NUMBER OF DELAYS (y-axis)
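As an additional check before trusting the four-day forecast, the same ARIMA specification could be evaluated on a holdout: fit on all but the last four observations and compare the forecasts against the held-out values with accuracy() from the forecast package (a sketch, assuming the series is long enough to spare four points):
# hold out the last 4 observations as a pseudo test set
n <- length(test)
train <- window(test, end = time(test)[n - 4])
holdout <- window(test, start = time(test)[n - 3])
# refit the same ARIMA(0,1,1)(0,1,0) specification on the training part only
fit_train <- arima(train, order = c(0, 1, 1), seasonal = c(0, 1, 0))
fc_train <- forecast(fit_train, h = 4)
# MAE / RMSE / MAPE of the 4-step-ahead forecasts against the held-out values
accuracy(fc_train, holdout)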
EVALUATION –
- Time series analysis combined with exploratory data analysis provides a solution for predicting the future amount of data drops.
- APPROACH 1 – Exploratory data analysis shows the dependency and correlation of ravens, families and family members with data drops.
- APPROACH 2 – Time series analysis predicts the future data drops using ARIMA modelling.
- The results prove to be satisfactory in predicting the data drops.
- It was very challenging to analyse and process such a huge dataset.
2 thoughts on “Datathon Telenor Solution – WRANGLING WITH DATA DROPS”
Team, please do describe the process of how you prepared the data and built and evaluated the model(s) for prediction. For any "scientific" article, and for anybody who has access to the data, the whole process should be repeatable, i.e. anyone should be able to take your code/work and get the same end results. Also, the focus of this case is "The main task is to predict the fails in the next four days (on both files).", which is clearly stated in the case description, so please focus on adding this part of the information to your article.
Good work. As Tomislav said, it will be nice to have some more descriptions of your process steps.