
Datathon Telenor Solution – WRANGLING WITH DATA DROPS

This article proposes a tractable approach to modelling changes in regime. Date and time are the parameters used to frame the outcome of this analysis.
In the 21st century, cell phones are the most commonly used and most important wireless technology. They are found in the hands of people of every age group and every region. India has a population of 1.32 billion and nearly 340 million cell phones. Phones are used for calls, messaging, and downloading and uploading data over the internet.
At times a user encounters communication issues such as call termination, data drops in the middle of a session, or wrong connections, which may affect the overall experience of network subscribers. Telecom service providers have to implement data management technology and improve their infrastructure to minimize the effect of call drops and data drops and to provide quality service to their customers.
Nearly all signals contain energy at harmonic frequencies in addition to the energy at the fundamental frequency. If all the energy in a signal is contained at the fundamental frequency, that signal is a perfect sine wave. Telecommunication signals also contain many harmonics, which are strongly affected by semiconductor interfacing and by physical or digital barriers.
Keywords: Data drop, Call drop


THE TELENOR CASE

MENTOR – Dr. Subhabaha Pal

GROUP NAME – Grenadiers

PARTICIPANTS – 1. Abhinav Gaharwar 2. Sanjeev Biswas 3. Gauranga Mallick 4. Rhishikesh Padole 5. Sreekar

WEAPON – R, Python

LIBRARY USED – forecast, data.table, lubridate, tseries

 

1. BUSINESS UNDERSTANDING

In telecommunications, the dropped-call rate is the fraction of telephone calls that, for technical reasons, were disconnected before the communicating parties had finished their conversation. This fraction is usually measured as a percentage of all calls. Mobile data fails occur in the same way and widely affect the telecommunication setup.
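As a small worked example of the rate defined above, the following sketch computes a dropped-call rate and a data-fail rate as percentages. All counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the Telenor data.

```r
# Hypothetical counts, for illustration only (not from the Telenor data).
total_calls   <- 2500
dropped_calls <- 40

# Fraction of calls disconnected early, and the same value as a percentage.
drop_fraction <- dropped_calls / total_calls
drop_percent  <- 100 * drop_fraction

# The same arithmetic applies to data sessions: failed sessions / all sessions.
total_sessions  <- 12000
failed_sessions <- 300
fail_percent    <- 100 * failed_sessions / total_sessions

cat(sprintf("dropped-call rate: %.1f%%\n", drop_percent))
cat(sprintf("data-fail rate:    %.1f%%\n", fail_percent))
```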

These drops and variations cannot be avoided entirely, and they are the reason communication is lost.

Based on one month of data on flight fails, we predict the number of fails from the given dataset. We explore several questions about the data and then forecast with an ARIMA model.

 

2. DATA UNDERSTANDING

The data was provided as a CSV file. The first step was to load it into RStudio in a readable format. The data comprises nearly 30 million rows and 10 columns about mobile data fails.

The data is recorded on an hourly basis. It records which raven carried the communication (the raven represents the cell) and over which network (2G/3G). It also records who initiated the communication, as a family name and a member name, and it accounts for the different types of fails or errors.
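The loading step described above might look like the sketch below. It is an assumption about how the file could be read, not the team's actual code: the file here is a tiny stand-in written to a temporary path, the values are made up, and only the column names are borrowed from the code later in this article. `fread()` from the data.table package (listed among the libraries used) is a common choice for files of this size, with base R's `read.csv()` as a fallback.

```r
# Write a tiny stand-in CSV so the sketch is self-contained.
# Column names follow the code later in this article; the values are made up.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "DATETIME,RAVEN_NAME,FAMILY_NAME,MEMBER_NAME,TCP_CONNECT_DELAY",
  "2018-07-07 00:00,Raven A,Family X,Member 1,0",
  "2018-07-07 01:00,Raven B,Family Y,Member 2,35",
  "2018-07-08 00:00,Raven A,Family X,Member 1,12"
), path)

# fread() is much faster than read.csv() on tens of millions of rows;
# fall back to base R when data.table is not installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  data <- data.table::fread(path)
} else {
  data <- read.csv(path, stringsAsFactors = FALSE)
}

# First checks after loading: size, column types, and the dates covered.
print(dim(data))
str(data)
print(range(as.Date(substr(as.character(data$DATETIME), 1, 10))))
```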

3. MODELLING

STEP 1 – EXPLORATORY DATA ANALYSIS

We used R for the data analysis, with vectorized operations to find the relations and dependencies among the columns.
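To illustrate what a vectorized operation over the delay columns can look like, here is a minimal base R sketch on made-up rows. It assumes that a row counts as a fail when any delay column is nonzero, which is how the filters later in this article treat the data. Selecting columns by the pattern "DELAY" is a shortcut for this toy table only; the real file also has columns ending in `_D` that would have to be listed explicitly.

```r
# Made-up rows; column names are borrowed from the code in this article.
data <- data.frame(
  RAVEN_NAME            = c("Raven A", "Raven B", "Raven A", "Raven C"),
  PAGE_BROWSING_DELAY   = c(0, 120, 0, 0),
  TCP_SETUP_TOTAL_DELAY = c(0, 0, 15, 0),
  TCP_CONNECT_DELAY     = c(0, 30, 0, 0)
)

# Every column whose name contains "DELAY" is treated as a delay measure.
delay_cols <- grep("DELAY", names(data), value = TRUE)

# One vectorized pass: a row counts as a fail when any delay is nonzero.
data$is_fail <- rowSums(data[, delay_cols] != 0) > 0

# Pairwise correlation between the delay columns, as a first look at dependency.
print(cor(data[, delay_cols]))
print(data)
```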

 

 

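1

#Top 10 ravens with fails

library(data.table)   # the data[...] filter below assumes `data` is a data.table
library(dplyr)        # %>%, group_by, summarise, arrange
library(ggplot2)      # bar chart

# keep the rows where at least one delay column is nonzero (treated as a fail)
new_data1 = data[FIRST_DNS_RESPONSE_SUCCESS_D != 0 | PAGE_BROWSING_DELAY != 0 | TCP_SETUP_TOTAL_DELAY != 0 |
                 PAGE_CONTENT_DOWNLOAD_TOTAL_D != 0 | DNS_RESPONSE_SUCCESS_DELAY != 0 |
                 FIRST_TCP_RESPONSE_SUCCESS_D != 0 | PAGE_SR_DELAYS != 0 | SYN_SYN_DELAY != 0 | TCP_CONNECT_DELAY != 0 |
                 PAGE_BROWSING_DELAYS != 0]
# group the data by raven name and count the rows (total_fail) for each raven
# take the top 10 ravens in descending order of total_fail and plot them as a bar chart
new_data1 %>% group_by(RAVEN_NAME) %>% summarise(total_fail = n()) %>% arrange(-total_fail) %>% head(10) %>%
  ggplot(aes(x = reorder(RAVEN_NAME, -total_fail), y = total_fail)) +
  geom_bar(stat = 'identity', width = .5, aes(fill = RAVEN_NAME)) +
  theme_bw() + labs(x = "RAVEN NAMES", y = "Total Fails") +
  geom_text(aes(label = total_fail), size = 5, vjust = -1) + theme(axis.text.x = element_blank())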


RAVEN NAMES (X-AXIS) VS NO. OF FAILS (Y-AXIS)

SUMMARY->

Among the top 10 ravens, ‘Brass Raven Birdy’ has the highest number of fails, 218372, followed by ‘Brown Raven Ruby’ with 210211 fails.
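A ranking by raw counts tends to favour the ravens that carry the most traffic. As a possible complement, and not something the analysis above does, the sketch below also computes a fail rate per raven (fails divided by all deliveries). The rows and the `is_fail` flag are made up for illustration.

```r
# Made-up deliveries: one row per delivery, flagged as fail or not.
deliveries <- data.frame(
  RAVEN_NAME = c("Raven A", "Raven A", "Raven A", "Raven A", "Raven B", "Raven B"),
  is_fail    = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)
)

# Number of fails and share of failed deliveries for each raven.
fail_count <- tapply(deliveries$is_fail, deliveries$RAVEN_NAME, sum)
fail_rate  <- tapply(deliveries$is_fail, deliveries$RAVEN_NAME, mean)

ranking <- data.frame(
  RAVEN_NAME = names(fail_count),
  total_fail = as.vector(fail_count),
  fail_rate  = as.vector(fail_rate)
)

# Sort by count first and use the rate to break ties.
ranking <- ranking[order(-ranking$total_fail, -ranking$fail_rate), ]
print(ranking)
```

Here both ravens have the same number of fails, but Raven B fails on every delivery while Raven A fails on half of them, which the count alone does not show.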

 

2

#Top 10 ravens without fails

# keep only the rows where every delay column is zero (deliveries without fails)
new_data2 = data[FIRST_DNS_RESPONSE_SUCCESS_D == 0 & PAGE_BROWSING_DELAY == 0 & TCP_SETUP_TOTAL_DELAY == 0 &
                 PAGE_CONTENT_DOWNLOAD_TOTAL_D == 0 & DNS_RESPONSE_SUCCESS_DELAY == 0 &
                 FIRST_TCP_RESPONSE_SUCCESS_D == 0 & PAGE_SR_DELAYS == 0 & SYN_SYN_DELAY == 0 & TCP_CONNECT_DELAY == 0 &
                 PAGE_BROWSING_DELAYS == 0]
# group the data by raven name and count the rows for each raven
# (the column is still named total_fail, but here it counts deliveries without fails)
# take the top 10 ravens in descending order of that count
# plot a bar chart of the top 10 raven names without fails
new_data2 %>% group_by(RAVEN_NAME) %>% summarise(total_fail = n()) %>% arrange(-total_fail) %>% head(10) %>%
  ggplot(aes(x = reorder(RAVEN_NAME, -total_fail), y = total_fail)) +
  geom_bar(stat = 'identity', width = .5, aes(fill = RAVEN_NAME)) +
  theme_bw() + labs(x = "RAVEN NAMES", y = "Total Successful Deliveries") +
  geom_text(aes(label = total_fail), size = 5, vjust = -1) + theme(axis.text.x = element_blank())

 

RAVEN NAMES (X-AXIS) VS NO. OF DELIVERIES WITHOUT FAILS (Y-AXIS)

SUMMARY->

Among ravens without fails, ‘Metallic Sunburst raven Polly’ leads with 297 deliveries without a fail, followed by ‘Green Sheen raven Azul’ with 211.

3

#The family with most fails

# group the failed deliveries by family name
# take the family name with the largest number of failed deliveries
new_data1 %>% group_by(FAMILY_NAME) %>% summarise(total_fail = n()) %>% arrange(-total_fail) %>% head(1)

#The family with least fails

# group the failed deliveries by family name
# take the family name with the smallest number of failed deliveries
new_data1 %>% group_by(FAMILY_NAME) %>% summarise(total_fail = n()) %>% arrange(-total_fail) %>% tail(1)

SUMMARY->

Among the families, the ‘Targerian’ family has the most fails, nearly 8154000, and the ‘Baelish’ family has the fewest, nearly 2744406.
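The two queries above each group and sort the full table. A minimal base R sketch of the same idea with a single count table is shown below; the family names and rows are made up, and this is an alternative formulation rather than the code the team ran.

```r
# Made-up failed deliveries, one row per fail.
fails <- data.frame(
  FAMILY_NAME = c("Family X", "Family Y", "Family X",
                  "Family Z", "Family X", "Family Y")
)

# Count the fails per family once, then read both extremes from the same table.
per_family <- table(fails$FAMILY_NAME)
most  <- per_family[which.max(per_family)]
least <- per_family[which.min(per_family)]

print(most)
print(least)

# Each family's share of all fails, in percent.
print(round(100 * prop.table(per_family), 1))
```

The same pattern works for MEMBER_NAME by swapping the grouping column.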

4

#The family member with most fails

# group the failed deliveries by member name
# take the member name with the largest number of failed deliveries
new_data1 %>% group_by(MEMBER_NAME) %>% summarise(total_fail = n()) %>% arrange(-total_fail) %>% head(1)

#The family member with least fails

# group the failed deliveries by member name
# take the member name with the smallest number of failed deliveries
new_data1 %>% group_by(MEMBER_NAME) %>% summarise(total_fail = n()) %>% arrange(-total_fail) %>% tail(1)

SUMMARY->

Among family members, ‘Petyr Baelish’ has the most fails, 2744406, while ‘Euron’ has the fewest, 491454.

STEP 2

library(dplyr)      # %>%, select, group_by, summarise
library(tseries)    # adf.test
library(forecast)   # auto.arima, forecast, and the plot method for forecasts

# extract the date and the time of day from the DATETIME column
s = as.Date(new_data1$DATETIME, "%Y-%m-%d %H:%M")
new_data1$date = format(s, "%Y-%m-%d")
new_data1$time = format(strptime(new_data1$DATETIME, "%Y-%m-%d %H:%M"), '%H:%M')
# group by date to get the number of failed deliveries per day
set = new_data1 %>% select(date, RAVEN_NAME) %>% group_by(date) %>% summarise(count = n())
View(set)
# build a time series from the daily counts of the telenor sample data
test = ts(set$count, start = c(1), frequency = 10)
View(test)
# plot the series to check its trend
plot(test)
# Augmented Dickey-Fuller test for stationarity (a small p-value suggests a stationary series)
adf.test(test, alternative = "stationary", k = 12)
# auto.arima searches for a suitable ARIMA order and reports its coefficients and fit statistics
auto.arima(test)
# fit ARIMA(0,1,1)(0,1,0) with seasonal period 10 (the frequency set above);
# printing the fit shows the coefficient, its standard error and sigma^2 (the estimated innovation variance)
arimatest = arima(test, order = c(0, 1, 1), seasonal = c(0, 1, 0))
arimatest
# forecast the number of data delays for the next 4 days
arimafuture = forecast(arimatest, h = 4)
plot(x = arimafuture, shadebars = TRUE, type = 'b', ylab = 'no of delays', xlab = 'Days', col = 'blue', fcol = 'red')
#OUTPUT->
      Point Forecast      Lo 80      Hi 80      Lo 95      Hi 95
4.10        956875.8   925029.9   988721.7   908171.7  1005579.9
4.20        922865.8   878790.8   966940.8   855458.9   990272.7
4.30        902749.8   849167.6   956332.0   820802.9   984696.7
4.40        926953.8   865313.7   988593.9   832683.4  1021224.2

SUMMARY->

Using the ARIMA model, we predict that on 7 August 2018 the number of data delays will be 956875.8, with an 80% prediction interval of 925029.9 to 988721.7 and a 95% prediction interval of 908171.7 to 1005579.9.

On 8 August 2018, the predicted number of data delays is 922865.8, with an 80% interval of 878790.8 to 966940.8 and a 95% interval of 855458.9 to 990272.7.

On 9 August 2018, the predicted number of data delays is 902749.8, with an 80% interval of 849167.6 to 956332.0 and a 95% interval of 820802.9 to 984696.7.

On 10 August 2018, the predicted number of data delays is 926953.8, with an 80% interval of 865313.7 to 988593.9 and a 95% interval of 832683.4 to 1021224.2.
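The 80% and 95% bounds in the forecast output are related through the forecast standard error: each bound is the point forecast plus or minus a normal quantile times that standard error. The sketch below uses only the numbers printed in the output and recovers the 95% bounds from the 80% bounds. The normal-quantile relation is the standard way such intervals are built and is assumed here; it is not stated in the original analysis.

```r
# Values copied from the forecast output in this article.
point <- c(956875.8, 922865.8, 902749.8, 926953.8)
hi80  <- c(988721.7, 966940.8, 956332.0, 988593.9)

# Half-width of the 80% interval = qnorm(0.90) * standard error.
se <- (hi80 - point) / qnorm(0.90)

# The 95% interval uses qnorm(0.975) with the same standard error.
lo95 <- point - qnorm(0.975) * se
hi95 <- point + qnorm(0.975) * se

print(round(data.frame(point, se, lo95, hi95), 1))
```

The implied standard error grows from the first to the fourth day, which is why the intervals widen as the forecast horizon increases.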

DAYS VS NUMBER OF DELAYS

 

EVALUATION –

  • Time series analysis combined with exploratory data analysis provides a solution and predicts the future number of data drops.
  • APPROACH 1 – Exploratory data analysis shows how data drops depend on and relate to ravens, families and family members.
  • APPROACH 2 – Time series analysis predicts future data drops using ARIMA modelling.
  • The predictions of data drops appear satisfactory.
  • It was very challenging to analyse and process such a huge data set.
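One way to back up a claim that the predictions are satisfactory is a hold-out check: fit the model on all but the last four days, forecast those four days, and compare with what was observed. The sketch below is an assumption about how such a check could be done, not part of the original analysis. The daily counts are simulated, the model order mirrors the ARIMA(0,1,1)(0,1,0) fit above, and `mape` is a helper defined here for illustration.

```r
set.seed(42)
# Simulated daily fail counts for 31 days (illustration only, not Telenor data).
daily <- round(900000 + cumsum(rnorm(31, mean = 0, sd = 20000)))

h      <- 4                                      # days held out for testing
train  <- ts(daily[1:(31 - h)], frequency = 10)  # same frequency as in the article
actual <- daily[(31 - h + 1):31]

# Same model order as in the article, fitted on the training days only.
fit  <- arima(train, order = c(0, 1, 1), seasonal = c(0, 1, 0))
pred <- as.numeric(predict(fit, n.ahead = h)$pred)

# Naive baseline: repeat the last observed training value.
naive <- rep(daily[31 - h], h)

# Mean absolute percentage error (hypothetical helper for this sketch).
mape <- function(actual, predicted) {
  100 * mean(abs(actual - predicted) / abs(actual))
}

cat(sprintf("ARIMA MAPE: %.2f%%\n", mape(actual, pred)))
cat(sprintf("Naive MAPE: %.2f%%\n", mape(actual, naive)))
```

If the ARIMA error is not clearly below the naive baseline on the real data, the model adds little over simply carrying the last value forward.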


2 thoughts on “Datathon Telenor Solution – WRANGLING WITH DATA DROPS”

  1. Team, please describe the process: how you prepared the data and how you built and evaluated the model(s) for prediction. For any “scientific” article, and for anybody who has access to the data, the whole process should be repeatable, i.e. anyone should be able to take your code/work and get the same end results. Also, the focus of this case is “The main task is to predict the fails in the next four days (on both files).”, which is clearly stated in the case description, so please focus on adding this information to your article.