The business understanding of this task was limited because the data domain is confidential. The data received was a social graph represented as a list of edges. Some extra domain knowledge about the structure of the data was provided, summarised below.
The data can be considered a graph of calls between users. It is structured and has 3 distinct parts: 2 membranes and 1 core. A visual representation follows:
import pandas as pd
%matplotlib inline
import networkx
telenor_links = pd.read_csv('./Datathon_2018_Dataset_Hashbyte_New.csv', delimiter=';')
The graph is represented as a list of edges. Each edge has a start point and an end point, a flag showing the "importance" of the edge, and a flag Real_Event_Flag showing whether the event is real or accidental.
Note: there is a typo in one of the dataset's column names (Subsciber_B). It's super annoying but was never fixed, because the pain of handling it was the only thing keeping us awake most of the time.
Number of unique nodes in the graph:
Number of edges in the graph:
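The two counts above can be computed directly from the edge list. The following is a minimal sketch on a made-up toy edge list, not the original notebook cell; it assumes the endpoint columns are named Subscriber_A and Subsciber_B, as in the code used elsewhere in this document.

```python
import pandas as pd

# Toy edge list standing in for the real dataset (values are made up).
edges = pd.DataFrame({
    "Subscriber_A": ["a", "a", "b", "c"],
    "Subsciber_B": ["b", "c", "c", "a"],
})

# A node is any identifier that appears as either endpoint of an edge.
nodes = set(edges["Subscriber_A"]) | set(edges["Subsciber_B"])
n_nodes = len(nodes)

# Every row is one edge, so the edge count is the row count.
n_edges = len(edges)

print("unique nodes:", n_nodes)
print("edges:", n_edges)
```

On the real data the same two expressions would be applied to telenor_links instead of the toy frame.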
This section summarises the data preparation process. Only the parts that most influenced the later approaches are described.
Eight edges in the dataset have no "from" field. Because they are a negligible share of the very large number of edges, these inconsistent edges were removed.
Based on the mentors' information that the graph can be considered a network of phone calls, the following statistics were identified as important:
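Removing such edges could look like the sketch below. It is an illustration on toy data and assumes that the "from" field corresponds to the Subscriber_A column; the original removal code is not shown in this document.

```python
import pandas as pd

# Toy edge list; the None value mimics an edge with a missing "from" field.
edges = pd.DataFrame({
    "Subscriber_A": ["a", None, "b"],
    "Subsciber_B": ["b", "c", "a"],
})

# Count the edges whose source is missing before dropping them.
n_missing = int(edges["Subscriber_A"].isna().sum())

# Keep only the edges that have a source node.
cleaned = edges.dropna(subset=["Subscriber_A"]).reset_index(drop=True)

print("edges without a source:", n_missing)
print("edges kept:", len(cleaned))
```

Counting the missing values first makes it easy to confirm that only a handful of rows are discarded.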
def callers_and_receivers_stats(callers, recievers, index_name):
    # Unique node ids seen as a call source and as a call target
    firsts = set(callers.value_counts().index)
    seconds = set(recievers.value_counts().index)
    results = pd.DataFrame(index=[index_name])
    results['unique_callers'] = len(firsts)
    results['unique_receivers'] = len(seconds)
    results['callers_and_recievers'] = len(firsts & seconds)
    results['callers_only'] = len(firsts - seconds)
    results['receivers_only'] = len(seconds - firsts)
    return results

def get_callers_only(df=telenor_links):
    # Edges whose source node never appears as a target
    firsts = set(df['Subscriber_A'].value_counts().index)
    seconds = set(df['Subsciber_B'].value_counts().index)
    return df[df['Subscriber_A'].isin(list(firsts - seconds))]

def get_recievers_only(df=telenor_links):
    # Edges whose target node never appears as a source
    firsts = set(df['Subscriber_A'].value_counts().index)
    seconds = set(df['Subsciber_B'].value_counts().index)
    return df[df['Subsciber_B'].isin(list(seconds - firsts))]
callers_only = get_callers_only()
recievers_only = get_recievers_only()
callers_and_receivers_stats(telenor_links['Subscriber_A'], telenor_links['Subsciber_B'], 'Count in all data')
| |unique_callers|unique_receivers|callers_and_recievers|callers_only|receivers_only|
|---|---|---|---|---|---|
|Count in all data|54594|99808|35712|18882|64096|

Further investigation based on these statistics led to significant conclusions about the structure of the graph:
- The callers_only group (nodes that have only outgoing edges) contains 18882 nodes (about 16% of all nodes), but these nodes have called only 2237 nodes (under 2% of all nodes).
- The receivers_only group (nodes that have only incoming edges) contains 64096 nodes (54% of all nodes), but these receivers were called by only 2210 nodes (under 2% of all nodes).
- There is no direct call from the callers_only group to the receivers_only group (how this was determined will be shown later in this document).
- The callers_and_recievers group contains 35712 nodes (30% of all nodes) and 1011619 edges (69% of all edges).
- A strong correlation between the numbers of incoming and outgoing calls was found (0.8980025480971654).
This suggests that most people made roughly as many calls as they received: nodes with many outgoing edges also tend to have many incoming edges. A brief literature search shows that some experts consider such behaviour normal in communication social networks (see, for example, https://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/viewFile/2812/3224).
Visual confirmation is given by the scatter plot of outgoing against incoming edge counts per node.
# Outgoing edges: number of rows per source node
outgoing_edges = telenor_links.groupby('Subscriber_A').count()['Label']
outgoing_edges = outgoing_edges.rename('Outgoing')

# Incoming edges: number of rows per target node
ingoing_edges = telenor_links.groupby('Subsciber_B').count()['Label']
ingoing_edges = ingoing_edges.rename('Ingoing')

# Align both counts per node; nodes missing on one side get 0
out_vs_in_edges = pd.concat([outgoing_edges, ingoing_edges], axis=1).fillna(0)[['Outgoing', 'Ingoing']]
out_vs_in_edges['Outgoing'].corr(out_vs_in_edges['Ingoing'])
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
from plotly import graph_objs as go

trace = go.Scatter(
    x = out_vs_in_edges['Outgoing'],
    y = out_vs_in_edges['Ingoing'],
    mode = 'markers'
)
data = [trace]

## Plot and embed in ipython notebook!
iplot(data, filename='basic-scatter')
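Returning to the observation that no node with only outgoing edges calls a node with only incoming edges directly: one possible way to check this is to filter the edge list for rows whose source is in the first group and whose target is in the second. The sketch below uses made-up toy data and is an assumption about how such a check could be written, not necessarily the method used by the authors.

```python
import pandas as pd

# Toy edge list (made up): "x" only calls, "y" is only called,
# "a" and "b" both call and are called.
edges = pd.DataFrame({
    "Subscriber_A": ["x", "a", "b", "b", "a"],
    "Subsciber_B": ["a", "b", "a", "y", "b"],
})

callers = set(edges["Subscriber_A"])
receivers = set(edges["Subsciber_B"])
callers_only_nodes = callers - receivers      # nodes with outgoing edges only
receivers_only_nodes = receivers - callers    # nodes with incoming edges only

# Edges that go straight from a callers-only node to a receivers-only node.
direct = edges[
    edges["Subscriber_A"].isin(list(callers_only_nodes))
    & edges["Subsciber_B"].isin(list(receivers_only_nodes))
]
n_direct = len(direct)

print("direct callers-only to receivers-only edges:", n_direct)
```

If the count is zero on the real data, every path from the callers-only group to the receivers-only group has to pass through at least one node that both calls and is called.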