Telenor wants to identify Social Network leaders from a list of A and B nodes, their connection counts and connection strengths. A second goal is to characterize a node’s possibility of turning “bad” from its relationship to a list of truly bad nodes.
We have 118 690 unique nodes with 1 452 755 connections between them.
Some nodes exhibit a number of outgoing connections as high as 2 460, which signifies that we can have call centers and taxi services in the data. This is particularly important as they can form “rank sinks” during the implementation of an algorithm similar to pagerank.
Dealing with 0/1 call flag and NULLs
We have omitted the zero-calls and the rows with NULL values from the dataset.
Approaches to the Low-Medium-High connection strengths
One approach is to reserve part of the data as the test set. Train the remaining portion of the data using log of the contribution metric, which is the percentage of the connection compared to the entire network. We expect these percentages to be very small – log on it will make the numbers easier to work with. Then we can use cross validation to fit the best low, medium and high values (on scale of 1), and test the model on the remaining test set.
Joint data preparation
Our team has agreed to produce in parallel a number of datasets with unique node IDs in column1 and different algorithm outputs in the other columns. Since there are different algorithms for information on the network, node pairs, ranking the nodes or clustering the nodes, but we are investigating the nodes in their differences, we have agreed to mark with R_ columns that represent ranking of nodes and with C_ clustering algorithm outputs.
The following Jupyter notebook produces R_core, R_triangles and R_cluster: 1Datathon2018.ipynb
Other variables were produces with igraph in R.
And the following set of notebooks produce one of the data sets:
We have used the combined table from the data prep step to identify bad nodes and leaders by different approaches.
Using igraph package in R several statistics were calculated. We had 118k observations and only 140 were labeled as bad, others were unclassified. So new set was generated 600 randomly selected from unknown. The new set was split to training and test. With this proportion it’s inevitable to have bads in unclassified set labeled as goods, but the high proportion of bads will suppress their influence.
Logistic regression was performed. After correlation analysis using these variables were selected: R_degree_out, R_degree_in, R_centr_eigen_no_names, R_coreness_in, R_page_rank + R_components_strong and 0.17 threshold.
Using this coef. 5 bads have misclassification and other 4 unknown are labeled as bad. The accuracy is 92.62%.
Description of predictors:
R_degree_out and R_degree_in – are the degrees of a vertex in and out.
R_centr_eigen_no_names – eigenvector centralities of positions
R_coreness_in – K-in-cores decomposition of graphs
R_page_rank – page rank algorithm
R_components_strong – Calculate the maximal strongly connected components of a graph
Another approach was used for finding leaders by an augmented pagerank algorithm. Unfortunately the process for the whole dataset is not complete yet.
LeaderRank  (LR) is a modified PageRank algorithm that aiming for identifying influential users. It is able to find leaders who result in quick opinion spreading. In this project, we use LeaderRank to rank the nodes in Network X and the leader is defined as nodes with high LeaderRank Score.
The algorithm is implemented in Python 2.7. Input is Edge List with nodes renamed from 0 to N. (N is total number of unique nodes. Here is 118,691). For pair of nodes that share more than 1 interaction, we only consider one. The output of the algorithm the ranking score for each node. We use a list to store the mapping from NodeID (0,1,…,N) to NodeName (0xFC…). The result is show below …
One approach we have used for defining a “leader” is to define rules for the R_cores and look into each core’s R_triangles to identify the leaders. Gephi demonstration (please, note that the largest part of the dataset is spread around the first graph and on it we see only the cores).
. Lü, L., Zhang, Y. C., Yeung, C. H., & Zhou, T. (2011). Leaders in social networks, the delicious case. PloS one, 6(6), e21202
. Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab.