There is great potential in unlocking hidden insights in linked data through investigation of social structures as businesses are interested to gain insight into the relationships customers have with other people in their network. Identifying communities and influencers within these networks complements and enhances the currently existing BI solutions in risk management, fraud investigation, churn mitigation, new customers acquisition, and viral marketing. The current article is focused on solving two main problems for the Telenor case:
- Identifying leaders in the community
- Calculating degree of association with a given community
The original dataset contains 1,452,747 observations. It contains the following variables
- Subscriber A (caller party ID)
- Subscriber B (called party ID)
- Label (Strength of connection – Low, Medium, or High)
- Real Event Flag (0 and 1).
There are 176, 337 exclusions applied to the original dataset, for which the Real Event Flag is 0, as they are considered “fake” calls.
For modelling purposes the Label variable was assigned numerical values as follows:
- High = 1.0
- Medium = 0.5
- Low = 0.25
To solve this task we employed the following methods:
- We calculated two graph ranking algorithms:
- Hubs & Authorities
- We calculated node embeddings for the graph by training a SkipGram model
- We trained a neural network using the embeddings from 2. to predict association of a given node with a given group.
1. Link Analysis/Graph Ranking
We employed the PageRank and Hubs and Authorities algorithms to find the important nodes in the graph. They both are iterative algorithms which start by assigning equal scores to all nodes in the graph. These scores are then iteratively redistributed based on the link structure of the graph. While PageRank provides a single metric for every node, the Hubs & Authorities algorithm provides two scores – the hub score and the authority score. The hub score is a measure of the quality of the outgoing links – do they point at good authorities – while the authority score is a measure of the quality of the incoming links. We developed both weighted and unweighted variants of the algorithms.
Among the top 2000 users from the PageRank algorithm, 1725 are leaders according to the golden dataset. For more information, please refer to the Implementation part before.
2. Graph Node Embeddings
We investigated an alternative approach inspired by the recent advances in NLP related to word embeddings. We use the SkipGram Model to train node embeddings We define a neighbourhood as the set of a given node and all of its neighbours. We supply these neighbourhoods as context to the SkipGram model. To make up for the word order assumption in the skipgram model we repeat and reshuffle longer neighbourhoods.
The intuition behind the idea is as follows: When applied to words, the skipgram model tries to predict co-occurring words. It is only natural that when applied to a graph it should try to predict nodes which co-occur in a lot of neighbourhoods. The following are the projections of the embeddings in 2-d space for each group:
These embeddings were then fed into a simple softmax classifier. We argue that they have some predictive power as they achieve a macro-f1 score of 0.6667 and the following confusion matrix:
[[111902 2 7 7 56 3 36] [ 7 12 8 0 0 0 1] [ 33 2 69 5 6 3 72] [ 17 0 3 77 2 0 11] [ 326 0 0 0 461 1 14] [ 36 0 0 0 2 85 17] [ 250 1 35 19 26 10 378]]
The scores output from this model can be interpreted as *association* to the given group and used for various research tasks.
We developed a simple, yet cool UI to explore the graph manually. The UI consists of a single input field where the user supplies the START_NODE of the graph. The system then displays the START_NODE and all of its‘ outgoing links. The user can then proceed to click on the neighbouring nodes to expand the graph in that direction. Nodes from G5 are colored red. Check out the video below:
Супер сте, мачкайте!