
Case_VMWare TEAM anteater



R: rvest, text2vec, Matrix, textcat, irlba, NNMF

Business Understanding

Facilitate topic identification for Knowledge Base articles

Data Understanding

  • The Knowledge Base consists of 34,646 HTML files with a mostly homogeneous structure (example below).

  • The articles are highly domain-specific and contain many terms that are not present in standard language dictionaries.
  • The articles are not all in the same language (~24,000 are in English). Due to time constraints, our team decided to focus on the English ones. The languages were identified using textcat.

Data Preparation

  • The HTML files were parsed with rvest, and the content within div elements with class="content" was extracted. The individual documents were created by concatenating the Details, Solution and Keywords subsections of each HTML file.
  • The raw contents were then tokenized and POS-tagged using openNLP. This was the most computationally expensive task.
  • We tried different ways to obtain a reasonable vocabulary for the construction of a document-term matrix. The most effective strategy was keeping only tokens tagged NN, NNS, NNP or NNPS, i.e. nouns. This was not very surprising, since nouns carry a substantial part of the information about a topic.
  • The vocabulary was also pruned based on frequency metrics, e.g. terms that appear in very many or very few documents were removed.
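
The noun filter and the frequency-based pruning described above can be sketched as follows. This is a minimal Python illustration and not the team's R code: the helper names `noun_tokens` and `prune_vocabulary`, the hand-tagged toy documents and the 0.3 / 0.9 thresholds are all assumptions made for the example.

```python
from collections import Counter

# Penn Treebank noun tags named in the write-up.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def noun_tokens(tagged_doc):
    """Keep the lowercased tokens whose POS tag is a noun tag."""
    return [tok.lower() for tok, tag in tagged_doc if tag in NOUN_TAGS]


def prune_vocabulary(docs, min_doc_prop=0.0, max_doc_prop=1.0):
    """Return the sorted terms whose share of documents lies within the bounds."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term at most once per document
    return sorted(
        term for term, df in doc_freq.items()
        if min_doc_prop <= df / n_docs <= max_doc_prop
    )


# Toy, hand-tagged input (hypothetical; the real tags came from openNLP).
tagged_docs = [
    [("Fusion", "NNP"), ("fails", "VBZ"), ("on", "IN"), ("host", "NN")],
    [("Restart", "VB"), ("the", "DT"), ("host", "NN"), ("service", "NN")],
    [("vCenter", "NNP"), ("host", "NN"), ("service", "NN"), ("crashes", "VBZ")],
    [("Update", "VB"), ("the", "DT"), ("host", "NN"), ("license", "NN")],
]

docs = [noun_tokens(d) for d in tagged_docs]
# Thresholds are illustrative assumptions, not the values the team used.
vocab = prune_vocabulary(docs, min_doc_prop=0.3, max_doc_prop=0.9)
print(vocab)
```

In this toy corpus "host" occurs in every document and is dropped as too common, while terms that occur in a single document are dropped as too rare.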


We considered matrix factorisation approaches to counter the curse of dimensionality, which affects distance-based clustering techniques. The results of matrix factorisation also help elucidate topics once the dimensionality of the data has been reduced.

To get a suitable matrix for factorisation, we converted the values of the document-term matrix to tf-idf scores before applying two different factorisation techniques, as listed below:

  1. SVD (Singular Value Decomposition)
  2. NMF (Non-negative Matrix Factorisation)
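
The tf-idf conversion mentioned above can be illustrated with a small sketch. It uses the plain textbook weighting (relative term frequency times log(N / document frequency)); the write-up does not say which tf-idf variant or normalisation the team applied, so the formula, the function name `tf_idf` and the toy counts are assumptions.

```python
import math


def tf_idf(dtm):
    """Convert raw counts (rows = documents, columns = terms) to tf-idf scores."""
    n_docs = len(dtm)
    n_terms = len(dtm[0])
    # Document frequency: in how many documents does each term occur?
    doc_freq = [sum(1 for row in dtm if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df) if df else 0.0 for df in doc_freq]
    weighted = []
    for row in dtm:
        total = sum(row)  # document length, used to normalise term frequency
        weighted.append(
            [(count / total) * idf[j] if total else 0.0
             for j, count in enumerate(row)]
        )
    return weighted


# Toy document-term matrix; columns are the terms "fusion", "host", "patch".
dtm = [
    [2, 1, 0],
    [0, 1, 3],
    [0, 2, 0],
]
scores = tf_idf(dtm)
for row in scores:
    print([round(x, 3) for x in row])
```

A term that occurs in every document, such as "host" here, gets an idf of zero, so it no longer contributes to any document vector.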


  1. SVD
    • Despite its dimensionality-reduction benefits, the results of SVD remain hard to interpret. The term-topic matrix contains both positive and negative values: positive values “promote” a term in a topic, and negative values “demote” it. While the first and second topics could be explained fairly easily in this way, the remaining topics were harder to interpret, which led us to the second approach, NMF.
  2. NMF
    • The results are clearer than with SVD, because the term weights are now all non-negative; the higher the weight, the more a term “promotes” a topic. In the screenshot below, it is evident that topic 1 is about VMWare Fusion and the other topic is about exposures and vulnerabilities.
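
To show why non-negative weights are easier to read, here is a minimal NMF sketch using the classic multiplicative update rules. The write-up does not state which algorithm or settings the team's R package used, so the update scheme, the function names, the toy matrix and the choice of two topics are assumptions for illustration only.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factorise non-negative V (docs x terms) into W (docs x k) and H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start, so the multiplicative updates stay non-negative.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


# Toy weighted document-term matrix (hypothetical values and terms).
terms = ["fusion", "mac", "cve", "patch"]
v = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 3],
]
w, h = nmf(v, k=2)
for topic, weights in enumerate(h, start=1):
    ranked = sorted(zip(terms, weights), key=lambda pair: -pair[1])
    print(f"topic {topic}:", [term for term, _ in ranked[:2]])
```

Each row of `h` is a topic, and because all of its weights are non-negative, sorting the row by weight gives the terms that characterise that topic directly, without the sign bookkeeping that SVD requires.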



One thought on “Case_VMWare TEAM anteater”


    The VMWare use case is a very well-defined real-life problem. Congratulations to TEAM anteater for competing in this use case.
    I’m impressed that the team tried two different approaches, SVD and NMF. As in most NLP problems, there were some strong signals that appeared in multiple topics, e.g. “DaaS” appears in both topic 1 and topic 2. Ideally, in such cases, one should optimize the parameters of the algorithms so that the features don’t overlap across topics. Another approach is to try hierarchical models such as agglomerative clustering.
    Overall, I think the team did a good job of completing the task in a short time; however, I suggest that the choice of algorithm be based on the outcome of the baseline results.
    Best of luck,
