R: rvest, text2vec, Matrix, textcat, irlba, NMF
Facilitate topic identification for Knowledge Base articles
- The Knowledge Base consists of 34,646 HTML files with a mostly homogeneous structure (example below).
- The articles are highly domain-specific and contain many terms that are not present in standard language dictionaries.
- The articles are not all in the same language (~24,000 are in English); due to time constraints, our team decided to focus on those. The languages were identified with textcat.
- The HTML files were parsed with rvest, and the content within div elements with class="content" was extracted. The individual documents were created by concatenating the Details, Solution and Keywords subsections of each file.
- The raw contents were then tokenized and POS-tagged with openNLP; this was the most computationally expensive step.
- We tried different ways to obtain a reasonable vocabulary for the construction of a document-term matrix. The most effective strategy was to keep only tokens tagged NN, NNS, NNP or NNPS and filter out everything else. This was not very surprising, since nouns carry a substantial part of the information about a document's topic.
- The vocabulary was also pruned based on frequency metrics, e.g. terms that appear in very many or in very few documents were removed.
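The language-identification step can be sketched as follows; `docs` is a hypothetical two-element stand-in for the real article texts:

```r
# Keep only the articles that textcat identifies as English.
# A minimal sketch; `docs` stands in for the real corpus.
library(textcat)

docs <- c("The service fails to start after applying the update.",
          "Le service ne démarre pas après la mise à jour.")

langs        <- textcat(docs)            # n-gram based profiles, e.g. "english"
english_docs <- docs[langs == "english"]
```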
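The rvest parsing step might look like the sketch below. `minimal_html` stands in for `read_html` on a real KB file, and the subsection ids are assumptions, since the actual markup is not shown here:

```r
# Extract div.content and concatenate three subsections into one document.
library(rvest)

# Hypothetical markup mimicking the described article structure:
page <- minimal_html('
  <div class="content">
    <div id="details">The VPN client fails to connect.</div>
    <div id="solution">Reinstall the certificate.</div>
    <div id="keywords">vpn, certificate, timeout</div>
  </div>')

content  <- html_element(page, "div.content")
sections <- c("#details", "#solution", "#keywords")  # assumed ids

doc_text <- paste(
  vapply(sections,
         function(sel) html_text2(html_element(content, sel)),
         character(1)),
  collapse = "\n"
)
```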
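The POS-tagging and noun-filtering steps above can be sketched with openNLP as below (the Maxent annotators require the openNLPmodels data packages to be installed):

```r
# Tokenize and POS-tag one document with openNLP, then keep only nouns.
library(NLP)
library(openNLP)

text <- as.String("The VPN client crashes after the latest Windows update.")

ann <- annotate(text, list(Maxent_Sent_Token_Annotator(),
                           Maxent_Word_Token_Annotator()))
ann <- annotate(text, Maxent_POS_Tag_Annotator(), ann)

words  <- subset(ann, type == "word")
tags   <- sapply(words$features, `[[`, "POS")
tokens <- text[words]

# Keep only tokens with the noun tags that proved most useful.
nouns <- tokens[tags %in% c("NN", "NNS", "NNP", "NNPS")]
```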
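With per-document noun lists in hand, the vocabulary construction and frequency-based pruning can be sketched with text2vec; the toy corpus and the threshold values are illustrative, not the ones actually used:

```r
# Build and prune a vocabulary, then create the document-term matrix.
library(text2vec)

# `noun_tokens` stands in for the real per-document noun lists.
noun_tokens <- list(c("vpn", "client", "update"),
                    c("vpn", "certificate", "error"),
                    c("update", "kernel", "panic"))

it    <- itoken(noun_tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab,
                          term_count_min     = 2,    # drop very rare terms
                          doc_proportion_max = 0.9)  # drop near-ubiquitous terms

dtm <- create_dtm(itoken(noun_tokens, progressbar = FALSE),
                  vocab_vectorizer(vocab))
```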
Matrix factorisation approaches were chosen to combat the curse of dimensionality that distance-based clustering techniques suffer from. As a side effect, the reduced-dimensionality representation produced by matrix factorisation also elucidates the topics.
To obtain a suitable matrix for factorisation, we converted the values of the document-term matrix to tf-idf scores before applying two different factorisation techniques, as listed below:
- SVD (Singular Value Decomposition)
- NMF (Non-negative Matrix Factorisation)
- Despite its dimensionality-reduction benefits, SVD still poses a challenge for interpretation. The term-topic matrix contains both positive and negative values, in which the positive values “promote” a certain term in a topic and the negative values “demote” it. While the first and second topics could be explained rather easily this way, the remaining topics were harder to handle, which led us to the second approach, NMF.
- Results are clearer than with SVD, as the term weights are now all non-negative: the higher the weight, the more strongly a term “promotes” a certain topic. In the screenshot below, it is evident that topic 1 is about VMware Fusion, and the other is about exposures and vulnerabilities.
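The two factorisations might be sketched as follows. The random `dtm` and the number of topics `k` are illustrative stand-ins for the real pruned document-term matrix and the rank actually used; note that the `nmf()` call densifies the matrix, which is feasible only after aggressive vocabulary pruning:

```r
# tf-idf weighting followed by truncated SVD (irlba) and NMF.
library(text2vec)
library(irlba)
library(NMF)

set.seed(1)
# `dtm` stands in for the real pruned document-term matrix.
dtm <- Matrix::rsparsematrix(100, 50, density = 0.3,
                             rand.x = function(n) runif(n))

dtm_tfidf <- fit_transform(dtm, TfIdf$new())

k <- 5  # illustrative number of topics

# SVD: svd_fit$v is the term-topic matrix, with mixed-sign weights
# ("promote"/"demote"), which is what makes it hard to read.
svd_fit <- irlba(dtm_tfidf, nv = k)

# NMF: strictly non-negative factors, so topics can be read off directly.
nmf_fit <- nmf(as.matrix(dtm_tfidf), rank = k)
H <- coef(nmf_fit)  # topic-term matrix, one row per topic

# Indices of the top-10 terms for each topic:
top_terms <- apply(H, 1, function(w) order(w, decreasing = TRUE)[1:10])
```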