VMware Case – Document Clustering
Parsing the html files:
- Only files tagged as English articles are kept by the parsing module.
- Option 1 – Beautiful Soup script which extracts the text from inside the tags of all of the HTML documents. It splits the HTML into paragraphs, based on the tags. Each HTML file is converted into a JSON file containing only the relevant sections (Symptoms, Details, etc.).
Option 1 was used in the end, since it preserved some of the document’s structure (distinct sections for titles, document IDs, solution descriptions etc).
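As a rough illustration of this kind of parser, here is a minimal sketch using Beautiful Soup. The real KB markup is not shown in the article, so everything structural here is an assumption: that the language is declared in the `lang` attribute of `<html>`, that section names are `<h2>` headings, and that section text sits in the `<p>` tags that follow. `SAMPLE_HTML` and `parse_article` are hypothetical names, not the authors' code.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical sample page; the real article markup may look quite different.
SAMPLE_HTML = """
<html lang="en"><head><title>VM fails to power on</title></head>
<body>
  <h2>Symptoms</h2><p>The virtual machine does not start.</p>
  <h2>Details</h2><p>The lock file is stale.</p><p>Remove it and retry.</p>
</body></html>
"""


def parse_article(html):
    """Return a dict of section name -> text, or None for non-English pages."""
    soup = BeautifulSoup(html, "html.parser")
    root = soup.find("html")
    # Assumption: the language tag lives in <html lang="...">.
    if root is None or not root.get("lang", "").lower().startswith("en"):
        return None
    sections = {"title": soup.title.get_text(strip=True) if soup.title else ""}
    for heading in soup.find_all("h2"):
        paragraphs = []
        # Collect the <p> siblings after this heading, up to the next heading.
        for sibling in heading.find_next_siblings(["h2", "p"]):
            if sibling.name == "h2":
                break
            paragraphs.append(sibling.get_text(" ", strip=True))
        sections[heading.get_text(strip=True).lower()] = " ".join(paragraphs)
    return sections


article = parse_article(SAMPLE_HTML)
print(json.dumps(article, indent=2))
```

Keeping each section under its own key is what later allows different parts of a document to be weighted differently.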
Text Transformation Module (TTM):
- Converts every word to lower case and removes extra whitespace, punctuation and specific symbols.
- Removes English stop words.
- Lemmatization – strips the words to their base forms (the lemmas).
- Applies weights to specific tags, paragraphs and keywords in the corpus (titles have more weight than other parts of the document for example).
TF-IDF matrix:
- TF-IDF is short for Term Frequency–Inverse Document Frequency.
- The matrix tracks how often each lemma appears in each document, weighted down for lemmas that are common across the whole corpus.
- The TF-IDF matrix is built from the output of the TTM.
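The transformation steps and the TF-IDF weighting described above can be sketched together in plain Python. This is only an illustration under stated assumptions: the stop-word list and lemma table are tiny stand-ins for real resources, section weighting is done by repeating a section's tokens (one simple way to do it, not necessarily the authors'), and the TF-IDF variant is the unsmoothed `tf * log(N / df)`. All names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative resources; a real pipeline would use a full stop-word list
# and a proper lemmatizer (both are assumptions, not the authors' choices).
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on", "not"}
LEMMAS = {"machines": "machine", "fails": "fail", "failed": "fail"}
# Assumed weighting scheme: tokens of a section are repeated `weight` times.
SECTION_WEIGHTS = {"title": 5}


def normalize(text):
    """Lower-case, drop punctuation and symbols, remove stop words, lemmatize."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [LEMMAS.get(tok, tok) for tok in tokens if tok not in STOP_WORDS]


def transform(article):
    """Turn a {section: text} dict into a single weighted token list."""
    tokens = []
    for section, text in article.items():
        tokens.extend(normalize(text) * SECTION_WEIGHTS.get(section, 1))
    return tokens


def tf_idf(docs):
    """TF-IDF per document, with tf = count / length and idf = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each lemma occurs.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in Counter(doc).items()
        }
        for doc in docs
    ]


articles = [
    {"title": "Virtual machines failed to power on!", "details": "The lock file is stale."},
    {"title": "Snapshot fails", "details": "The disk is full."},
]
docs = [transform(article) for article in articles]
rows = tf_idf(docs)
print(rows[0])
```

In this toy corpus the lemma "fail" occurs in both documents, so its weight drops to zero, while title words outweigh body words because of the repetition.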
We applied Latent Dirichlet Allocation (LDA) to model the data. It is an unsupervised, soft-clustering algorithm. The model assumes that documents are just collections of words: syntax and semantics aren’t taken into account at the model level, only word counts. LDA tries to group words that often occur together.
LDA algorithm steps:
- Assign each word in the document to a topic randomly.
- For each word in the document:
  - Assume that its topic assignment might be wrong, but all other assignments are right.
  - Reassign the topic for the current word, based on the other topic assignments.
- Keep going until the topic assignments for all the words in the documents are stable.
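The steps above read like collapsed Gibbs sampling, one common way to fit LDA. The sketch below follows that reading; it is an assumption, since the article does not name the inference method and libraries often use variational inference instead. For simplicity it runs a fixed number of sweeps rather than testing for stability, and the function name and the `alpha` and `beta` defaults are illustrative.

```python
import random


def gibbs_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # Step 1: assign every word occurrence to a random topic.
    assign = []
    for d, doc in enumerate(docs):
        assign.append([])
        for w in doc:
            z = rng.randrange(n_topics)
            assign[d].append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] = topic_word[z].get(w, 0) + 1
            topic_total[z] += 1
    # Steps 2 and 3: resample each assignment, treating all others as correct.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Take the current word out of the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight of topic k: (share of k in this document)
                # times (share of this word within topic k).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] = topic_word[z].get(w, 0) + 1
                topic_total[z] += 1
    return doc_topic, topic_word


toy_docs = [["disk", "snapshot", "disk"], ["network", "adapter"], ["disk", "network"]]
doc_topic, topic_word = gibbs_lda(toy_docs, n_topics=2, n_iter=50)
print(doc_topic)
```

Each row of `doc_topic` counts how many words of a document are currently assigned to each topic, which is what makes the clustering soft: a document can belong to several topics at once.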
Parameters we used for tuning the model:
- n_samples = None # number of articles to process – None for all
- n_features = 500 # number of words to scan in each document
- n_components = 5 # the number of topics/clusters to create
- n_top_words = 10 # for printing the top-most words that define a given topic
Results – Visualization and Evaluation
Interactive graphical visualization of the results was implemented, for a more intuitive and human-friendly way to show the word distributions among the topics/clusters. Also, topic distribution for each document can be viewed in tabular form. The results file is attached to the article, along with a short video explanation of the case. This procedure is reproducible.
The visualization of the word-per-topic distributions was used for a short exploratory data analysis (EDA). We obtained better-defined clusters by identifying words that should have been excluded but were not removed by the TTM. These are words that appear frequently in many topics but are not particularly relevant to any one of them, such as “VMware”, “thanks”, “sign-on”, “machine”, “virtual” and “manager”.
Possible steps for improving the LDA model:
- Automating the removal of words that are prevalent in many topics. This was done manually as a result of the EDA, but it could easily be automated for any number of overlapping topics.
- Introducing n-grams into the model.
- Expanding the model to support other languages.
- Applying weights to the TF-IDF matrix, to make LDA semi-supervised. Essentially, this means trying to make the model create a well-defined cluster for a specific group of words.
- Comparing the LDA solution to a Fuzzy Clustering model.
- Fitting other models on individual clusters produced by the primary model.
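The first item in the list above, automating the removal of words shared by many topics, could be sketched as follows. The threshold `min_topics`, the function name and the sample topics are all hypothetical; the idea is simply to flag any word that shows up among the top words of several topics and feed it back into the stop-word list.

```python
def words_shared_across_topics(top_words_per_topic, min_topics=3):
    """Return words found among the top words of at least `min_topics` topics."""
    counts = {}
    for top_words in top_words_per_topic:
        for word in set(top_words):
            counts[word] = counts.get(word, 0) + 1
    return sorted(word for word, n in counts.items() if n >= min_topics)


# Hypothetical top-word lists for four topics.
topics = [
    ["vmware", "snapshot", "disk", "virtual"],
    ["vmware", "network", "adapter", "virtual"],
    ["vmware", "license", "key", "manager"],
    ["virtual", "machine", "power", "vmware"],
]
extra_stop_words = words_shared_across_topics(topics, min_topics=3)
print(extra_stop_words)  # ['virtual', 'vmware']
```

The flagged words would then be added to the stop words before refitting the model, repeating until the topics stop sharing their top words.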
Platform – Microsoft
This is a reasonable solution: it is based on LDA, which is a state-of-the-art clustering algorithm. The preprocessing makes sense, as do the implementation and the planned future work. There is also visualization and some analysis, which is nice to see.
My worry is about the number of clusters: why 5 clusters? How was this number selected? Is it too few, given that there are so many articles?
Regarding the topics, we did suspect that the actual number would be higher, but we somehow got the impression that 5 topics is what we were expected to look for in this case. Perhaps we misunderstood the requirements.
Getting the model running took about a day and it left us very little time to play with the data and get to know it. To be sure, the model needs refining, but we were more focused on getting the thing to work from start to finish.
Thanks for the comment. Yes, 5 clusters is too few. We also tried 15 and 20. We planned to run a hyperparameter search over the number of features, the number of clusters and the choice of article segments (the articles are split into paragraphs: title, symptoms, details, products, etc., and the text of each paragraph is given a different importance during preprocessing, i.e. lemmatization and TF-IDF vectorization). We could not experiment much, since it took us too much time to build the structure of the pipeline with its different steps. We created a video presenting the solution in 2 minutes; unfortunately, we may have failed to save it with the article:
Not sure if the name “Critical Outliers” was chosen before seeing the data. I’ll say that this problem definitely needed that kind of critical thinking from the teams.
It is definitely to the team’s credit that they succeeded in building an end-to-end working system for a quite complex but very practical industry problem in the NLP space.
The approach of using LDA is quite classical and is known to work on a variety of data, though the challenge is always to find the optimal model parameters to get it working perfectly. It is very important to choose the model parameters so that the features don’t overlap significantly in the topic space. Remember the elbow rule in the learning curve?
In conclusion, it is nice to see a working system, but it would have been impressive to see an implementation of hierarchical models such as agglomerative clustering.
Best of luck,
Thanks for the comment and the ideas. We wanted to experiment with clustering too. We also started to consider fuzzy clustering and added it to the future ideas in the article, since we didn’t have time to finish experimenting with it. We extracted about 20 000 KBs in English, so I am not sure hierarchical clustering would work with such a high number of documents. That is why we were thinking about some hybrid clustering (fuzzy K-means + hierarchical). Hierarchical agglomerative clustering gives good results when we have a smaller number of elements and are looking for many small clusters, whereas K-means-like clustering algorithms work well when we want a small number of large clusters. We are going to experiment with this next weekend for fun. Thanks for the ideas.
You could include some results in the paper.
LDA is a good algorithm for finding topics, but it requires the number of topics to be specified. Finding the right number of topics can be tricky, since it requires some domain knowledge and a deep understanding of the data.
Number of topics is too small.
It could be helpful to segment articles based on the metadata (product, version) and find topics (similarities) in each group.
The number of topics is a parameter in the run.py script (sources_python.zip attachment), where it can be chosen arbitrarily. The output is a dataframe (csv file) showing the distribution of article IDs over topics. The results for 5 topics are in the attached file results_document_to_5topics.py. There is also a segmentation into the different paragraphs: title, symptoms, details, products, etc. We created a parser (Beautiful Soup), parse_htmls.py (sources_python.zip attachment), which parses the raw html articles into structured json files with separate paragraphs. You can run parse_htmls.py with the parameters htms_raw_directory and target_directory, the latter being where the json files will be written. After that, we take into account the importance of each paragraph before lemmatisation and tf-idf vectorization. For example, the title, products and keywords are taken to be 5 times more important than other paragraphs (like details, purpose).
This was a short video describing the solution of the problem in 2 minutes. It should be included in the article; it is strange that I don’t see it there. Maybe we failed to save something.
We have c
Using LDA to solve this problem would have been ideal if one knew the number of topics, since the number of topics is a required input to the model. In my opinion, the best way to get the number of topics would have been to use some segmentation or agglomerative clustering. Once we have the number of topics, giving it as input to the LDA model would have given better results. Just limiting yourself to 5 topics doesn’t give the desired result. While I will give you a thumbs-up on the chosen algorithm, I think some more thought was needed on getting the optimum input parameters (number of topics).
Actually, our next step would be some hybrid clustering (combining e.g. agglomerative with top-down) to adaptively extract the optimal number of topics. The English KBs were too many (about 20 K, I think) for pure agglomerative clustering, which gives the best results for a high number of smaller clusters. Thanks for the ideas.