VMware Case – Document Clustering
Parsing the HTML files:
- Only files tagged as English articles are kept by the parsing module.
- Option 1 – a Beautiful Soup script that extracts the text from inside the tags of all the HTML documents. It splits each document into paragraphs based on the tags and converts each HTML file into a JSON file containing only the relevant sections (Symptoms, Details, etc.).
Option 1 was used in the end, since it preserved some of the documents' structure (distinct sections for titles, document IDs, solution descriptions, etc.).
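A minimal sketch of what the Option 1 parser could look like; the directory names, tag names, and section list below are illustrative assumptions, not the original script (the English-only filter from the first bullet is also omitted here):

```python
import json
from pathlib import Path
from bs4 import BeautifulSoup

SECTIONS = ["Symptoms", "Details", "Solution"]  # assumed section headings

def parse_article(html_path: Path) -> dict:
    soup = BeautifulSoup(html_path.read_text(encoding="utf-8"), "html.parser")
    doc = {"title": soup.title.get_text(strip=True) if soup.title else ""}
    # Walk heading and paragraph tags, bucketing text under the last seen section.
    current = None
    for tag in soup.find_all(["h2", "h3", "p"]):
        text = tag.get_text(strip=True)
        if text in SECTIONS:
            current = text
            doc[current] = []
        elif current and text:
            doc[current].append(text)
    return doc

Path("json_out").mkdir(exist_ok=True)
for path in Path("articles").glob("*.html"):
    parsed = parse_article(path)
    Path("json_out", path.stem + ".json").write_text(json.dumps(parsed, indent=2))
```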
Text Transformation Module (TTM):
- Converts every word to lower case and removes excess whitespace, punctuation, and specific symbols.
- Removes English stop words.
- Lemmatization – reduces each word to its base form (its lemma).
- Applies weights to specific tags, paragraphs, and keywords in the corpus (for example, titles carry more weight than other parts of the document).
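A rough sketch of these steps using NLTK; the report does not name the libraries, and the weighting scheme shown (simply repeating title tokens) is an assumption about how the tag weights might be applied:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
TITLE_WEIGHT = 3  # assumed: title tokens counted three times

def transform(text: str, is_title: bool = False) -> list[str]:
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)           # drop punctuation and symbols
    tokens = [t for t in text.split() if t not in STOP]
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    return lemmas * TITLE_WEIGHT if is_title else lemmas
```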
TF-IDF Matrix:
- The name is short for Term Frequency–Inverse Document Frequency.
- The matrix tracks how often each lemma appears in each document, scaled down by how common the lemma is across the whole corpus.
- The TF-IDF matrix is built from the output of the TTM.
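One way to build the matrix with scikit-learn, assuming the TTM emits pre-tokenized lemma lists (the identity analyzer below bypasses re-tokenization; the toy input is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs_as_lemmas = [["vm", "snapshot", "fail"], ["host", "reboot", "vm"]]  # toy input
vectorizer = TfidfVectorizer(analyzer=lambda doc: doc)  # each doc is already a token list
tfidf = vectorizer.fit_transform(docs_as_lemmas)        # sparse documents x lemmas matrix
print(vectorizer.get_feature_names_out())
```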
We applied Latent Dirichlet Allocation (LDA), an unsupervised soft-clustering algorithm, to model the data. The model treats documents as bags of words: syntax and semantics are not taken into account at the model level, only word counts. LDA tries to group words that frequently occur together.
LDA algorithm steps:
- Randomly assign each word in each document to a topic.
- For each word in each document:
  - Assume that its topic assignment might be wrong, but that all the other assignments are right.
  - Reassign the topic for the current word based on the other topic assignments.
- Keep going until the topic assignments for all the words in all the documents are stable.
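To make the loop concrete, below is a toy collapsed Gibbs sampler for LDA. This is a textbook sketch, not the code used in the case (which relied on a library implementation); every name in it is illustrative:

```python
import numpy as np

def gibbs_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    vocab = sorted({w for d in docs for w in d})
    w2i = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    # Step 1: random initial topic assignment for every word occurrence.
    z = [[rng.integers(n_topics) for _ in d] for d in docs]
    ndk = np.zeros((len(docs), n_topics))   # topic counts per document
    nkw = np.zeros((n_topics, V))           # word counts per topic
    nk = np.zeros(n_topics)                 # total words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w2i[w]] += 1; nk[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment, assume the rest are right...
                ndk[d, k] -= 1; nkw[k, w2i[w]] -= 1; nk[k] -= 1
                # ...and resample it conditioned on all the other assignments.
                p = (ndk[d] + alpha) * (nkw[:, w2i[w]] + beta) / (nk + V * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w2i[w]] += 1; nk[k] += 1
    return nkw, vocab

docs = [["vm", "snapshot", "fail"], ["host", "reboot", "vm"]]
topic_word, vocab = gibbs_lda(docs, n_topics=2)
```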
Parameters we used for tuning the model:
- n_samples = None # number of articles to process – None for all
- n_features = 500 # the number of terms (vocabulary size) kept by the vectorizer
- n_components = 5 # the number of topics/clusters to create
- n_top_words = 10 # for printing the top-most words that define a given topic
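These parameter names match scikit-learn's topic-extraction example, which the case appears to follow; the sketch below shows how they could plug into that pipeline (the corpus loader is a hypothetical stand-in):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

n_samples, n_features, n_components, n_top_words = None, 500, 5, 10

corpus = load_corpus()[:n_samples]  # hypothetical loader; slicing by None keeps all
vectorizer = TfidfVectorizer(max_features=n_features, stop_words="english")
dtm = vectorizer.fit_transform(corpus)  # LDA is usually fit on raw counts,
                                        # but accepts any non-negative matrix

lda = LatentDirichletAllocation(n_components=n_components, random_state=0)
lda.fit(dtm)

# Print the n_top_words highest-weight words for each topic.
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[: -n_top_words - 1 : -1]]
    print(f"Topic {k}: {', '.join(top)}")
```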
Results – Visualization and Evaluation
Interactive graphical visualization of the results was implemented to show the word distributions among the topics/clusters in a more intuitive, human-friendly way. The topic distribution for each document can also be viewed in tabular form. The results file is attached to the article, along with a short video explanation of the case. The procedure is reproducible.
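The report does not name the visualization tool; pyLDAvis is one common choice, and this sketch assumes it, reusing the fitted objects from the parameter example above:

```python
import pyLDAvis
import pyLDAvis.sklearn  # renamed to pyLDAvis.lda_model in pyLDAvis >= 3.4

panel = pyLDAvis.sklearn.prepare(lda, dtm, vectorizer)
pyLDAvis.save_html(panel, "lda_topics.html")  # interactive word-per-topic view

doc_topics = lda.transform(dtm)  # per-document topic distribution, for the tabular view
```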
The visualization of the word-per-topic distributions was used for a short exploratory data analysis (EDA). We obtained better-defined clusters by identifying words that should have been excluded but were not removed by the TTM: words that appear frequently across too many topics without being very relevant to any one of them, such as "VMware", "thanks", "sign-on", "machine", "virtual", and "manager".
Possible steps for improving the LDA model:
- Automating the removal of words that are prevalent in many topics. This was done manually as a result of the EDA, but it could easily be automated for any number of overlapping topics (see the sketch after this list).
- Introducing n-grams into the model.
- Expanding the model to support other languages.
- Applying weights to the TF-IDF matrix to make LDA semi-supervised – essentially trying to make the model create a well-defined cluster for a specific group of words.
- Comparing the LDA solution to a Fuzzy Clustering model.
- Fitting other models on individual clusters produced by the primary model.
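As referenced in the first bullet, one possible automation of the manual EDA step is to flag words that rank among the top words of many topics; the threshold below is an assumption, and the function reuses the lda and words objects from the parameter example:

```python
def overlapping_words(lda, words, n_top_words=10, max_topics=3):
    # Count in how many topics each word appears among the top words.
    counts = {}
    for topic in lda.components_:
        for i in topic.argsort()[: -n_top_words - 1 : -1]:
            counts[words[i]] = counts.get(words[i], 0) + 1
    # A word defining more than max_topics topics is a removal candidate.
    return [w for w, c in counts.items() if c > max_topics]
```

The returned candidates would be appended to the vectorizer's stop-word list and the model refit, repeating until no word defines more than the allowed number of topics.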
Platform – Microsoft