Team solutions

Critical Outliers – VMware Case

Overview of the data flow:
Data => Parser => TTM => TF-IDF => Model => Document Clusters
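
For reference, a minimal sketch of such a pipeline with scikit-learn could look as follows. The document texts, parameter values, and the hard topic assignment at the end are illustrative assumptions, not the team’s actual code.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder texts; in practice, the parsed KB article texts go here.
docs = [
    "esxi host loses network connectivity after upgrade",
    "vcenter server service fails to start after reboot",
    "virtual machine fails to power on due to insufficient resources",
]

# TF-IDF term-document matrix built from the parsed articles.
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X = vectorizer.fit_transform(docs)

# LDA topic model; n_components is the number of topics/clusters.
lda = LatentDirichletAllocation(n_components=5, random_state=0)
doc_topics = lda.fit_transform(X)        # per-document topic distributions
clusters = doc_topics.argmax(axis=1)     # hard cluster = most likely topic
print(clusters)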


9 thoughts on “Critical Outliers – VMware Case”

  1. (2 votes)

    This is a reasonable solution: it is based on LDA, which is a state-of-the-art clustering algorithm. The preprocessing makes sense, as do the implementation and the planned future work. There is also visualization and some analysis, which is nice to see.

    My worry is about the number of clusters: why 5 clusters? How was this number selected? Is it too few, given that there are so many articles?

    1. (0 votes)

      Regarding the topics, we did suspect that the actual number would be higher, but we somehow got the impression that 5 topics was what we were expected to look for in this case. Perhaps we misunderstood the requirements.
      Getting the model running took about a day, which left us very little time to play with the data and get to know it. To be sure, the model needs refining, but we were more focused on getting the whole thing to work from start to finish.

    2. (0 votes)

      Thanks for the comment. Yes, 5 clusters is too small a number; we tried 15 and 20 as well. We planned to run a hyperparameter search over the number of features, the number of clusters, and the ‘choice’ of segmented articles (the articles are split into paragraphs such as title, symptoms, details, products, etc.), with the text of the different paragraphs given different importance during preprocessing (lemmatization and TF-IDF vectorization); a sketch of such a sweep is included below. We could not experiment much, since building the pipeline structure with the different steps implemented took us too much time. We created a 2-minute video describing the solution, but unfortunately we may have forgotten to save it in the article:
      https://www.youtube.com/watch?v=VvL0tEHyw_U
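
      A minimal sketch of such a sweep, assuming a scikit-learn pipeline and held-out perplexity as the comparison criterion; the corpus, vocabulary sizes, and topic counts below are placeholders.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.decomposition import LatentDirichletAllocation
      from sklearn.model_selection import train_test_split

      # Placeholder corpus standing in for the parsed KB articles.
      texts = ["esxi host loses network connectivity",
               "esxi host purple screen after upgrade",
               "vcenter server service fails to start",
               "vcenter server certificate error",
               "virtual machine fails to power on",
               "virtual machine snapshot consolidation fails",
               "vmotion fails between esxi hosts",
               "datastore latency on esxi host"]
      train, test = train_test_split(texts, test_size=0.25, random_state=0)

      for max_features in (1000, 5000):          # vocabulary size
          vec = TfidfVectorizer(max_features=max_features)
          X_train, X_test = vec.fit_transform(train), vec.transform(test)
          for n_topics in (5, 10, 15, 20):       # number of topics/clusters
              lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
              lda.fit(X_train)
              # Lower held-out perplexity is better; look for the elbow.
              print(max_features, n_topics, lda.perplexity(X_test))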

  2. (2 votes)

    Not sure if the name “Critical Outliers” was chosen before seeing the data. I’ll say that this problem definitely needed that kind of critical thinking from the teams.
    It is definitely to the team’s credit that they have been successful in building an end-to-end working system to solve a quite complex but very practical industry problem in the NLP space.
    The approach of using LDA is quite classical and has been known to work on a variety of data, though the challenge is always to find the optimal model parameters to get it working effectively. It is very important to make sure that the model parameters are chosen such that the features don’t overlap significantly in the topic space – remember the elbow rule in the learning curve?
    In conclusion, I’ll say that it is nice to see a working system, but it would have been impressive to see some implementation of hierarchical models like agglomerative clustering.
    Best of luck,

    1. (0 votes)

      Thanks for the comment and the ideas. We wanted to experiment with clustering too. We also started to consider some fuzzy clustering and added it as a future idea in the article, since we didn’t have time to finish experimenting with it. We extracted about 20 000 KBs in English, so I am not sure hierarchical clustering will work with such a high number of documents. Therefore we were thinking about some hybrid clustering (fuzzy K-means + hierarchical); a sketch of the idea is included below. Hierarchical agglomerative clustering gives good results when we have a smaller number of elements and are looking for a high number of smaller clusters, whereas K-means-like clustering algorithms work well if we want to end up with a small number of large clusters. We are going to experiment with this next weekend for fun. Thanks for the ideas.
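
      A minimal sketch of the hybrid idea, assuming scikit-learn and a random dense stand-in for the TF-IDF matrix: K-means first reduces the documents to a few hundred small groups, then agglomerative clustering merges the group centroids into a handful of final clusters. All sizes and parameters are illustrative.

      import numpy as np
      from sklearn.cluster import KMeans, AgglomerativeClustering

      rng = np.random.default_rng(0)
      X = rng.random((2000, 100))          # stand-in for the TF-IDF document vectors

      # Step 1: K-means into many small groups (cheap even for ~20 000 documents).
      coarse = KMeans(n_clusters=200, n_init=10, random_state=0).fit(X)
      fine_labels = coarse.labels_

      # Step 2: agglomerative clustering on the 200 centroids only.
      agg = AgglomerativeClustering(n_clusters=15).fit(coarse.cluster_centers_)
      final_labels = agg.labels_[fine_labels]   # map documents to merged clusters
      print(np.bincount(final_labels))

      Running the agglomerative step only on the centroids keeps its quadratic distance computation small, while the K-means step scales to the full corpus.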

  3. (0 votes)

    You could include some results in the paper.
    LDA is a good algorithm for finding topics, but it requires specifying the number of topics. Finding the right number of topics can be tricky, since it requires some domain knowledge and a deep understanding of the data.
    The number of topics is too small.
    It could be helpful to segment the articles based on the metadata (product, version) and find topics (similarities) in each group.

    1. (0 votes)

      The number of topics is a parameter in the run.py script (sources_python.zip attachment), where it can be chosen arbitrarily; the output is a dataframe (CSV file) showing the distribution of article IDs over topics. The results for 5 topics are in the attached file results_document_to_5topics.py. There is also a segmentation over the different paragraphs -> title, symptoms, details, products, etc. We created a parser (Beautiful Soup), parse_htmls.py (sources_python.zip attachment), which parses the raw HTML articles into structured JSON files with separated paragraphs. You can run parse_htmls.py with the parameters htms_raw_directory and target_directory, the latter being where the parsed JSON files are written. After that we take the importance of each paragraph into account before lemmatisation and TF-IDF vectorization; for example, the title, products, and keywords paragraphs are taken to be 5 times more important than the other paragraphs (like details and purpose). A sketch of this parsing and weighting step is included below.
      Here is a short video describing the solution to the problem in 2 minutes. It should have been included in the article; it is strange that I don’t see it there, maybe we forgot to save something.
      https://www.youtube.com/watch?v=VvL0tEHyw_U
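
      A minimal sketch of this parsing and weighting step; the HTML layout, tag names, section names, and weights are assumptions for illustration and may differ from the team’s parse_htmls.py.

      import json
      from bs4 import BeautifulSoup

      # Simplified stand-in for a raw KB article.
      html = """<html><body>
        <h1>VM fails to power on</h1>
        <div class="products">VMware vSphere ESXi 6.5</div>
        <div class="symptoms">The virtual machine does not start.</div>
        <div class="details">Check host resources and review the logs.</div>
      </body></html>"""

      soup = BeautifulSoup(html, "html.parser")
      article = {
          "title":    soup.h1.get_text(strip=True),
          "products": soup.find("div", class_="products").get_text(strip=True),
          "symptoms": soup.find("div", class_="symptoms").get_text(strip=True),
          "details":  soup.find("div", class_="details").get_text(strip=True),
      }
      print(json.dumps(article, indent=2))   # one structured JSON record per article

      # Section weighting before vectorization: more important sections are simply
      # repeated so that their terms carry more weight in the term counts.
      weights = {"title": 5, "products": 5, "symptoms": 1, "details": 1}
      weighted_text = " ".join(" ".join([text] * weights[section])
                               for section, text in article.items())
      print(weighted_text)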

  4. (1 vote)

    Using LDA for solving this problem would have been ideal if one had known the number of topics, since the number of topics is a required input to the model. In my opinion, the best way to get the number of topics would have been to use some segmentation or agglomerative clustering. Once we have the number of topics, giving it as input to the LDA model would have given better results. Just limiting yourself to 5 topics doesn’t give the desired result. While I will give you a thumbs-up on the chosen algorithm, I think some more thought was needed on getting the optimal input parameters (the number of topics).

    1. (0 votes)

      Actually, our next step would be some hybrid clustering (combining e.g. agglomerative with top-down) to adaptively extract the optimal number of topics; a sketch of such a selection step is included below. The English KBs were too many (about 20 K, I think) for pure agglomerative clustering, which gives its best results for a high number of smaller clusters. Thanks for the ideas.
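
      A minimal sketch of one way to pick the number of topics adaptively with scikit-learn: read k off the elbow of the K-means inertia curve and pass it to LDA as n_components. The corpus and the chosen k are placeholders.

      from sklearn.cluster import KMeans
      from sklearn.decomposition import LatentDirichletAllocation
      from sklearn.feature_extraction.text import TfidfVectorizer

      # Placeholder corpus standing in for the parsed KB articles.
      texts = ["esxi host loses network connectivity",
               "esxi host purple screen after upgrade",
               "vcenter server service fails to start",
               "vcenter server certificate error",
               "virtual machine fails to power on",
               "virtual machine snapshot consolidation fails",
               "vmotion fails between esxi hosts",
               "datastore latency on esxi host"]
      X = TfidfVectorizer().fit_transform(texts)

      # Inspect the inertia curve and pick k where it flattens (the elbow).
      for k in range(2, 7):
          inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
          print(k, inertia)

      k = 4  # placeholder: the value read off the elbow
      lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)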
