For 20 years now, VMware, Inc. has been providing cloud computing and virtualization software and services. The current VMware product line includes software in multiple categories:
- Server software – vSphere, ESX, vCenter
- Networking and security – NSX
- Storage and availability – vSAN
- Cloud management – vRealize Automation, Horizon
- Desktop software – Workstation, Fusion
The VMware Knowledge Base provides support solutions, error messages and troubleshooting guides.
Currently, there are around 35,000 KB articles, covering multiple product versions and product combinations and written in several languages. With such a variety and number of articles, we face the problem of information being duplicated, or of the solution to the same problem being spread over multiple articles.
Your task is to cluster similar KB articles together so that each cluster discusses the same problem.
We do not restrict the techniques and algorithms you may use. Since there is no labeled data set with ground truth and the context of the articles is very specific, we encourage you to use unsupervised approaches.
You will need to go through the content of around 35,000 articles and find the ones that resolve similar issues.
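As one possible starting point, the sketch below groups texts whose TF-IDF cosine similarity exceeds a threshold. It is only an illustration of an unsupervised approach, not a recommended solution: the tokenizer, the smoothed IDF formula, the `threshold` value of 0.3, and the sample texts are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per document."""
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for counts in counts_per_doc for term in counts)
    n = len(docs)
    vectors = []
    for counts in counts_per_doc:
        # Smoothed IDF (an assumption; other weightings are equally valid).
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0)
               for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Dot product of two normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def cluster_by_threshold(docs, threshold=0.3):
    """Link every pair with similarity >= threshold; return a label per doc."""
    vectors = tfidf_vectors(docs)
    parent = list(range(len(docs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    # Renumber the union-find roots as consecutive cluster labels.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(docs))]


# Made-up texts, used only to exercise the functions above.
sample_docs = [
    "ESXi host fails to boot with purple diagnostic screen",
    "Purple diagnostic screen when ESXi host fails during boot",
    "vCenter Server login fails with invalid credentials",
]
sample_labels = cluster_by_threshold(sample_docs)
```

The pairwise loop is quadratic in the number of articles, so at the scale of the full data set a sparse-matrix library or an approximate nearest-neighbour index would likely be needed instead.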
KB articles usually follow a structure:
- Document id
- Purpose – brief summary of the guide, present if this article is a usage guide.
- Symptoms – symptoms of the system and problems that have occurred, present if this article is a troubleshooting guide.
- Cause – reasons why the issue might have occurred.
- Resolution – explains steps to be taken by the users.
- Workaround – steps that can be taken if the resolution guide is not applicable to the users’ case.
Each article also has metadata, which contains the last update date, view count, category, language, and the list of products to which the article applies. We are going to focus only on articles written in English.
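The article structure and metadata described above could be held in a small record type such as the one sketched here. The class name, field names, and the way the language value is encoded (for example "en" or "English") are assumptions for illustration; the actual data may represent them differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class KBArticle:
    """One KB article; section fields are None when the section is absent."""
    doc_id: str
    purpose: Optional[str] = None      # present for usage guides
    symptoms: Optional[str] = None     # present for troubleshooting guides
    cause: Optional[str] = None
    resolution: Optional[str] = None
    workaround: Optional[str] = None
    # Metadata (types and encodings are assumed, not taken from the data).
    last_updated: Optional[str] = None
    view_count: int = 0
    category: Optional[str] = None
    language: str = ""
    products: List[str] = field(default_factory=list)

    def text(self) -> str:
        """Concatenate the sections that are present, for use in clustering."""
        parts = [self.purpose, self.symptoms, self.cause,
                 self.resolution, self.workaround]
        return "\n".join(p for p in parts if p)


def english_only(articles):
    """Keep articles whose language value starts with 'en' (assumed encoding)."""
    return [a for a in articles if a.language.strip().lower().startswith("en")]


# Made-up records, used only to show the filter.
sample_articles = [
    KBArticle(doc_id="1001", symptoms="Host does not boot.",
              resolution="Update the driver.", language="English"),
    KBArticle(doc_id="1002", purpose="Installationsanleitung", language="German"),
]
english_articles = english_only(sample_articles)
```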
The expected output is a list of topics and the KB articles corresponding to each topic. You are not expected to provide a meaningful summary of each topic; in other words, you can simply enumerate your topics.
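A minimal sketch of producing such an output from per-article cluster labels is shown below. The `topic_N` naming, the ordering by cluster size, and the JSON serialization are assumptions; no particular file format is prescribed here.

```python
import json
from collections import defaultdict


def topics_from_labels(doc_ids, labels):
    """Group document ids by cluster label and enumerate the groups as topics."""
    groups = defaultdict(list)
    for doc_id, label in zip(doc_ids, labels):
        groups[label].append(doc_id)
    # Sort ids inside each topic, then order topics from largest to smallest.
    ordered = sorted((sorted(ids) for ids in groups.values()),
                     key=lambda ids: (-len(ids), ids))
    return {f"topic_{i}": ids for i, ids in enumerate(ordered, start=1)}


# Made-up ids and labels, used only to show the shape of the result.
sample_topics = topics_from_labels(["1001", "1002", "1003"], [7, 7, 3])
sample_json = json.dumps(sample_topics, indent=2)
```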
We are going to provide __NUMBER_TOP_TOPICS__ top topics, each with a list of corresponding KB document ids. The KB articles have been assigned to these topics by domain experts. You can use this set of topics as a validation set. We are also withholding a different set of topics with corresponding KBs to be used as a private test set when we evaluate your work. We are going to evaluate your work by looking through it by hand; however, you can still use the validation set to get a feeling for what we expect as an outcome.
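Since the final evaluation is done by hand, any automatic score is only a self-check. One option, sketched below, is pairwise precision and recall: of the article pairs you put in the same topic, how many did the experts also put together, and vice versa. The metric choice and the dictionary layout (topic name mapped to a list of document ids) are assumptions, not the organizers' scoring method.

```python
from itertools import combinations


def same_topic_pairs(topics, keep=None):
    """Return the set of unordered id pairs that share a topic."""
    pairs = set()
    for ids in topics.values():
        kept = sorted(i for i in set(ids) if keep is None or i in keep)
        pairs.update(combinations(kept, 2))
    return pairs


def pairwise_scores(predicted, validation):
    """Pairwise precision, recall and F1, restricted to validation articles."""
    labelled = {doc_id for ids in validation.values() for doc_id in ids}
    gold = same_topic_pairs(validation)
    pred = same_topic_pairs(predicted, keep=labelled)
    hits = len(gold & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Made-up topics, used only to exercise the scoring function.
sample_validation = {"A": ["1", "2", "3"], "B": ["4", "5"]}
sample_predicted = {"t1": ["1", "2"], "t2": ["3", "4", "5"], "t3": ["9"]}
sample_scores = pairwise_scores(sample_predicted, sample_validation)
```

Articles outside the validation set are ignored when counting predicted pairs, because nothing is known about which topic they belong to.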
- Download the data ___LINK___
- Run the example Python code, which parses the language metadata from one of the HTML files.
- Good luck
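The provided example code is not reproduced here. As a rough idea of what parsing the language metadata might look like, the sketch below uses only the Python standard library and assumes the language is stored in a tag of the form `<meta name="language" content="...">`. That tag layout is a guess for illustration; check the actual HTML files and the provided example for the real structure.

```python
from html.parser import HTMLParser


class LanguageMetaParser(HTMLParser):
    """Collect the content of the first <meta name="language"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.language = None

    def handle_starttag(self, tag, attrs):
        # Hypothetical layout: <meta name="language" content="English">
        if tag != "meta" or self.language is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "language":
            self.language = attributes.get("content")


def parse_language(html_text):
    """Return the language value found in the HTML, or None if absent."""
    parser = LanguageMetaParser()
    parser.feed(html_text)
    return parser.language


# Made-up HTML snippet, used only to exercise the parser.
sample_html = (
    '<html><head><meta name="language" content="English">'
    "<title>Sample KB</title></head><body></body></html>"
)
sample_language = parse_language(sample_html)
```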