Popular articles by jubi01
Popular comments by jubi01
Actually, our next step would be some hybrid clustering (e.g. combining agglomerative with top-down) to adaptively extract the optimal number of topics. The English KBs were too many (about 20 K, I think) for pure agglomerative clustering, which gives its best results for a high number of smaller clusters. Thanks for the ideas.
Thanks for the comment and the ideas. We wanted to experiment with clustering too. We also started to consider some fuzzy clustering and added it as a future idea in the article, since we didn’t have time to finish experimenting with it. We extracted about 20 000 KBs in English, so I am not sure hierarchical clustering will work with such a high number of documents. Therefore we were thinking about a hybrid clustering (fuzzy K-means + hierarchical). Hierarchical agglomerative clustering gives good results when we have a smaller number of elements and are looking for many small clusters, whereas K-means-like clustering algorithms work well if we want a small number of large clusters. We are going to experiment with this next weekend for fun. Thanks for the ideas.
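A minimal sketch of the hybrid idea described above (the data, cluster counts, and the "K-means first, then merge centroids agglomeratively" scheme are illustrative assumptions, not the actual pipeline):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

# Hypothetical stand-in for the ~20 000 tf-idf vectors of the English KBs.
X, _ = make_blobs(n_samples=2000, centers=12, n_features=50, random_state=0)

# Step 1: K-means with many clusters -- cheap even on large collections.
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X)

# Step 2: agglomerative clustering on the 100 centroids only,
# merging them down to the final number of topics.
agg = AgglomerativeClustering(n_clusters=10).fit(km.cluster_centers_)

# Map every document to its final (merged) topic.
final_labels = agg.labels_[km.labels_]
```

This avoids running agglomerative clustering directly on all 20 000 documents: the expensive pairwise step only sees the 100 centroids.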
Thanks for the comment. Yes, 5 clusters is too small a number. We tried with 15 and 20 as well. We planned to run a hyperparameter search over the number of features, the number of clusters, and the choice of segmented article paragraphs (the articles are split into title, symptoms, details, products, etc.), where the text of the different paragraphs is given different importance during preprocessing (lemmatization and tf-idf vectorization). We could not experiment much, since it took us too much time to build the structure of the pipeline with the different steps implemented. We created a video describing the solution in 2 minutes; unfortunately we may have missed saving it in the article:
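The hyperparameter search mentioned above could look roughly like this (the mini-corpus, the grid values, and the use of silhouette score as the selection criterion are my assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical mini-corpus standing in for the preprocessed KB articles.
docs = [
    "printer driver error after update",
    "printer spooler service stopped",
    "license activation key invalid",
    "activation server unreachable license",
    "display resolution settings reset",
    "monitor resolution flickers on boot",
] * 5

best = None
for max_features in (50, 100):       # size of the tf-idf vocabulary
    for n_clusters in (3, 5):        # number of topics to try
        X = TfidfVectorizer(max_features=max_features).fit_transform(docs)
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if best is None or score > best[0]:
            best = (score, max_features, n_clusters)
```

The same loop extends naturally to the other axes mentioned (which paragraphs to include, and their relative weights).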
The number of topics is a parameter in the run.py script (sources_python.zip attachment), where it can be chosen arbitrarily; the output is a dataframe (csv file) showing the distribution of article IDs over topics. The results for 5 topics are in the attached file results_document_to_5topics.py. There is also a segmentation over the different paragraphs (title, symptoms, details, products, etc.). We created a parser (Beautiful Soup), parse_htmls.py (sources_python.zip attachment), which parses the raw HTML articles into structured JSON files with separated paragraphs. You can run parse_htmls.py with the parameters htms_raw_directory and target_directory, where the JSON files will be written. After that, we take into account the importance of each paragraph before lemmatization and tf-idf vectorization. For example, the title, products, and keywords paragraphs are weighted 5 times more heavily than the others (like details and purpose).
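A rough sketch of what parse_htmls.py and the paragraph weighting could look like (the HTML tag names, the JSON layout, and the repeat-the-text-five-times weighting trick are my assumptions about the described approach, not the actual script):

```python
import json
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw KB article.
HTML = """
<html><body>
  <h1 class="title">Printer driver error</h1>
  <div class="symptoms">Printing fails after the latest update.</div>
  <div class="details">The spooler service crashes on startup.</div>
  <div class="products">Acme LaserJet 900</div>
</body></html>
"""

# Parse one raw HTML article into a dict with separated paragraphs
# (one JSON file per article in the real pipeline).
soup = BeautifulSoup(HTML, "html.parser")
article = {
    "title": soup.find(class_="title").get_text(strip=True),
    "symptoms": soup.find(class_="symptoms").get_text(strip=True),
    "details": soup.find(class_="details").get_text(strip=True),
    "products": soup.find(class_="products").get_text(strip=True),
}
article_json = json.dumps(article, indent=2)

# Weight paragraphs before tf-idf: repeat the important ones (title,
# products) 5 times so their terms dominate the document vector.
weights = {"title": 5, "symptoms": 1, "details": 1, "products": 5}
weighted_text = " ".join((article[p] + " ") * weights[p] for p in article)

X = TfidfVectorizer().fit_transform([weighted_text])
```

Repeating a paragraph inflates its term frequencies, which is a simple way to boost its influence without touching the vectorizer itself.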
This was a short video describing the solution of the problem in 2 minutes. It should be included in the article; it is strange that I don’t see it there, so maybe we missed saving something.
We have c