Improving Short Text Clustering by Similarity Matrix Sparsification
Md. Rashadul Hasan Rakib, Magdalena Jankowska, N. Zeh, E. Milios
Proceedings of the ACM Symposium on Document Engineering 2018, published 2018-08-28
DOI: 10.1145/3209280.3229114 (https://doi.org/10.1145/3209280.3229114)
Citations: 3
Abstract
Short text clustering is an important but challenging task. We investigate the impact of similarity matrix sparsification on the performance of short text clustering. We show that two sparsification methods (the proposed Similarity Distribution-based method and the k-nearest neighbors method), both of which aim to retain a prescribed number of similarity elements per text, improve the hierarchical clustering quality of short texts for various text similarity measures. With a word embedding-based similarity, these methods yield results competitive with state-of-the-art methods for short text clustering, especially for general-domain text, and are faster than the main state-of-the-art baseline.
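To make the idea concrete, the following is a minimal sketch of k-nearest-neighbors sparsification followed by hierarchical clustering. It is an illustration under stated assumptions, not the authors' implementation: the function names `knn_sparsify` and `average_link_clusters`, the union-based symmetrization, the choice of average linkage, and the toy similarity values are all hypothetical. The proposed Similarity Distribution-based method is not sketched here, because the abstract does not describe it in enough detail.

```python
from typing import List

Matrix = List[List[float]]


def knn_sparsify(sim: Matrix, k: int) -> Matrix:
    """Keep each text's k largest similarities to other texts; zero the rest."""
    n = len(sim)
    keep = [[False] * n for _ in range(n)]
    for i in range(n):
        # Rank the other texts by similarity to text i, most similar first.
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sim[i][j],
            reverse=True,
        )
        for j in neighbours[:k]:
            keep[i][j] = True
    # Assumption: a pair survives if either text selected the other,
    # which keeps the sparsified matrix symmetric.
    return [
        [sim[i][j] if i == j or keep[i][j] or keep[j][i] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def average_link_clusters(sim: Matrix, num_clusters: int) -> List[List[int]]:
    """Agglomerative clustering that merges the pair with the highest mean similarity."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > num_clusters:
        best_score, best_pair = float("-inf"), (0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                score = total / (len(clusters[a]) * len(clusters[b]))
                if score > best_score:
                    best_score, best_pair = score, (a, b)
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy similarity matrix for four short texts (values are made up).
sim = [
    [1.0, 0.9, 0.3, 0.2],
    [0.9, 1.0, 0.4, 0.1],
    [0.3, 0.4, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
sparse = knn_sparsify(sim, k=1)
print(average_link_clusters(sparse, num_clusters=2))
```

With k = 1, the weak cross-group similarities (0.1 to 0.4) are zeroed, so only the two strong pairs remain to drive the merges. In practice, the similarity matrix would come from a text similarity measure such as the word embedding-based one mentioned above, and k would be chosen to retain the prescribed number of similarity elements per text.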