Sima Rezaeipourfarsangi, Ningyuan Pei, Ehsan Sherkat, E. Milios
Interactive clustering and high-recall information retrieval using language models
Proceedings of the 2022 International Conference on Advanced Visual Interfaces, 2022-06-06
DOI: 10.1145/3531073.3531174 (https://doi.org/10.1145/3531073.3531174)
Citations: 2
Abstract
Clustering is a crucial text mining technique for organizing digital document sets, enabling users to understand their data better. Involving users has been shown to often improve clustering quality significantly. We propose a novel system that combines deep language models (SBERT, InferSent, and Universal Sentence Encoder) with interactive clustering, enabling users to steer the clustering algorithm towards results meaningful to them through interactive document and cluster visualizations. Our system is composed of several visual components, each of which allows the user to apply their domain knowledge to the clustering process. The use of deep language models for representing sentences addresses the vocabulary mismatch problem that affects bag-of-words representations of documents. We employ sentence embeddings to obtain document embeddings as input to the clustering algorithm, a modified version of K-means. We conduct a two-stage evaluation of our system. First, we evaluate the proposed clustering models on automatic clustering of various publicly available data sets and confirm that they are competitive with the state of the art. Second, we conduct a formal expert study on a specific data set consisting of our research group’s readings (research papers in machine learning, text mining, and natural language processing) over several years. The domain expert is a graduate student whose thesis is in these fields. The expert study concludes that our system is significantly better at producing meaningful clusters than the baseline system (Vis-Kt).
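The embedding-then-clustering pipeline described in the abstract can be illustrated with a minimal sketch. Several parts are assumptions, not the authors' method: the abstract does not say how sentence embeddings are combined into a document embedding, so mean pooling is used here; the clustering step is plain Lloyd's K-means, not the modified version the paper mentions; and the small random vectors stand in for the output of a sentence encoder such as SBERT. The function names (`document_embedding`, `farthest_first_init`, `kmeans`) are hypothetical.

```python
import numpy as np


def document_embedding(sentence_vectors: np.ndarray) -> np.ndarray:
    """Mean-pool (n_sentences, dim) sentence vectors into one unit-length vector.

    Mean pooling is an assumption; the abstract does not specify the pooling step.
    """
    doc = sentence_vectors.mean(axis=0)
    norm = np.linalg.norm(doc)
    return doc / norm if norm > 0 else doc


def farthest_first_init(X: np.ndarray, k: int) -> np.ndarray:
    """Deterministic seeding: start at X[0], then repeatedly add the point
    farthest from all centroids chosen so far."""
    centroids = [X[0]]
    for _ in range(1, k):
        # Distance from each point to its nearest already-chosen centroid.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    return np.array(centroids)


def kmeans(X: np.ndarray, k: int, n_iter: int = 50):
    """Plain Lloyd's K-means (not the paper's modified variant)."""
    centroids = farthest_first_init(X, k)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every document to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Toy stand-in for encoder output: six documents, three per topic, each made
# of a few noisy 4-dimensional "sentence embeddings" around a topic direction.
rng = np.random.default_rng(42)
topic_a = np.array([1.0, 0.0, 0.0, 0.0])
topic_b = np.array([0.0, 0.0, 1.0, 0.0])
docs = []
for topic in (topic_a, topic_a, topic_a, topic_b, topic_b, topic_b):
    n_sentences = int(rng.integers(3, 6))
    sentences = topic + 0.05 * rng.normal(size=(n_sentences, 4))
    docs.append(document_embedding(sentences))
X = np.vstack(docs)

labels, centroids = kmeans(X, k=2)
print(labels)
```

Because the document vectors are normalized to unit length, squared Euclidean distance between them is a monotone function of cosine similarity, so ordinary K-means groups documents by embedding direction. In an interactive setting, user feedback would adjust the assignments or centroids between iterations; that part is not sketched here.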