Out-of-vocabulary handling and topic quality control strategies in streaming topic models
Tung Nguyen, Tung Pham, Linh Ngo Van, Ha-Bang Ban, Khoat Than
Neurocomputing, Volume 614, Article 128757, published 2024-10-28
DOI: 10.1016/j.neucom.2024.128757
https://www.sciencedirect.com/science/article/pii/S0925231224015285
Citations: 0
Abstract
Topic models have become ubiquitous tools for analyzing streaming data. However, existing streaming topic models suffer from several limitations when applied to real-world data streams. These include the inability to accommodate evolving vocabularies and to control topic quality throughout the streaming process. In this paper, we propose a novel streaming topic modeling approach that dynamically adapts to the changing nature of data streams. Our method leverages Byte-Pair Encoding embeddings (BPEmb) to resolve the out-of-vocabulary problem that arises when new words appear in the stream. Additionally, we introduce a topic change variable that provides fine-grained control over updates to topic parameters, and we present a preservation approach that retains high-coherence topics at each time step, helping preserve semantic quality. To further enhance model adaptability, our method allows the size of the topic space to be adjusted dynamically as needed. To the best of our knowledge, we are the first to address vocabulary expansion and maintain topic quality during the streaming process. Extensive experiments show the superior effectiveness of our method.
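As a rough illustration of why subword embeddings help with out-of-vocabulary words, the sketch below builds a vector for an unseen word from the vectors of its subword pieces. Everything in it is an assumption made for illustration: the toy table `SUBWORD_VECTORS`, the helpers `segment` and `embed_word`, the greedy longest-match segmentation (a stand-in for learned BPE merges), and the averaging step. It does not use the BPEmb library and is not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical toy subword table. Real BPEmb vectors are pretrained
# and have far more entries and dimensions.
SUBWORD_VECTORS: Dict[str, List[float]] = {
    "stream": [1.0, 0.0],
    "ing": [0.0, 1.0],
    "topic": [0.5, 0.5],
    "s": [0.0, 0.25],
}


def segment(word: str, table: Dict[str, List[float]]) -> List[str]:
    """Split a word into known subwords by greedy longest match.

    This is a simplification: real BPE applies learned merge rules.
    """
    pieces: List[str] = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in table:
                pieces.append(word[i:j])
                i = j
                break
        else:
            i += 1  # no known subword starts here, skip one character
    return pieces


def embed_word(word: str, table: Dict[str, List[float]]) -> List[float]:
    """Average the subword vectors; return zeros if nothing matches."""
    dim = len(next(iter(table.values())))
    pieces = segment(word.lower(), table)
    if not pieces:
        return [0.0] * dim
    return [sum(table[p][d] for p in pieces) / len(pieces) for d in range(dim)]


# "streaming" is not a table entry, yet it still gets a vector
# from its pieces "stream" and "ing".
vector = embed_word("streaming", SUBWORD_VECTORS)
```

Because every word decomposes into pieces from a fixed subword inventory, a word that first appears late in the stream can still be mapped into the same embedding space as earlier words, which is the property the abstract relies on.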
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.