Integrating Embedding and LSHiForest in English Text Anomaly Detection
Authors: Qingquan Tong, Rongju Yao
Journal: Concurrency and Computation: Practice and Experience, Volume 37, Issue 3
Publication date: 2025-01-20
Publication type: Journal Article
DOI: 10.1002/cpe.8370
URL: https://onlinelibrary.wiley.com/doi/10.1002/cpe.8370
Journal metrics: Impact factor 1.5; JCR Q3 (Computer Science, Software Engineering)
Open access: No
Citation count: 0
Abstract
In natural language processing (NLP), anomaly detection identifies irregularities and outliers in textual data. Traditional methods often struggle with the high-dimensional, sparse nature of text, which makes detecting meaningful anomalies inefficient, especially in big data applications. To address these challenges, this paper proposes integrating LSHiForest (Locality-Sensitive Hashing Isolation Forest) into English text anomaly detection. LSHiForest combines the dimensionality reduction of locality-sensitive hashing (LSH) with the robust outlier detection of Isolation Forest, offering a novel approach to handling the complexities of textual data. The approach transforms English text into feature vectors and then applies LSHiForest to detect anomalies across various text datasets. Its effectiveness is evaluated in comparative experiments against traditional anomaly detection methods, using several performance metrics. The experimental results demonstrate that LSHiForest significantly improves the efficiency and accuracy of outlier identification in English text, particularly on large-scale, high-dimensional datasets.
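The two-stage pipeline described in the abstract (text to feature vectors, then an LSH-based isolation forest) can be illustrated with a minimal sketch. This is not the authors' implementation, and every name in it is hypothetical. It assumes a normalised bag-of-words vector as a stand-in for the paper's embedding, sign random projection (an angular LSH family) as the split function, and the path-length score borrowed from the standard Isolation Forest. The abstract does not specify the embedding, hash family, or scoring used in the paper.

```python
import math
import random
import re

def embed(corpus):
    """Stand-in embedding (assumption): L2-normalised bag-of-words counts."""
    tokenised = [re.findall(r"[a-z]+", doc.lower()) for doc in corpus]
    vocab = {}
    for toks in tokenised:
        for t in toks:
            vocab.setdefault(t, len(vocab))
    vectors = []
    for toks in tokenised:
        v = [0.0] * len(vocab)
        for t in toks:
            v[vocab[t]] += 1.0
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_tree(points, rng, depth=0, max_depth=10, tries=8):
    """Isolation tree whose branches are one-bit angular LSH keys (assumption)."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    dim = len(points[0])
    for _ in range(tries):
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random hyperplane
        bits = [_dot(w, p) >= 0.0 for p in points]     # hash bit per point
        pos = [p for p, b in zip(points, bits) if b]
        neg = [p for p, b in zip(points, bits) if not b]
        if pos and neg:  # the hash separated the points, so branch on it
            return {"w": w,
                    "pos": build_tree(pos, rng, depth + 1, max_depth, tries),
                    "neg": build_tree(neg, rng, depth + 1, max_depth, tries)}
    return {"size": len(points)}  # no draw separated them: stop at a leaf

def _c(n):
    """Isolation Forest normaliser: average unsuccessful-search path length."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def path_length(tree, x, depth=0):
    if "size" in tree:
        return depth + _c(tree["size"])
    branch = "pos" if _dot(tree["w"], x) >= 0.0 else "neg"
    return path_length(tree[branch], x, depth + 1)

def anomaly_scores(vectors, n_trees=100, seed=0):
    """Higher score means a shorter average path, i.e. easier to isolate."""
    rng = random.Random(seed)
    trees = [build_tree(vectors, rng) for _ in range(n_trees)]
    norm = _c(len(vectors)) or 1.0
    return [2.0 ** (-(sum(path_length(t, x) for t in trees) / n_trees) / norm)
            for x in vectors]

# Toy data (made up for illustration): 30 similar sentences plus one off-topic one.
words = "server request latency cache disk memory thread socket kernel packet".split()
sampler = random.Random(1)
corpus = [" ".join(sampler.sample(words, 8)) for _ in range(30)]
corpus.append("purple giraffes bake lemon cakes at midnight")

scores = anomaly_scores(embed(corpus))
most_anomalous = max(range(len(corpus)), key=scores.__getitem__)
print(corpus[most_anomalous], round(scores[most_anomalous], 3))
```

In this sketch each tree node draws a random hyperplane and branches on the resulting hash bit, so texts that share little vocabulary with the rest of the corpus tend to be separated after fewer splits and receive higher scores. A learned embedding or a different LSH family could be swapped in by replacing `embed` and the hyperplane draw in `build_tree`.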
About the journal:
Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality, original research papers, and authoritative research review papers, in the overlapping fields of:
Parallel and distributed computing;
High-performance computing;
Computational and data science;
Artificial intelligence and machine learning;
Big data applications, algorithms, and systems;
Network science;
Ontologies and semantics;
Security and privacy;
Cloud/edge/fog computing;
Green computing; and
Quantum computing.