Integrating Embedding and LSHiForest in English Text Anomaly Detection

IF 1.5 | Zone 4 (Computer Science) | JCR Q3, Computer Science, Software Engineering | Concurrency and Computation-Practice & Experience | Pub Date: 2025-01-20 | DOI: 10.1002/cpe.8370

Qingquan Tong, Rongju Yao

Volume 37, Issue 3 | Publication type: Journal Article | Open access: no | Publisher page: https://onlinelibrary.wiley.com/doi/10.1002/cpe.8370

Citations: 0

Abstract

In natural language processing (NLP), anomaly detection plays a critical role in identifying irregularities and outliers within textual data. Traditional methods often struggle with the high-dimensional and sparse nature of text data, leading to inefficiencies in detecting meaningful anomalies, especially in big data applications. To address these challenges, this paper proposes integrating LSHiForest (Locality-Sensitive Hashing Isolation Forest) into English text anomaly detection. LSHiForest, which combines the dimensionality reduction capabilities of locality-sensitive hashing (LSH) with the robust outlier detection of Isolation Forest, offers a novel approach to handling the complexities of textual data. The proposed approach transforms English text into feature vectors and then applies LSHiForest to detect anomalies across various text datasets. Its effectiveness is evaluated through comparative experiments against traditional anomaly detection methods, using several performance metrics. The experimental results demonstrate that LSHiForest significantly improves the efficiency and accuracy of outlier identification in English text, particularly in scenarios involving large-scale and high-dimensional datasets.
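
The two stages named in the abstract (text to feature vectors, then an LSH-based isolation forest) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the embedding or the LSH family, so the sketch assumes plain term-frequency vectors in place of a learned embedding and sign random projection (angular LSH) for the tree splits. All function names (tf_vectors, build_tree, path_length, anomaly_scores) and parameters (n_trees, max_depth, seed) are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter


def tf_vectors(docs):
    """Turn each document into an L2-normalised term-frequency vector.
    Assumption: a stand-in for whatever embedding the paper actually uses."""
    vocab = sorted({tok for d in docs for tok in d.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for d in docs:
        vec = [0.0] * len(vocab)
        for tok, count in Counter(d.lower().split()).items():
            vec[index[tok]] = float(count)
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def avg_path(n):
    """Average path length of an unsuccessful binary-search-tree lookup
    over n points, the usual Isolation Forest normalisation term."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(points, rng, depth=0, max_depth=10):
    """Split points by one angular-LSH bit: the sign of a random projection.
    A level where all points share the same bit still consumes one depth unit."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    plane = [rng.gauss(0.0, 1.0) for _ in range(len(points[0]))]
    left = [p for p in points if dot(p, plane) < 0.0]
    right = [p for p in points if dot(p, plane) >= 0.0]
    return {
        "plane": plane,
        "left": build_tree(left, rng, depth + 1, max_depth),
        "right": build_tree(right, rng, depth + 1, max_depth),
    }


def path_length(node, x):
    """Depth at which x reaches a leaf, plus a correction for leaf size."""
    depth = 0
    while "plane" in node:
        node = node["left"] if dot(x, node["plane"]) < 0.0 else node["right"]
        depth += 1
    return depth + avg_path(node["size"])


def anomaly_scores(vectors, n_trees=100, seed=0):
    """Isolation-Forest-style score in (0, 1]; higher means more anomalous."""
    rng = random.Random(seed)
    trees = [build_tree(vectors, rng) for _ in range(n_trees)]
    norm = avg_path(len(vectors)) or 1.0
    scores = []
    for x in vectors:
        mean_depth = sum(path_length(t, x) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores


docs = [
    "the service processed the request and returned status ok",
    "the service processed the request and returned status done",
    "the service processed the request and returned status fine",
    "purple elephants juggle quantum pineapples underwater",
]
for doc, score in zip(docs, anomaly_scores(tf_vectors(docs))):
    print(f"{score:.3f}  {doc}")
```

Documents that share little vocabulary with the rest tend to be separated from the others after only a few hash bits, so their average path is short and their score is high. The sketch builds every tree on the full corpus for simplicity; a practical forest would typically subsample per tree, and a production pipeline would likely replace tf_vectors with a dense embedding model.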

Source journal

Concurrency and Computation-Practice & Experience (Engineering & Technology - Computer Science: Theory & Methods)

CiteScore: 5.00
Self-citation rate: 10.00%
Articles published: 664
Review time: 9.6 months

Journal description: Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality original research papers and authoritative research review papers in the overlapping fields of: Parallel and distributed computing; High-performance computing; Computational and data science; Artificial intelligence and machine learning; Big data applications, algorithms, and systems; Network science; Ontologies and semantics; Security and privacy; Cloud/edge/fog computing; Green computing; and Quantum computing.