Alexander Bakumenko (Clemson University, USA), Kateřina Hlaváčková-Schindler (University of Vienna, Austria), Claudia Plant (University of Vienna, Austria), Nina C. Hubig (Clemson University, USA)
arXiv - QuantFin - Risk Management, published 2024-06-05. DOI: arxiv-2406.03614 (https://doi.org/arxiv-2406.03614)
Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs
Detecting anomalies in general ledger data is essential to ensuring the
trustworthiness of financial records. Financial audits increasingly rely on
machine learning (ML) algorithms to identify irregular or potentially
fraudulent journal entries, each characterized by a varying number of
transactions. In machine learning, heterogeneity in feature dimensions adds
significant complexity to data analysis. In this paper, we introduce a novel
approach to anomaly detection in financial data using Large Language Model
(LLM) embeddings. To encode non-semantic categorical data from real-world
financial records, we tested 3 pre-trained general-purpose sentence-transformer
models. For the downstream classification task, we implemented and evaluated 5
optimized ML models: Logistic Regression, Random Forest, Gradient
Boosting Machines, Support Vector Machines, and Neural Networks. Our
experiments demonstrate that LLMs contribute valuable information to anomaly
detection as our models outperform the baselines, in selected settings even by
a large margin. The findings further underscore the effectiveness of LLMs in
enhancing anomaly detection in financial journal entries, particularly by
tackling feature sparsity. We discuss a promising perspective on using LLM
embeddings for non-semantic data in the financial context and beyond.
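
The abstract does not spell out how journal entries with different numbers of transactions become fixed-size classifier inputs. As a minimal sketch of one common option (an assumption, not the authors' confirmed method), each transaction line can be embedded separately and the vectors mean-pooled into a single entry vector. Here `toy_embed` is a hypothetical hash-based stand-in for a sentence-transformer encoder, and the account codes are made up for illustration.

```python
import hashlib
from typing import List

DIM = 8  # toy embedding width; real sentence-transformer models use hundreds of dimensions


def toy_embed(text: str, dim: int = DIM) -> List[float]:
    """Hypothetical stand-in for a sentence-transformer encoder.

    Maps a string to a deterministic vector in [0, 1) by hashing. It carries no
    learned information and only mimics the "string in, fixed vector out" shape.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 256.0 for i in range(dim)]


def encode_journal_entry(transactions: List[str], dim: int = DIM) -> List[float]:
    """Mean-pool per-transaction vectors into one fixed-length entry vector."""
    if not transactions:
        return [0.0] * dim
    vectors = [toy_embed(t, dim) for t in transactions]
    # zip(*vectors) iterates over embedding dimensions, averaging across transactions.
    return [sum(col) / len(vectors) for col in zip(*vectors)]


# Two journal entries with different numbers of transaction lines (made-up account codes).
entry_a = ["ACC-1000 debit", "ACC-2000 credit"]
entry_b = ["ACC-1000 debit", "ACC-2000 credit", "ACC-3100 credit", "ACC-4000 debit"]

vec_a = encode_journal_entry(entry_a)
vec_b = encode_journal_entry(entry_b)
print(len(vec_a), len(vec_b))  # both entries now have the same feature dimension
```

With a real encoder in place of `toy_embed`, the pooled vectors could be passed to any of the downstream classifiers the abstract lists, since every entry ends up with the same dimensionality regardless of how many transactions it contains.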