Anomaly Detection Between Judicial Text-Based Documents

2020 IEEE 14th International Conference on Application of Information and Communication Technologies (AICT), Pub Date: 2020-10-07, DOI: 10.1109/AICT50176.2020.9368621

Mukhsimbayev Bobur, Kuralbayev Aibek, Bekbaganbetov Abay, Fuad Hajiyev

Citations: 1

Abstract

The problem of searching for anomalies, or outliers, is extremely important in fields such as fraud detection, crime research, network reliability analysis, and medical diagnostics. What is an anomaly in the judicial system? A court case is considered an anomaly if the judge's decision differs significantly from existing decisions in similar cases. Most existing outlier search methods operate in high-dimensional domains, in which the data can contain hundreds of dimensions. Such an approach requires a lot of resources and is clearly inefficient.

Objectives: In this article, the authors:
• present two methods (or two models) for searching for anomalies in judicial practice;
• give a comparative analysis of the effectiveness of both methods.

Methodology: The first method for searching for anomalies combines two kinds of model: classification and similarity algorithms. It uses logistic regression, Extreme Gradient Boosting (XGBoost), and TensorFlow for classification, and Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI) to find similar documents. The second method uses the Bidirectional Encoder Representations from Transformers (BERT) embedding model together with the Annoy indexing model.

Findings: The second method gives better and faster results when searching for outliers.

Data source: The authors used the set of acts provided by the Supreme Court of the Republic of Kazakhstan. The dataset contains 1 million text documents and metadata.
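The definition of an anomaly given above (a decision that differs from existing decisions in similar cases) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes document embeddings have already been computed (for example by a BERT model), uses a brute-force cosine search as a stand-in for an approximate index such as Annoy, and the function name `flag_anomalies`, the parameters `k` and `threshold`, and the toy data are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def flag_anomalies(embeddings, decisions, k=3, threshold=0.5):
    """Return indices of cases whose decision disagrees with more than
    `threshold` of their k most similar cases (hypothetical rule)."""
    flagged = []
    for i, vec in enumerate(embeddings):
        # Brute-force nearest neighbours; an approximate index would
        # replace this step on a large corpus.
        neighbours = sorted(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: cosine(vec, embeddings[j]),
            reverse=True,
        )[:k]
        if not neighbours:
            continue
        disagree = sum(decisions[j] != decisions[i] for j in neighbours)
        if disagree / len(neighbours) > threshold:
            flagged.append(i)
    return flagged


# Toy 2-D "embeddings": two groups of similar cases (made-up values).
embeddings = [
    [1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.95, 0.15],   # group A
    [0.1, 1.0], [0.0, 0.9], [0.2, 1.0], [0.1, 0.95],    # group B
]
# Case 3 sits in group A but carries the opposite decision.
decisions = ["granted", "granted", "granted", "denied",
             "denied", "denied", "denied", "denied"]

anomalies = flag_anomalies(embeddings, decisions, k=3)
print(anomalies)
```

In this toy setup only the case whose decision contradicts all of its nearest neighbours is flagged. On a corpus of the size described in the abstract, the exhaustive neighbour search would be too slow, which is the role an approximate nearest-neighbour index plays.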

