Mukhsimbayev Bobur, Kuralbayev Aibek, Bekbaganbetov Abay, Fuad Hajiyev
Title: Anomaly Detection Between Judicial Text-Based Documents
Published in: 2020 IEEE 14th International Conference on Application of Information and Communication Technologies (AICT)
Publication date: 2020-10-07
DOI: 10.1109/AICT50176.2020.9368621 (https://doi.org/10.1109/AICT50176.2020.9368621)
Citation count: 1
Abstract
The problem of searching for anomalies, or outliers, is extremely important in many fields, with applications such as fraud detection, crime research, network reliability analysis, and medical diagnostics.

What is an anomaly in the judicial system? A court case is considered an anomaly if the judge’s decision differs significantly from existing decisions in similar cases.

Most existing outlier search methods work in high-dimensional domains, where data can contain hundreds of dimensions. Such an approach requires a lot of resources and is not efficient.

Objectives: In this article, the authors:
• present two methods (or two models) for searching for anomalies in judicial practice;
• give a comparative analysis of the effectiveness of both methods.

Methodology: The first method for searching for anomalies combines two kinds of models: classification and similarity algorithms. It uses logistic regression, Extreme Gradient Boosting (XGBoost), and TensorFlow for classification, and Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI) to find similar documents. The second method uses the Bidirectional Encoder Representations from Transformers (BERT) embedding model and the Annoy indexing model.

Findings: The second method gives better and faster results when searching for outliers.

Data source: The authors used the set of acts provided by the Supreme Court of the Republic of Kazakhstan. The dataset contains 1 million text documents and metadata.
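
As a rough illustration of the anomaly definition in the abstract (a decision that differs from the decisions in similar cases), the following Python sketch flags a case when too few of its most similar cases share its decision. It is not the authors' implementation: the toy 3-dimensional vectors stand in for BERT embeddings, the brute-force cosine search stands in for an Annoy index, and the function names, the `k` and `min_agreement` parameters, and the agreement rule itself are assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbours(index, vectors, k):
    """Brute-force top-k most similar cases (a stand-in for an Annoy index)."""
    scored = [
        (cosine_similarity(vectors[index], other), j)
        for j, other in enumerate(vectors)
        if j != index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [j for _, j in scored[:k]]


def find_anomalies(vectors, decisions, k=3, min_agreement=0.5):
    """Return indices of cases whose decision is shared by fewer than
    `min_agreement` of their k most similar cases.

    This agreement rule is a hypothetical choice, not taken from the paper.
    """
    anomalies = []
    for i in range(len(vectors)):
        neighbours = nearest_neighbours(i, vectors, k)
        votes = Counter(decisions[j] for j in neighbours)
        agreement = votes[decisions[i]] / len(neighbours)
        if agreement < min_agreement:
            anomalies.append(i)
    return anomalies


# Toy "embeddings"; real BERT vectors would have hundreds of dimensions.
case_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.85, 0.15, 0.05],
    [0.88, 0.12, 0.02],  # close to the three cases above, decided differently
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.95],
    [0.05, 0.05, 0.9],
]
case_decisions = [
    "granted", "granted", "granted", "denied",
    "denied", "denied", "denied",
]

anomalous_cases = find_anomalies(case_vectors, case_decisions, k=3)
print(anomalous_cases)
```

With this toy data, only the case at index 3 is flagged: its three most similar cases were all decided the other way. On a corpus of the size mentioned in the abstract, the brute-force search would be far too slow, which is where an approximate nearest-neighbour index such as Annoy becomes relevant.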