A classification model for Portuguese documents in the juridical domain
Luís Pinto, Andrés Melgar
2016 11th Iberian Conference on Information Systems and Technologies (CISTI), published 2016-06-15
DOI: 10.1109/CISTI.2016.7521594 (https://doi.org/10.1109/CISTI.2016.7521594)
Citations: 1
Abstract
The attorney's office in Brazil receives a large number of notifications every day. Procurators must analyze each notification manually to determine what kind of document to prepare in response. As a result, many notifications are not answered in time and lapse. This motivated the present work, whose main objective is to develop a computational model that interprets each notification and indicates what kind of response should be prepared. The model is built with machine-learning algorithms, and the problem is framed as classification of free-text documents. The texts were extracted from notification documents written in Portuguese. Performance was assessed using the area under the curve (AUC). Four algorithms were evaluated: k-Nearest Neighbor, Support Vector Machine, Naive Bayes and Complement Naive Bayes. They were trained on a collection of Portuguese documents in the juridical domain comprising 5471 documents divided into 8 categories. 25-fold cross-validation was used to obtain an unbiased estimate of each model's performance. This paper is thus a comparative study of machine-learning algorithms for the categorization of notifications, and it resulted in a model that assigns each document to its corresponding class. The AUC values for Support Vector Machine, k-Nearest Neighbor, Naive Bayes and Complement Naive Bayes were 0.846, 0.831, 0.815 and 0.712, respectively. Of the four classifiers, Support Vector Machine achieved the highest AUC.
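The evaluation protocol described above (k-fold cross-validation scored by AUC) can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how AUC was extended to 8 classes, so the macro-averaged one-vs-rest scheme here is an assumption, as are the function names `binary_auc`, `macro_ovr_auc` and `k_fold_indices`, the integer class labels, and the contiguous, unshuffled folds.

```python
from typing import List, Sequence, Tuple


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (Mann-Whitney U formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(labels: Sequence[int],
                  class_scores: Sequence[Sequence[float]]) -> float:
    """Macro-averaged one-vs-rest AUC (an assumed multi-class extension).

    class_scores[i][c] is the classifier's score for class c on document i;
    class labels are assumed to be integers usable as column indices.
    """
    aucs = []
    for c in sorted(set(labels)):
        binary = [1 if y == c else 0 for y in labels]
        scores = [row[c] for row in class_scores]
        aucs.append(binary_auc(binary, scores))
    return sum(aucs) / len(aucs)


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous folds of near-equal size.

    Returns one (train_indices, test_indices) pair per fold. No shuffling or
    stratification is applied; a real experiment would likely want both.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds
```

With 5471 documents and k = 25, `k_fold_indices(5471, 25)` yields 21 test folds of 219 documents and 4 of 218. Each classifier would be trained on a fold's training indices and scored with `macro_ovr_auc` on its test indices, and the 25 scores averaged to give one figure per algorithm.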