A classification model for Portuguese documents in the juridical domain

Luís Pinto, Andrés Melgar
Published in: 2016 11th Iberian Conference on Information Systems and Technologies (CISTI), 2016-06-15
DOI: 10.1109/CISTI.2016.7521594 (https://doi.org/10.1109/CISTI.2016.7521594)
Citations: 1

Abstract

The attorney's office in Brazil receives a large number of notifications every day. Procurators must analyze these notifications manually to determine what kind of document to prepare in response. As a result, many notifications are not answered in time, causing them to lapse. This motivated the present work, whose main objective is to develop a computational model that interprets each notification and indicates what kind of response should be prepared for each situation. The model is built with machine-learning algorithms, and the problem is framed as classification of free-text documents. The texts were extracted from notification documents written in Portuguese. Performance was assessed by the area under the curve (AUC). Four algorithms were evaluated: k-Nearest Neighbor, Support Vector Machine, Naive Bayes and Complement Naive Bayes. The algorithms were trained on a collection of 5471 Portuguese documents in the juridical domain, divided into 8 categories. A 25-fold cross-validation was used to obtain an unbiased estimate of the performance of these prediction models. This paper is a comparative study of machine-learning algorithms for the problem of categorizing notifications. As a result of this study, a model was constructed to classify each document into its corresponding class. The AUC values of Support Vector Machine, k-Nearest Neighbor, Naive Bayes and Complement Naive Bayes were 0.846, 0.831, 0.815 and 0.712, respectively. Of these four classification models, Support Vector Machine achieved the highest AUC.
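The evaluation protocol described in the abstract (k-fold cross-validation scored by area under the curve) can be illustrated with a minimal Python sketch. The abstract does not say how a single AUC was obtained for 8 categories, so the macro one-vs-rest average below is an assumption, and the function names binary_auc, macro_ovr_auc and k_fold_indices are hypothetical helpers, not the authors' code.

```python
from itertools import product


def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(true_classes, class_scores):
    """Assumed aggregation: unweighted mean of one-vs-rest AUCs.

    class_scores[i] maps each class name to the score for document i.
    """
    classes = sorted(set(true_classes))
    aucs = []
    for c in classes:
        labels = [1 if t == c else 0 for t in true_classes]
        scores = [doc_scores[c] for doc_scores in class_scores]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / len(aucs)


def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx) pairs; each item is tested exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    # Toy scores: one of the four (positive, negative) pairs is misranked.
    print(binary_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
    # Fold sizes for the corpus size and fold count quoted in the abstract.
    sizes = [len(test) for _, test in k_fold_indices(5471, 25)]
    print(min(sizes), max(sizes))  # 218 219
```

In a real experiment each fold would train a classifier on train_idx, score the documents in test_idx, and the per-fold AUCs would be averaged. The interleaved split here ignores class balance; a stratified split is a common alternative when categories are uneven.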