
Proceedings of the Natural Legal Language Processing Workshop 2021: Latest Publications

Searching for Legal Documents at Paragraph Level: Automating Label Generation and Use of an Extended Attention Mask for Boosting Neural Models of Semantic Similarity
Pub Date: 2021 DOI: 10.18653/v1/2021.nllp-1.12
Li Tang, S. Clematide
Searching for legal documents is a specialized Information Retrieval task that is relevant for expert users (lawyers and their assistants) and for non-expert users. By searching previous court decisions (cases), a user can better prepare the legal reasoning of a new case. Being able to search using a natural language text snippet instead of a more artificial query can help prevent query-formulation issues. Also, if semantic similarity can be modeled beyond exact lexical matches, more relevant results can be found even when the query terms do not match exactly. For this domain, we formulated a task to compare different ways of modeling semantic similarity at paragraph level, using neural and non-neural systems. We compared systems that encode the query and the search collection paragraphs as vectors, enabling the use of cosine similarity for ranking results. After building a German dataset for cases and statutes from Switzerland, and extracting citations from cases to statutes, we developed an algorithm for estimating semantic similarity at paragraph level, using a link-based similarity method. When evaluating different systems in this way, we find that semantic similarity modeling by neural systems can be boosted with an extended attention mask that quenches noise in the inputs.
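A minimal sketch of the ranking setup this abstract describes appears below: query and collection paragraphs are encoded as vectors and ranked by cosine similarity. The encoder name (`bert-base-german-cased`), the mean-pooling step, and the way the "extended" attention mask suppresses noise tokens are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch only: encoder choice, mean pooling, and the assumed
# form of the noise-suppressing "extended" mask are not the paper's setup.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModel.from_pretrained("bert-base-german-cased")

def embed(texts, noise_token_ids=frozenset()):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    mask = batch["attention_mask"].clone()
    # Extended mask (assumed form): additionally zero out tokens flagged as
    # noise, so they are neither attended to nor counted in the pooled vector.
    for tid in noise_token_ids:
        mask[batch["input_ids"] == tid] = 0
    with torch.no_grad():
        out = model(input_ids=batch["input_ids"], attention_mask=mask)
    m = mask.unsqueeze(-1).float()
    # Mean-pool hidden states over the unmasked tokens only.
    return (out.last_hidden_state * m).sum(1) / m.sum(1).clamp(min=1e-9)

def rank(query, paragraphs, noise_token_ids=frozenset()):
    vecs = embed([query] + paragraphs, noise_token_ids)
    scores = torch.nn.functional.cosine_similarity(vecs[0:1], vecs[1:])
    return sorted(zip(paragraphs, scores.tolist()), key=lambda p: -p[1])
```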
Citations: 8
Few-shot and Zero-shot Approaches to Legal Text Classification: A Case Study in the Financial Sector
Pub Date: 2021 DOI: 10.18653/v1/2021.nllp-1.10
Rajdeep Sarkar, Atul Kr. Ojha, Jay Megaro, J. Mariano, Vall Herard, John P. Mccrae
The application of predictive coding techniques to legal texts has the potential to greatly reduce the cost of legal review of documents. However, there is such a wide array of legal tasks, and legislation evolves so continuously, that it is hard to construct sufficient training data to cover all cases. In this paper, we investigate few-shot and zero-shot approaches that require substantially less training data, and introduce a triplet architecture which, for promissory statements, produces performance close to that of a supervised system. This method allows predictive coding methods to be developed rapidly for new regulations and markets.
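A generic triplet-network sketch is shown below to illustrate the kind of architecture the abstract refers to: an encoder is trained so that an anchor text lies closer to a same-class positive than to a different-class negative. The encoder, margin, learning rate, and [CLS] pooling are assumptions for illustration; the authors' implementation may differ.

```python
# Generic triplet-network sketch (illustrative assumptions: encoder choice,
# margin value, and [CLS] pooling; not necessarily the authors' setup).
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
loss_fn = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # Use the [CLS] position as the text representation.
    return encoder(**batch).last_hidden_state[:, 0]

def triplet_step(anchors, positives, negatives):
    # Pull anchors toward same-class positives, push them from negatives.
    optimizer.zero_grad()
    loss = loss_fn(encode(anchors), encode(positives), encode(negatives))
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, a few-shot classifier built this way can label a new text by its nearest class prototype, i.e. the mean embedding of the handful of labeled examples per class.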
Citations: 7
Effectively Leveraging BERT for Legal Document Classification
Pub Date: 2021 DOI: 10.18653/v1/2021.nllp-1.22
Nut Limsopatham
Bidirectional Encoder Representations from Transformers (BERT) has achieved state-of-the-art performance on several text classification tasks, such as GLUE and sentiment analysis. Recent work in the legal domain has started to use BERT for tasks such as legal judgement prediction and violation prediction. A common practice when using BERT is to fine-tune a pre-trained model on a target task and truncate the input texts to the size of the BERT input (e.g. at most 512 tokens). However, due to the unique characteristics of legal documents, it is not clear how to effectively adapt BERT to the legal domain. In this work, we investigate how to deal with long documents, and how important it is to pre-train on documents from the same domain as the target task. We conduct experiments on two recent datasets: the ECHR Violation Dataset and the Overruling Task Dataset, which are multi-label and binary classification tasks, respectively. Importantly, a document from the ECHR Violation Dataset contains on average more than 1,600 tokens, while the documents in the Overruling Task Dataset are shorter (the maximum number of tokens is 204). We thoroughly compare several techniques for adapting BERT to long documents and compare different models pre-trained on the legal and other domains. Our experimental results show that we need to explicitly adapt BERT to handle long documents, as truncation leads to less effective performance. We also find that pre-training on documents similar to the target task results in more effective performance in several scenarios.
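The paper compares several adaptation techniques; as one common illustration (not necessarily the variant the paper finds best), a long document can be split into overlapping 512-token chunks whose per-chunk logits are then max-pooled. A minimal sketch, with the model name, stride, and pooling rule as assumed choices:

```python
# Chunk-and-pool sketch for classifying documents longer than BERT's
# 512-token limit. Model, stride, and max-pooling are illustrative choices.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

def classify_long(text, max_len=512, stride=128):
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    body = max_len - 2  # reserve two positions for [CLS] and [SEP]
    chunk_logits = []
    # Slide a window over the token ids, overlapping by `stride` tokens.
    for start in range(0, max(1, len(ids)), body - stride):
        chunk = [tokenizer.cls_token_id] + ids[start:start + body] \
                + [tokenizer.sep_token_id]
        with torch.no_grad():
            logits = model(input_ids=torch.tensor([chunk])).logits
        chunk_logits.append(logits)
        if start + body >= len(ids):
            break
    # Max-pool across chunks so one strongly predictive passage can decide.
    return torch.cat(chunk_logits).max(dim=0).values
```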
Citations: 15
Automating Claim Construction in Patent Applications: The CMUmine Dataset
Pub Date: 2021 DOI: 10.18653/v1/2021.nllp-1.21
O. Tonguz, Yiwei Qin, Yimeng Gu, Hyun Hannah Moon
Intellectual Property (IP) in the form of issued patents is a critical and highly desirable element of innovation in high-tech. In this position paper, we explore the possibility of automating the legal task of Claim Construction in patent applications via Natural Language Processing (NLP) and Machine Learning (ML). To this end, we first create a large dataset known as CMUmine™ and then demonstrate that, using NLP and ML techniques, Claim Construction in patent applications, a crucial legal task currently performed by IP attorneys, can be automated. To the best of our knowledge, this is the first public patent application dataset. Our results look very promising for automating the patent application process.
Citations: 2