Optimising Human-Machine Collaboration for Efficient High-Precision Information Extraction from Text Documents

ACM Journal on Responsible Computing | Pub Date: 2024-03-26 | DOI: 10.1145/3652591

Bradley Butcher, Miri Zilka, Jiri Hron, Darren Cook, Adrian Weller


Citations: 0

Abstract

From science to law enforcement, many research questions are answerable only by poring over a large number of unstructured text documents. While people can extract information from such documents with high accuracy, this is often too time-consuming to be practical. On the other hand, automated approaches produce nearly immediate results, but are not reliable enough for applications where near-perfect precision is essential. Motivated by two use cases from criminal justice, we consider the benefits and drawbacks of various human-only, human-machine, and machine-only approaches. Finding no tool well suited for our use cases, we develop a human-in-the-loop method for fast but accurate extraction of structured data from unstructured text. The tool is based on automated extraction followed by human validation, and is particularly useful in cases where purely manual extraction is not practical. Testing on three criminal justice datasets, we find that the combination of computer speed and human understanding yields precision comparable to manual annotation while requiring only a fraction of the time, and significantly outperforms the precision of all fully automated baselines.
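The abstract describes the tool only at a high level: automated extraction followed by human validation. As a rough illustration of that general pattern, and not the authors' implementation, the following Python sketch proposes candidate values with a regular expression and then keeps only those a reviewer accepts. Every name here (`Candidate`, `AGE_PATTERN`, `extract_candidates`, `validate`, `precision`) is hypothetical, the "age" field and its pattern are invented for the example, and the reviewer is modelled as a plain callback where a real tool would presumably show each snippet to a person.

```python
import re
from typing import Callable, Iterable, NamedTuple


class Candidate(NamedTuple):
    doc_id: int
    field: str
    value: str
    context: str  # text snippet a reviewer would see


# Hypothetical pattern for a single field; assumed for illustration only.
AGE_PATTERN = re.compile(r"\baged? (\d{1,3})\b", re.IGNORECASE)


def extract_candidates(docs: Iterable[str]) -> list[Candidate]:
    """Automated step: propose candidate values from each document."""
    candidates = []
    for doc_id, text in enumerate(docs):
        for match in AGE_PATTERN.finditer(text):
            start = max(0, match.start() - 30)
            end = min(len(text), match.end() + 30)
            candidates.append(
                Candidate(doc_id, "age", match.group(1), text[start:end])
            )
    return candidates


def validate(
    candidates: Iterable[Candidate],
    reviewer: Callable[[Candidate], bool],
) -> list[Candidate]:
    """Human step: keep only the candidates the reviewer accepts."""
    return [c for c in candidates if reviewer(c)]


def precision(predicted: set, gold: set) -> float:
    """Fraction of predicted items that are also in the gold set."""
    return len(predicted & gold) / len(predicted) if predicted else 0.0
```

In this sketch the automated step is cheap and may over-generate, while the validation step trades a small amount of reviewer time for higher precision. Comparing `precision` before and after `validate` against a hand-labelled gold set is one simple way to measure that trade-off.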


Source journal

ACM Journal on Responsible Computing

Self-citation rate

0.00%

Number of publications