Infectious risk events and their novelty in event-based surveillance: new definitions and annotated corpus

Impact Factor 1.8 · Zone 3, Computer Science · JCR Q3 (COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS) · Language Resources and Evaluation · Pub Date: 2024-03-05 · DOI: 10.1007/s10579-024-09728-w

François Delon, Gabriel Bédubourg, Léo Bouscarrat, Jean-Baptiste Meynard, Aude Valois, Benjamin Queyriaux, Carlos Ramisch, Marc Tanti



Abstract

Event-based surveillance (EBS) requires the analysis of an ever-increasing volume of documents, which calls for automated processing to support human analysts. Few annotated corpora are available for evaluating information extraction tools in the EBS domain. The main objective of this work was to build a corpus of documents representative of those collected in current EBS information systems, and to annotate them with events and their novelty. We proposed new definitions of infectious events and their novelty, suited to the background work of analysts in the EBS domain, and compiled a corpus of 305 documents describing 283 infectious events. Of the included documents, 36 were in French, representing a total of 11 events; the remainder were in English. We annotated novelty for the 110 most recent documents in the corpus, yielding 101 events. Inter-annotator agreement was 0.74 (F1-score) for event identification and 0.69 [95% CI: 0.51; 0.88] (Kappa) for novelty annotation. Overall agreement for entity annotation was lower and varied substantially with the type of entity considered (range 0.30–0.68). The corpus is a useful tool for developing and evaluating the algorithms and methods that EBS research teams propose for event detection and novelty annotation, with the aim of improving the operational processing of document flows. Its small size makes it less suitable for training natural language processing models, although this limitation tends to fade with few-shot learning methods.
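The two agreement measures named in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not confirm: the Kappa is taken to be Cohen's kappa for two annotators, an event is matched only when both annotators produce an identical (disease, location) pair, and all labels and events below are invented toy data, not drawn from the corpus. The function names `cohen_kappa` and `pairwise_f1` are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def pairwise_f1(events_a, events_b):
    """F1 between two annotators' event sets (exact-match assumption)."""
    set_a, set_b = set(events_a), set(events_b)
    if not set_a and not set_b:
        return 1.0
    matched = len(set_a & set_b)
    # 2*TP / (2*TP + FP + FN) simplifies to 2*|A and B| / (|A| + |B|).
    return 2 * matched / (len(set_a) + len(set_b))


# Toy novelty labels for eight documents (hypothetical).
novelty_a = ["new", "new", "known", "known", "new", "known", "new", "known"]
novelty_b = ["new", "known", "known", "known", "new", "known", "new", "new"]

# Toy events as (disease, location) pairs (hypothetical).
events_a = [("cholera", "Haiti"), ("measles", "France"), ("ebola", "DRC")]
events_b = [("cholera", "Haiti"), ("measles", "France"), ("dengue", "Brazil")]

kappa = cohen_kappa(novelty_a, novelty_b)
f1 = pairwise_f1(events_a, events_b)
print(f"kappa={kappa:.2f} f1={f1:.2f}")
```

F1 suits event identification because the annotators do not label a fixed list of items: each one produces an open set of events, so there is no count of "true negatives" from which to estimate chance agreement. Novelty, by contrast, is a label assigned to already-identified items, which is the setting kappa is designed for.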




Source journal

Language Resources and Evaluation (Engineering & Technology, Computer Science: Interdisciplinary Applications)

CiteScore: 6.50

Self-citation rate: 3.70%

Review time: >12 weeks

About the journal: Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine-readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain-specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state of the art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.