Infectious risk events and their novelty in event-based surveillance: new definitions and annotated corpus

IF 1.7 3区 计算机科学 Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Language Resources and Evaluation Pub Date : 2024-03-05 DOI:10.1007/s10579-024-09728-w
François Delon, Gabriel Bédubourg, Léo Bouscarrat, Jean-Baptiste Meynard, Aude Valois, Benjamin Queyriaux, Carlos Ramisch, Marc Tanti
{"title":"Infectious risk events and their novelty in event-based surveillance: new definitions and annotated corpus","authors":"François Delon, Gabriel Bédubourg, Léo Bouscarrat, Jean-Baptiste Meynard, Aude Valois, Benjamin Queyriaux, Carlos Ramisch, Marc Tanti","doi":"10.1007/s10579-024-09728-w","DOIUrl":null,"url":null,"abstract":"<p> Event-based surveillance (EBS) requires the analysis of an ever-increasing volume of documents, requiring automated processing to support human analysts. Few annotated corpora are available for the evaluation of information extraction tools in the EBS domain. The main objective of this work was to build a corpus containing documents which are representative of those collected in the current EBS information systems, and to annotate them with events and their novelty. We proposed new definitions of infectious events and their novelty suited to the background work of analysts working in the EBS domain, and we compiled a corpus of 305 documents describing 283 infectious events. There were 36 included documents in French, representing a total of 11 events, with the remainder in English. We annotated novelty for the 110 most recent documents in the corpus, resulting in 101 events. The inter-annotator agreement was 0.74 for event identification (F1-Score) and 0.69 [95% CI: 0.51; 0.88] (Kappa) for novelty annotation. The overall agreement for entity annotation was lower, with a significant variation according to the type of entities considered (range 0.30–0.68). This corpus is a useful tool for creating and evaluating algorithms and methods submitted by EBS research teams for event detection and annotation of their novelties, aiming at the operational improvement of document flow processing. The small size of this corpus makes it less suitable for training natural language processing models, although this limitation tends to fade away when using few-shots learning methods.\n</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"116 1","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Language Resources and Evaluation","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10579-024-09728-w","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0

Abstract

Event-based surveillance (EBS) requires the analysis of an ever-increasing volume of documents, requiring automated processing to support human analysts. Few annotated corpora are available for the evaluation of information extraction tools in the EBS domain. The main objective of this work was to build a corpus containing documents which are representative of those collected in the current EBS information systems, and to annotate them with events and their novelty. We proposed new definitions of infectious events and their novelty suited to the background work of analysts working in the EBS domain, and we compiled a corpus of 305 documents describing 283 infectious events. There were 36 included documents in French, representing a total of 11 events, with the remainder in English. We annotated novelty for the 110 most recent documents in the corpus, resulting in 101 events. The inter-annotator agreement was 0.74 for event identification (F1-Score) and 0.69 [95% CI: 0.51; 0.88] (Kappa) for novelty annotation. The overall agreement for entity annotation was lower, with a significant variation according to the type of entities considered (range 0.30–0.68). This corpus is a useful tool for creating and evaluating algorithms and methods submitted by EBS research teams for event detection and annotation of their novelties, aiming at the operational improvement of document flow processing. The small size of this corpus makes it less suitable for training natural language processing models, although this limitation tends to fade away when using few-shots learning methods.

Abstract Image

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
基于事件的监测中的传染性风险事件及其新颖性:新定义和注释语料库
基于事件的监控(EBS)需要分析越来越多的文件,这就需要自动处理来支持人工分析人员。用于评估 EBS 领域信息提取工具的注释语料很少。这项工作的主要目标是建立一个语料库,其中包含当前 EBS 信息系统中收集的具有代表性的文档,并为这些文档标注事件及其新颖性。我们对传染病事件及其新颖性提出了新的定义,以适应在 EBS 领域工作的分析人员的背景工作,我们编制了一个包含 305 篇文档的语料库,描述了 283 个传染病事件。其中包括 36 篇法文文档,共代表 11 个事件,其余为英文文档。我们对语料库中最新的 110 篇文档进行了新颖性注释,共产生 101 个事件。在事件识别方面,标注者之间的一致性为 0.74(F1-Score),在新颖性标注方面,标注者之间的一致性为 0.69 [95% CI: 0.51; 0.88](Kappa)。实体标注的总体一致性较低,根据考虑的实体类型不同而存在显著差异(范围为 0.30-0.68)。该语料库是一个有用的工具,可用于创建和评估 EBS 研究团队提交的事件检测和新颖性标注算法和方法,从而改进文档流处理的运行。该语料库的规模较小,因此不太适合用于训练自然语言处理模型,不过在使用少量学习方法时,这一限制会逐渐消失。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Language Resources and Evaluation
Language Resources and Evaluation 工程技术-计算机:跨学科应用
CiteScore
6.50
自引率
3.70%
发文量
55
审稿时长
>12 weeks
期刊介绍: Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state-of-the-art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.
期刊最新文献
Sentiment analysis dataset in Moroccan dialect: bridging the gap between Arabic and Latin scripted dialect Studying word meaning evolution through incremental semantic shift detection PARSEME-AR: Arabic reference corpus for multiword expressions using PARSEME annotation guidelines Normalized dataset for Sanskrit word segmentation and morphological parsing Conversion of the Spanish WordNet databases into a Prolog-readable format
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1