COfEE: A comprehensive ontology for event extraction from text

IF 3.4 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Computer Speech and Language Pub Date : 2025-01-01 Epub Date: 2024-07-31 DOI:10.1016/j.csl.2024.101702

Ali Balali, Masoud Asadpour, Seyed Hossein Jafari

{"title":"COfEE: A comprehensive ontology for event extraction from text","authors":"Ali Balali, Masoud Asadpour, Seyed Hossein Jafari","doi":"10.1016/j.csl.2024.101702","DOIUrl":null,"url":null,"abstract":"<div><p>Large volumes of data are constantly being published on the web; however, the majority of this data is often unstructured, making it difficult to comprehend and interpret. To extract meaningful and structured information from such data, researchers and practitioners have turned to Information Extraction (IE) methods. One of the most challenging IE tasks is Event Extraction (EE), which involves extracting information related to specific incidents and their associated actors from text. EE has broad applications, including building a knowledge base, information retrieval, summarization, and online monitoring systems. Over the past few decades, various event ontologies, such as ACE, CAMEO, and ICEWS, have been developed to define event forms, actors, and dimensions of events observed in text. However, these ontologies have some limitations, such as covering only a few topics like political events, having inflexible structures in defining argument roles, lacking analytical dimensions, and insufficient gold-standard data. To address these concerns, we propose a new event ontology, COfEE, which integrates expert domain knowledge, previous ontologies, and a data-driven approach for identifying events from text. COfEE comprises two hierarchy levels (event types and event sub-types) that include new categories related to environmental issues, cyberspace, criminal activity, and natural disasters that require real-time monitoring. In addition, dynamic roles are defined for each event sub-type to capture various dimensions of events. The proposed ontology is evaluated on Wikipedia events, and it is shown to be comprehensive and general. Furthermore, to facilitate the preparation of gold-standard data for event extraction, we present a language-independent online tool based on COfEE. A gold-standard dataset annotated by ten human experts consisting of 24,000 news articles in Persian according to the COfEE ontology is also prepared. To diversify the data, news articles from the Wikipedia event portal and the 100 most popular Persian news agencies between 2008 and 2021 were collected. Finally, we introduce a supervised method based on deep learning techniques to automatically extract relevant events and their corresponding actors.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101702"},"PeriodicalIF":3.4000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000858/pdfft?md5=edd34515a4d99328a0c8d35808aa0fe2&pid=1-s2.0-S0885230824000858-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230824000858","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/7/31 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Large volumes of data are constantly being published on the web; however, the majority of this data is often unstructured, making it difficult to comprehend and interpret. To extract meaningful and structured information from such data, researchers and practitioners have turned to Information Extraction (IE) methods. One of the most challenging IE tasks is Event Extraction (EE), which involves extracting information related to specific incidents and their associated actors from text. EE has broad applications, including building a knowledge base, information retrieval, summarization, and online monitoring systems. Over the past few decades, various event ontologies, such as ACE, CAMEO, and ICEWS, have been developed to define event forms, actors, and dimensions of events observed in text. However, these ontologies have some limitations, such as covering only a few topics like political events, having inflexible structures in defining argument roles, lacking analytical dimensions, and insufficient gold-standard data. To address these concerns, we propose a new event ontology, COfEE, which integrates expert domain knowledge, previous ontologies, and a data-driven approach for identifying events from text. COfEE comprises two hierarchy levels (event types and event sub-types) that include new categories related to environmental issues, cyberspace, criminal activity, and natural disasters that require real-time monitoring. In addition, dynamic roles are defined for each event sub-type to capture various dimensions of events. The proposed ontology is evaluated on Wikipedia events, and it is shown to be comprehensive and general. Furthermore, to facilitate the preparation of gold-standard data for event extraction, we present a language-independent online tool based on COfEE. A gold-standard dataset annotated by ten human experts consisting of 24,000 news articles in Persian according to the COfEE ontology is also prepared. To diversify the data, news articles from the Wikipedia event portal and the 100 most popular Persian news agencies between 2008 and 2021 were collected. Finally, we introduce a supervised method based on deep learning techniques to automatically extract relevant events and their corresponding actors.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

COfEE：从文本中提取事件的综合本体论

大量数据不断在网络上发布；然而，这些数据大多是非结构化的，因此难以理解和解释。为了从这些数据中提取有意义的结构化信息，研究人员和从业人员转向了信息提取（IE）方法。最具挑战性的信息提取任务之一是事件提取（EE），它涉及从文本中提取与特定事件及其相关人员有关的信息。EE 应用广泛，包括建立知识库、信息检索、总结和在线监控系统。过去几十年来，人们开发了各种事件本体，如 ACE、CAMEO 和 ICEWS，用于定义文本中观察到的事件的形式、参与者和维度。然而，这些本体论也有一些局限性，如仅涵盖政治事件等少数主题、在定义论点角色时结构不够灵活、缺乏分析维度以及黄金标准数据不足等。为了解决这些问题，我们提出了一个新的事件本体--COfEE，它整合了专家领域知识、以前的本体和数据驱动方法，用于从文本中识别事件。COfEE 包括两个层次（事件类型和事件子类型），其中包括与需要实时监控的环境问题、网络空间、犯罪活动和自然灾害相关的新类别。此外，还为每个事件子类型定义了动态角色，以捕捉事件的各个层面。在维基百科事件上对所提出的本体进行了评估，结果表明本体是全面和通用的。此外，为了便于准备用于事件提取的黄金标准数据，我们还提出了一种基于 COfEE 的、与语言无关的在线工具。我们还准备了一个黄金标准数据集，该数据集由十位人类专家根据 COfEE 本体注释的 24,000 篇波斯语新闻文章组成。为了使数据多样化，我们还从维基百科事件门户网站和 2008 年至 2021 年间最受欢迎的 100 家波斯语通讯社中收集了新闻文章。最后，我们引入了一种基于深度学习技术的监督方法，以自动提取相关事件及其相应的参与者。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Computer Speech and Language 工程技术-计算机：人工智能

CiteScore

11.30

自引率

4.70%

发文量

审稿时长

22.9 weeks

期刊介绍： Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.