Pub Date: 2024-05-01 · DOI: 10.1016/j.datak.2024.102306
Saman Jamshidi, Mahin Mohammadi, Saeed Bagheri, Hamid Esmaeili Najafabadi, Alireza Rezvanian, Mehdi Gheisari, Mustafa Ghaderzadeh, Amir Shahab Shahabi, Zongda Wu
Text classification plays a critical role in managing large volumes of electronically produced texts. As the number of such texts increases, manual analysis becomes impractical, necessitating an intelligent approach for processing information. Deep learning models have seen widespread application in text classification, including recurrent neural networks such as Many-to-One Long Short-Term Memory (MTO LSTM). Nonetheless, this model is limited by its reliance on only the last token for text labelling. To overcome this limitation, this study introduces a novel hybrid model that combines Bidirectional Encoder Representations from Transformers (BERT), Many-to-Many Long Short-Term Memory (MTM LSTM), and Decision Templates (DT) for text classification. In this new model, the text is first embedded using BERT, an MTM LSTM is then trained to approximate the target at each token, and finally the per-token approximations are fused using DT. The proposed model is evaluated on the well-known IMDB movie review dataset for binary classification and the Drug Review Dataset for multiclass classification. The results demonstrate superior performance in terms of accuracy, recall, precision, and F1 score compared to previous models. The hybrid model presented in this study holds significant potential for a wide range of text classification tasks and stands as a valuable contribution to the field.
Title: Effective text classification using BERT, MTM LSTM, and DT
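The Decision Template fusion step of the abstract above can be sketched in a few lines: each training text yields a decision profile (one soft prediction per token), the template for a class is the mean profile of that class, and a test text is assigned the class of the nearest template. This is a minimal illustration of the standard DT technique with invented toy numbers, not the authors' implementation; all names and shapes here are assumptions.

```python
# Decision Template (DT) fusion over per-token soft predictions, as used to
# fuse MTM LSTM token-level outputs. Toy data: 3 tokens (T=3), 2 classes (C=2).

def mean_profile(profiles):
    """Element-wise mean of a list of decision profiles (T x C matrices)."""
    T, C = len(profiles[0]), len(profiles[0][0])
    return [[sum(p[t][c] for p in profiles) / len(profiles) for c in range(C)]
            for t in range(T)]

def distance(p, q):
    """Squared Euclidean distance between two decision profiles."""
    return sum((a - b) ** 2 for row_p, row_q in zip(p, q)
               for a, b in zip(row_p, row_q))

def fit_templates(profiles, labels, n_classes):
    """One template per class: the mean profile of that class's samples."""
    return [mean_profile([p for p, y in zip(profiles, labels) if y == k])
            for k in range(n_classes)]

def predict(templates, profile):
    """Assign the class whose template is nearest to the test profile."""
    dists = [distance(t, profile) for t in templates]
    return dists.index(min(dists))

# Each sample: (per-token softmax outputs, class label).
train = [([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]], 0),
         ([[0.8, 0.2], [0.9, 0.1], [0.6, 0.4]], 0),
         ([[0.2, 0.8], [0.1, 0.9], [0.3, 0.7]], 1),
         ([[0.3, 0.7], [0.2, 0.8], [0.4, 0.6]], 1)]
templates = fit_templates([p for p, _ in train], [y for _, y in train], 2)
print(predict(templates, [[0.85, 0.15], [0.75, 0.25], [0.65, 0.35]]))  # 0
```

In the paper the profiles would come from the trained MTM LSTM rather than being hand-written, but the fusion rule is the same nearest-template match.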
Pub Date: 2024-05-01 · DOI: 10.1016/j.datak.2024.102307
José Antonio García-Díaz, Ghassan Beydoun, Rafel Valencia-García
Author profiling extracts authors' demographic and psychographic information by examining their writings. This information can then be used to improve the reader experience and to detect bots or propagators of hoaxes and/or hate speech. Author profiling can therefore be applied to build more robust and efficient Knowledge-Based Systems for tasks such as content moderation, user profiling, and information retrieval. It is typically performed automatically as a document classification task, and language models based on transformers have recently proven quite effective at it. However, the size and heterogeneity of novel language models make it necessary to evaluate them in context. The contributions of this paper are four-fold. First, we evaluate which language models are best suited to author profiling in Spanish; these experiments include basic, distilled, and multilingual models. Second, we evaluate how feature integration can improve performance on this task, comparing two distinct strategies: knowledge integration and ensemble learning. Third, we evaluate the ability of linguistic features to improve the interpretability of the results. Fourth, we evaluate the performance of each language model in terms of memory, training time, and inference time. Our results indicate that lightweight models can achieve performance similar to heavy models and that multilingual models are less effective than models trained on a single language. Finally, we confirm that the best models and feature-integration strategies ultimately depend on the context of the task.
Title: Evaluating Transformers and Linguistic Features integration for Author Profiling tasks in Spanish
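One of the two feature-integration strategies the abstract names, ensemble learning, can be illustrated with a simple soft-voting rule: average the class-probability vectors of a transformer-based classifier and a linguistic-feature classifier, then take the argmax. The probability vectors and weights below are invented for illustration; in the paper they would come from the trained models being compared.

```python
# Soft-voting ensemble sketch: fuse per-model class probabilities by a
# (weighted) average and predict the argmax class.

def soft_vote(prob_lists, weights=None):
    """Weighted average of per-model class-probability vectors."""
    n_models = len(prob_lists)
    weights = weights or [1.0 / n_models] * n_models
    n_classes = len(prob_lists[0])
    fused = [sum(w * p[c] for w, p in zip(weights, prob_lists))
             for c in range(n_classes)]
    return fused.index(max(fused)), fused

# Transformer model leans to class 1; linguistic-feature model is unsure.
transformer_probs = [0.30, 0.70]
linguistic_probs = [0.55, 0.45]
label, fused = soft_vote([transformer_probs, linguistic_probs])
print(label)  # 1
```

The alternative strategy, knowledge integration, would instead concatenate the linguistic feature vector with the transformer embedding before a single classifier; soft voting keeps the two models independent, which is often simpler to tune.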
Pub Date: 2024-05-01 · DOI: 10.1016/j.datak.2024.102312
R. Rajesh
We observe and analyze the causal relations among risk factors in a system, considering manufacturing supply chains. Seven major categories of risks were identified and scrutinized, and a detailed analysis of their causal relations using the grey influence analysis (GINA) methodology is outlined. Using an expert-response-based survey, we conduct an initial analysis of the risks with risk matrix analysis (RMA) and identify the high-priority risks. GINA is then applied to understand the causal relations among the risk categories, which is particularly useful in group decision-making environments. The RMA results place capacity risks (CR) and delays (DL) in the category of very high priority risks. The GINA results ratify the RMA conclusions and indicate that managers need to control and manage capacity risks (CR) and delays (DL) with high priority. Additionally, the GINA results show that the causal factors disruptions (DS) and forecast risks (FR) are of primary importance and, if unattended, can trigger several other risks in supply chains. Managers are recommended to identify disruptions at an early stage and to reduce forecast errors to avoid bullwhip effects in supply chains.
Title: Managerial risk data analytics applications using grey influence analysis (GINA)
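The RMA step described above can be sketched as a likelihood-times-impact scoring banded into priority levels. The 1-5 rating scale, the band thresholds, and the example scores below are all assumptions chosen for illustration (consistent with, but not taken from, the paper's survey data).

```python
# Minimal risk matrix analysis (RMA) sketch: score each risk category by
# likelihood x impact and band the product into priority levels.

def rma_priority(likelihood, impact):
    """Return a priority band for 1-5 likelihood and impact ratings."""
    score = likelihood * impact
    if score >= 20:
        return "very high"
    if score >= 12:
        return "high"
    if score >= 6:
        return "medium"
    return "low"

# Hypothetical (likelihood, impact) ratings for four of the risk categories.
risks = {"capacity risks (CR)": (5, 4), "delays (DL)": (4, 5),
         "disruptions (DS)": (3, 5), "forecast risks (FR)": (4, 3)}
for name, (l, i) in risks.items():
    print(f"{name}: {rma_priority(l, i)}")
```

GINA then goes further than this static matrix by modeling how one category's occurrence influences the others, which is what surfaces DS and FR as causal drivers.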
Pub Date: 2024-04-30 · DOI: 10.1016/j.datak.2024.102308
Ramla Belalta, Mouhoub Belazzoug, Farid Meziane
Disambiguating name mentions in texts is a crucial task in Natural Language Processing, especially in entity linking, and the credibility and efficiency of such systems depend largely on it. A given named entity mention in a text may correspond to many candidate entities in the knowledge base, so assigning the correct candidate from the whole set is very difficult. Collective entity disambiguation is a prominent approach to this problem. In this paper, we present a novel algorithm for collective entity disambiguation, called CPSR, based on a graph approach and semantic relatedness. A clique partitioning algorithm is used to find the best clique containing a set of candidate entities, and these candidates provide the answers to the corresponding mentions in the disambiguation process. To evaluate our algorithm, we carried out a series of experiments on seven well-known datasets, namely AIDA/CoNLL2003-TestB, IITB, MSNBC, AQUAINT, ACE2004, Cweb, and Wiki, with the Kensho Derived Wikimedia Dataset (KDWD) as the knowledge base. The experimental results show that our CPSR algorithm outperforms both the baselines and other well-known state-of-the-art approaches.
Title: A graph based named entity disambiguation using clique partitioning and semantic relatedness
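The collective idea behind a clique-based disambiguator can be sketched as follows: build a k-partite graph with one candidate set per mention, weight edges by semantic relatedness, and select one candidate per mention so total pairwise relatedness is maximal. The brute-force search below stands in for the paper's CPSR clique-partitioning algorithm, and the relatedness scores are invented for illustration.

```python
# Collective entity disambiguation as a best-clique search: choose one
# candidate entity per mention maximizing the sum of pairwise relatedness.
from itertools import combinations, product

def best_assignment(candidates, relatedness):
    """candidates: {mention: [entity, ...]};
    relatedness: {(e1, e2) sorted tuple: score}."""
    mentions = list(candidates)
    best, best_score = None, float("-inf")
    # Every combination of candidates is a clique in the k-partite graph.
    for combo in product(*(candidates[m] for m in mentions)):
        score = sum(relatedness.get(tuple(sorted(pair)), 0.0)
                    for pair in combinations(combo, 2))
        if score > best_score:
            best, best_score = dict(zip(mentions, combo)), score
    return best

cands = {"Paris": ["Paris_France", "Paris_Texas"],
         "Seine": ["Seine_River"]}
rel = {("Paris_France", "Seine_River"): 0.9,
       ("Paris_Texas", "Seine_River"): 0.1}
print(best_assignment(cands, rel))  # Paris resolves to Paris_France
```

Real systems avoid the exponential product by partitioning or pruning the candidate graph, which is precisely where a clique partitioning algorithm earns its keep.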
Pub Date: 2024-04-15 · DOI: 10.1016/j.datak.2024.102305
Thomas Gerald, Louis Tamames, Sofiane Ettayeb, Ha-Quang Le, Patrick Paroubek, Anne Vilnat
Generating education-related questions and answers remains an open problem, despite its usefulness for students, teachers, and teaching aids. Given textual course material, we are interested in generating non-factual questions that require an elaborate answer relying on analysis or reasoning. Despite the availability of annotated corpora of questions and answers, developing a generator using deep learning faces two main challenges. First, freely accessible data of sufficient quality are too scarce to train generative approaches. Second, for a stand-alone application, there is no explicit support to guide the generation toward complex questions. To tackle the first issue, we propose a new corpus based on education documents. For the second, we study several retargetable language algorithms that produce answers by extracting text spans from contextual documents to help question generation. We particularly study the contribution of deep neural syntactic parsing and transformer-based semantic representation, taking into account the question type (according to our specific question typology) and the contextual support text span. Additionally, recent advances in generation models have proven the efficiency of the instruction-based approach for natural language generation. Consequently, we present a first investigation of very large language models for generating questions in the education domain.
Title: CQuAE: A new Contextualized QUestion Answering corpus on Education domain
Pub Date: 2024-04-03 · DOI: 10.1016/j.datak.2024.102304
Tim Kreuzer, Panagiotis Papapetrou, Jelena Zdravkovic
Artificial intelligence and digital twins have become more popular in recent years and have seen usage across different application domains for various scenarios. This study reviews the literature at the intersection of the two fields, where digital twins integrate an artificial intelligence component. We follow a systematic literature review approach, analyzing a total of 149 related studies. In the assessed literature, a variety of problems are approached with an artificial intelligence-integrated digital twin, demonstrating its applicability across different fields. Our findings indicate that there is a lack of in-depth modeling approaches regarding the digital twin, while many articles focus on the implementation and testing of the artificial intelligence component. The majority of publications do not demonstrate a virtual-to-physical connection between the digital twin and the real-world system. Further, only a small portion of studies base their digital twin on real-time data from a physical system, implementing a physical-to-virtual connection.
Title: Artificial intelligence in digital twins—A systematic literature review
Understanding why some points in a data set are considered anomalies cannot be done without taking into account the structure of the regular points. Whereas many machine learning methods are dedicated either to the identification of anomalies or to the identification of the data's inner structure, we introduce a solution that addresses both tasks with a single data model, a variant of an isolation forest. The initial algorithm for constructing an isolation forest is revisited to preserve the data's inner structure without affecting the efficiency of the outlier detection. Experiments conducted on both synthetic and real-world data sets show that, in addition to improving the detection of abnormal data points, the proposed variant of isolation forest allows for a reconstruction of the subspaces of high density. It can therefore serve as a basis for a unified approach to detecting global and local anomalies, which is a necessary condition for providing users with informative descriptions of the data.
Title: Leveraging an Isolation Forest to Anomaly Detection and Data Clustering
Véronne Yepmo, Grégory Smits, Marie-Jeanne Lesot, Olivier Pivert
Pub Date: 2024-03-28 · DOI: 10.1016/j.datak.2024.102302
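The baseline the paper builds on can be shown with a minimal, one-dimensional isolation forest: random splits isolate outliers in fewer steps, so a shorter average path length means a more anomalous point. This is the standard algorithm only, not the paper's structure-preserving variant, and all parameters below are illustrative assumptions.

```python
# Minimal 1-D isolation-forest sketch: anomalies isolate in fewer random
# splits, so a shorter average path length means a more anomalous point.
import random

def build_tree(points, depth=0, max_depth=8):
    """Recursively split points at a uniform random threshold."""
    if len(points) <= 1 or depth >= max_depth:
        return len(points)  # leaf marker
    lo, hi = min(points), max(points)
    if lo == hi:
        return len(points)
    split = random.uniform(lo, hi)
    left = [p for p in points if p < split]
    right = [p for p in points if p >= split]
    return (split, build_tree(left, depth + 1, max_depth),
            build_tree(right, depth + 1, max_depth))

def path_length(tree, x, depth=0):
    """Depth at which x reaches a leaf of the tree."""
    if not isinstance(tree, tuple):
        return depth
    split, left, right = tree
    return path_length(left if x < split else right, x, depth + 1)

random.seed(0)
data = [random.gauss(0, 1) for _ in range(200)]  # dense cluster near 0
forest = [build_tree(data) for _ in range(50)]

def score(x):
    """Average path length over the forest (lower = more anomalous)."""
    return sum(path_length(t, x) for t in forest) / len(forest)

print(score(10.0) < score(0.0))  # the far-out point isolates sooner: True
```

The paper's variant changes how the trees are grown so that dense subspaces survive in the leaves, enabling clustering from the same structure; the scoring idea above is unchanged.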
Pub Date: 2024-03-19 · DOI: 10.1016/j.datak.2024.102301
Johannes Lohmöller, Jan Pennekamp, Roman Matzutt, Carolin Victoria Schneider, Eduard Vlad, Christian Trautwein, Klaus Wehrle
Data ecosystems emerged as a new paradigm to facilitate the automated and massive exchange of data from heterogeneous information sources between different stakeholders. However, the corresponding benefits come with unforeseen risks as sensitive information is potentially exposed, questioning data ecosystem reliability. Consequently, data security is of utmost importance and, thus, a central requirement for successfully realizing data ecosystems. Academia has recognized this requirement, and current initiatives foster sovereign participation via a federated infrastructure where participants retain local control over what data they offer to whom. However, recent proposals place significant trust in remote infrastructure by implementing organizational security measures such as certification processes before the admission of a participant. At the same time, the data sensitivity incentivizes participants to bypass the organizational security measures to maximize their benefit. This issue significantly weakens security, sovereignty, and trust guarantees and highlights that organizational security measures are insufficient in this context. In this paper, we argue that data ecosystems must be extended with technical means to (re)establish dependable guarantees. We underpin this need with three representative use cases for data ecosystems, which cover personal, economic, and governmental data, and systematically map the lack of dependable guarantees in related work. To this end, we identify three enablers of dependable guarantees, namely trusted remote policy enforcement, verifiable data tracking, and integration of resource-constrained participants. These enablers are critical for securely implementing data ecosystems in data-sensitive contexts.
Title: The unresolved need for dependable guarantees on security, sovereignty, and trust in data ecosystems
Pub Date: 2024-03-12 · DOI: 10.1016/j.datak.2024.102299
Nikolas Stege, Michael H. Breitner
Domain experts are driven by business needs, while data analysts develop and use various algorithms, methods, and tools, often without domain knowledge. A major challenge for companies and organizations is to integrate data analytics into business processes and workflows. We deduce an interactive process and visualization framework that enables value-creating collaboration in inter- and cross-disciplinary teams: domain experts and data analysts are both empowered to analyze and discuss results and arrive at well-founded insights and implications. Inspired by a typical auditing problem, we develop and apply a visualization framework to single out unusual data in general subsets for potential further investigation. Our framework is applicable both to unusual data detected manually by domain experts and to data flagged by algorithms applied by data analysts. Application examples show typical interaction, collaboration, visualization, and decision support.
Title: Insights into commonalities of a sample: A visualization framework to explore unusual subset-dataset relationships
Pub Date : 2024-03-11 DOI: 10.1016/j.datak.2024.102300
Wei Jia , Ruizhe Ma , Li Yan , Weinan Niu , Zongmin Ma
Entity alignment, which aims to identify equivalent entity pairs across multiple knowledge graphs (KGs), is a vital step in knowledge fusion. Because most KGs evolve continuously, existing solutions use graph neural networks (GNNs) to tackle entity alignment within temporal knowledge graphs (TKGs). However, these methods often overlook the impact that relation embedding generation has on entity embeddings through inherent structures. In this paper, we propose a novel model, Time-aware Structure Matching based on GNNs (TSM-GNN), that learns both topological and inherent structures. Our key innovation is a method for generating relation embeddings that enhances entity embeddings via the inherent structure. Specifically, we exploit the translation property of knowledge graphs (head + relation ≈ tail) to obtain entity embeddings mapped into a time-aware vector space, and then employ GNNs to learn global entity representations. To better capture useful information from neighboring relations and entities, we introduce a time-aware attention mechanism that assigns different importance weights to different time-aware inherent structures. Experimental results on three real-world datasets demonstrate that TSM-GNN outperforms several state-of-the-art approaches for entity alignment between TKGs.
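The time-aware attention idea above can be sketched minimally: each neighbor's embedding is combined with a temporal embedding, scored against the central entity, and the scores are softmax-normalized into importance weights. This is an illustrative NumPy sketch, not the paper's exact formulation; the function names, the additive time injection, and the bilinear scoring matrix `w` are all assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def time_aware_attention(entity_emb, neighbor_embs, time_embs, w):
    """Aggregate neighbor embeddings, weighting each neighbor by how well
    its time-aware feature matches the central entity (hypothetical layer,
    loosely following the time-aware attention described in the abstract)."""
    # Inject temporal information additively into each neighbor feature.
    feats = neighbor_embs + time_embs        # shape (n, d)
    # Bilinear attention scores against the central entity embedding.
    scores = feats @ w @ entity_emb          # shape (n,)
    alpha = softmax(scores)                  # importance weights, sum to 1
    # Weighted aggregation of the time-aware neighbor features.
    return alpha @ feats                     # shape (d,)

rng = np.random.default_rng(0)
d, n = 8, 5
out = time_aware_attention(rng.normal(size=d),
                           rng.normal(size=(n, d)),
                           rng.normal(size=(n, d)),
                           rng.normal(size=(d, d)))
```

In a full model this aggregation would run per GNN layer and the weights `w` would be learned; here everything is random purely to show the shapes involved.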
{"title":"Time-aware structure matching for temporal knowledge graph alignment","authors":"Wei Jia , Ruizhe Ma , Li Yan , Weinan Niu , Zongmin Ma","doi":"10.1016/j.datak.2024.102300","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102300","url":null,"abstract":"<div><p>Entity alignment, aiming at identifying equivalent entity pairs across multiple knowledge graphs (KGs), serves as a vital step for knowledge fusion. As the majority of KGs undergo continuous evolution, existing solutions utilize graph neural networks (GNNs) to tackle entity alignment within temporal knowledge graphs (TKGs). However, this prevailing method often overlooks the consequential impact of relation embedding generation on entity embeddings through inherent structures. In this paper, we propose a novel model named Time-aware Structure Matching based on GNNs (TSM-GNN) that encompasses the learning of both topological and inherent structures. Our key innovation lies in a unique method for generating relation embeddings, which can enhance entity embeddings via inherent structure. Specifically, we utilize the translation property of knowledge graphs to obtain the entity embedding that is mapped into a time-aware vector space. Subsequently, we employ GNNs to learn global entity representation. To better capture the useful information from neighboring relations and entities, we introduce a time-aware attention mechanism that assigns different importance weights to different time-aware inherent structures. Experimental results on three real-world datasets demonstrate that TSM-GNN outperforms several state-of-the-art approaches for entity alignment between TKGs.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.5,"publicationDate":"2024-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140138228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}