
Latest publications in Data & Knowledge Engineering

Static and dynamic techniques for iterative test-driven modelling of Dynamic Condition Response Graphs
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-02-04 · DOI: 10.1016/j.datak.2025.102413 · Volume 157, Article 102413
Axel K.F. Christfort , Vlad Paul Cosma , Søren Debois , Thomas T. Hildebrandt , Tijs Slaats
Test-driven declarative process modelling combines process models with test traces and has been introduced as a means to achieve both the flexibility provided by the declarative approach and the comprehensibility of the imperative approach. Open test-driven modelling adds a notion of context to tests, specifying the activities of concern in the model, and has been introduced as a means to support both iterative test-driven modelling, where the model can be extended without having to change all tests, and unit testing, where tests can define desired properties of parts of the process without needing to reason about the details of the whole process. The openness however makes checking a test more demanding, since actions outside the context are allowed at any point in the test execution and therefore many different traces may validate or invalidate an open test. In this paper we combine previously developed static techniques for effective open test-driven modelling for Dynamic Condition Response Graphs with a novel efficient implementation of dynamic checking of open tests based on alignment checking. We illustrate the static techniques on an example based on a real-life cross-organizational case management system and benchmark the dynamic checking on models and tests of varying size.
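The idea of an open test can be illustrated with a minimal sketch. Assuming, as a deliberate simplification of the paper's semantics, that an open test passes when the observed trace projected onto the test's context activities matches the expected trace, the check reduces to a filter and a comparison; the function names and traces below are invented for illustration, and the paper's alignment-based checking is far more involved.

```python
def project(trace, context):
    """Keep only the activities the open test declares as its context."""
    return [a for a in trace if a in context]

def passes_open_test(observed, expected, context):
    """Out-of-context actions may occur anywhere without affecting the verdict;
    the projected observed trace must match the expected trace exactly."""
    return project(observed, set(context)) == list(expected)
```

For example, a run interleaved with out-of-context `log` actions still passes a test whose context is `register`, `review`, `approve`, while reordering the context activities fails it.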
Citations: 0
Reinforcement learning for optimizing responses in care processes
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-02-03 · DOI: 10.1016/j.datak.2025.102412 · Volume 157, Article 102412
Olusanmi A. Hundogan , Bart J. Verhoef , Patrick Theeven , Hajo A. Reijers , Xixi Lu
Prescriptive process monitoring aims to derive recommendations for optimizing complex processes. While previous studies have successfully used reinforcement learning techniques to derive actionable policies in business processes, care processes present unique challenges due to their dynamic and multifaceted nature. For example, at any stage of a care process, a multitude of actions is possible. In this study, we follow the Reinforcement Learning (RL) approach and present a general method that uses event data to build and train Markov decision processes. We proposed three algorithms, including one that takes the elapsed time into account when transforming an event log into a semi-Markov decision process. We evaluated the RL approach using an aggression incident data set. Specifically, the goal is to optimize staff member actions when clients are displaying different types of aggressive behavior. Q-learning and SARSA are used to find optimal policies. Our results showed that the derived policies align closely with current practices while offering alternative options in specific situations. By employing RL in the context of care processes, we contribute to the ongoing efforts to enhance decision-making and efficiency in dynamic and complex environments.
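The tabular Q-learning step can be sketched on a toy incident process; the states, actions, and rewards below are invented for illustration and are not the study's aggression-incident data set.

```python
import random

def q_learning(transitions, start, episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning over a dict-encoded MDP where
    transitions[state][action] = (next_state, reward); states missing
    from `transitions` are terminal."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        s = start
        while s in transitions:
            actions = list(transitions[s])
            if rng.random() < eps:
                a = rng.choice(actions)  # explore
            else:
                a = max(actions, key=lambda x: q.get((s, x), 0.0))  # exploit
            s2, r = transitions[s][a]
            best_next = max((q.get((s2, a2), 0.0) for a2 in transitions.get(s2, {})),
                            default=0.0)
            old = q.get((s, a), 0.0)
            q[(s, a)] = old + alpha * (r + gamma * best_next - old)
            s = s2
    return q

# Invented toy care-process MDP: talking de-escalates, restraining escalates.
TOY = {
    "incident": {"talk": ("calm", 1.0), "restrain": ("escalated", -1.0)},
    "escalated": {"talk": ("calm", 0.5)},
}
```

On this toy MDP the greedy policy learned in state `incident` prefers `talk`, mirroring the kind of staff-action policy the study derives from event data.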
Citations: 0
Symmetric non negative matrices factorization applied to the detection of communities in graphs and forensic image analysis
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-01-31 · DOI: 10.1016/j.datak.2025.102411 · Volume 157, Article 102411
Gaël Marec , Nédra Mellouli
With the proliferation of data, particularly on social networks, the accuracy of the information becomes uncertain. In this context, a major challenge lies in detecting image manipulations, where alterations are made to deceive observers. Aligning with the anomaly detection issue, recent methods approach the detection of image transformations as a community detection problem within graphs associated with the images. In this study, we propose using a community clustering method based on non-negative symmetric matrix factorization. By examining several experiments detecting alterations in manipulated images, we assess the method’s robustness and discuss potential enhancements. We also present a process for automatically generating visually and semantically coherent forged images. Additionally, we provide a web application to demonstrate this process.
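The factorization at the heart of the method can be sketched with a standard damped multiplicative update for symmetric NMF (A ≈ HHᵀ); this is a generic textbook formulation, not the authors' implementation, and the two-clique toy graph is invented.

```python
import numpy as np

def symnmf(A, k, iters=300, beta=0.5, seed=0):
    """Approximate a nonnegative symmetric similarity matrix A by H @ H.T
    with H >= 0, using a damped multiplicative update; rows of H act as
    soft community memberships."""
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[0], k))
    for _ in range(iters):
        AH = A @ H
        HHtH = H @ (H.T @ H)  # gradient denominator; guarded against zeros
        H = H * (1.0 - beta + beta * AH / np.maximum(HHtH, 1e-12))
    return H

# Invented toy graph: two disconnected 3-node cliques (self-loops included).
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
```

Taking `H.argmax(axis=1)` then assigns each node to a community; on the toy graph the reconstruction error falls below that of the trivial all-zero fit.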
Citations: 0
REDIRE: Extreme REduction DImension for extRactivE Summarization
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-01-26 · DOI: 10.1016/j.datak.2025.102407 · Volume 157, Article 102407
Christophe Rodrigues , Marius Ortega , Aurélien Bossard , Nédra Mellouli
This paper presents an automatic unsupervised summarization model capable of extracting the most important sentences from a corpus. The unsupervised aspect makes it possible to do away with large corpora, made up of documents and their reference summaries, and to directly process documents potentially made up of several thousand words. To extract sentences in a summary, we use pre-trained word embeddings to represent the documents. From this thick cloud of word vectors, we apply an extreme dimension reduction to identify important words, which we group by proximity. Sentences are extracted using linear constraint solving to maximize the information present in the summary. We evaluate the approach on large documents and present very encouraging initial results.
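The overall shape of the pipeline (embed words, reduce dimensions aggressively, group words by proximity, select covering sentences) can be sketched as follows. This is an illustrative reconstruction with invented data: it swaps the paper's linear-constraint solving for a simple greedy coverage step and uses a tiny k-means where the authors' grouping method is unspecified here.

```python
import numpy as np

def summarize(sentences, vectors, k_dims=2, n_groups=3, n_pick=2, seed=0):
    """Reduce word embeddings to k_dims via truncated SVD, group words by
    proximity with a minimal k-means, then greedily pick sentences that
    cover the most word groups."""
    vocab = sorted({w for s in sentences for w in s.split()} & set(vectors))
    X = np.array([vectors[w] for w in vocab], dtype=float)
    # "Extreme" dimension reduction: keep only the top singular directions.
    U, S, _ = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    Z = U[:, :k_dims] * S[:k_dims]
    # Group words by proximity (minimal k-means).
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), n_groups, replace=False)]
    for _ in range(20):
        labels = ((Z[:, None, :] - centers) ** 2).sum(-1).argmin(1)
        centers = np.array([Z[labels == g].mean(0) if np.any(labels == g) else centers[g]
                            for g in range(n_groups)])
    group_of = dict(zip(vocab, labels))
    # Greedy coverage in place of the paper's linear-constraint solving.
    chosen, covered = [], set()
    for _ in range(n_pick):
        best = max((s for s in sentences if s not in chosen),
                   key=lambda s: len({group_of[w] for w in s.split() if w in group_of} - covered))
        chosen.append(best)
        covered |= {group_of[w] for w in best.split() if w in group_of}
    return chosen
```

With toy three-cluster word vectors, the greedy step favors sentences whose words span several groups over sentences confined to one.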
Citations: 0
Logic-infused knowledge graph QA: Enhancing large language models for specialized domains through Prolog integration
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-01-24 · DOI: 10.1016/j.datak.2025.102406 · Volume 157, Article 102406
Aneesa Bashir, Rong Peng, Yongchang Ding
Efficiently answering questions over complex, domain-specific knowledge graphs remains a substantial challenge, as large language models (LLMs) often lack the logical reasoning abilities and particular knowledge required for such tasks. This paper presents a novel framework integrating LLMs with logical programming languages like Prolog for Logic-Infused Knowledge Graph Question Answering (KGQA) in specialized domains. The proposed methodology uses a transformer-based encoder–decoder architecture. An encoder reads the question, and a named entity recognition (NER) module connects entities to the knowledge graph. The extracted entities are fed into a grammar-guided decoder, producing a logical form (Prolog query) that captures the semantic constraints and relationships. The Prolog query is executed over the knowledge graph to perform symbolic reasoning and retrieve relevant answer entities. Comprehensive experiments on the MetaQA benchmark dataset demonstrate the superior performance of this logic-infused method in accurately identifying correct answer entities from the knowledge graph. Even when trained on a limited subset of annotated data, it outperforms state-of-the-art baselines, achieving 89.60% and F1-scores of up to 89.61%, showcasing its effectiveness in enhancing large language models with symbolic reasoning capabilities for specialized question-answering tasks. The seamless integration of LLMs and logical programming enables the proposed framework to reason effectively over complex, domain-specific knowledge graphs, overcoming a key limitation of existing KGQA systems. In specialized domains, the interpretability provided by representing questions as Prolog queries is a valuable asset.
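The final stage of such a pipeline, executing the decoded logical form against the knowledge graph, can be sketched over a toy MetaQA-style triple store. The triples and helper names below are invented illustrations; in the paper a real Prolog engine plays the role of this pattern matcher.

```python
# Invented MetaQA-style triples (subject, relation, object).
FACTS = {
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Inception", "starred_actors", "Leonardo DiCaprio"),
    ("Interstellar", "directed_by", "Christopher Nolan"),
}

def query(subject=None, relation=None, obj=None):
    """Match a single-hop triple pattern; None plays the role of a Prolog variable."""
    return {(s, r, o) for (s, r, o) in FACTS
            if (subject is None or s == subject)
            and (relation is None or r == relation)
            and (obj is None or o == obj)}

def answers(subject, relation):
    """Object bindings, analogous to the Prolog goal ?- directed_by('Inception', X)."""
    return {o for (_, _, o) in query(subject=subject, relation=relation)}
```

Running the pattern in either direction (object unbound or subject unbound) retrieves answer entities symbolically, which is what makes the logical form interpretable.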
Citations: 0
A methodology for the systematic design of storytelling dashboards applied to Industry 4.0
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-01-22 · DOI: 10.1016/j.datak.2025.102410 · Volume 156, Article 102410
Ana Lavalle , Alejandro Maté , Maribel Yasmina Santos , Pedro Guimarães , Juan Trujillo , Antonina Santos
Dashboards are popular tools for presenting key insights to decision-makers by translating large volumes of data into clear information. However, while individual visualizations may effectively answer specific questions, they often fail to connect in a way that conveys the overall narrative, leaving decision-makers without a cohesive understanding of the area under analysis.
This paper presents a novel methodology for the systematic design of holistic dashboards, moving from analytical requirements to storytelling dashboards. Our approach ensures that all visualizations are aligned with the analytical goals of decision-makers. It includes several key steps: capturing analytical requirements through the i* framework; structuring and refining these requirements into a tree model to reflect the decision-maker’s mental analysis; identifying and preparing relevant data; capturing the key concepts and relationships for the composition of the cohesive storytelling dashboard through a novel storytelling conceptual model; finally, implementing and integrating the visualizations into the dashboard, ensuring coherence and alignment with the decision-maker’s needs. Our methodology has been applied in real-world industrial environments. We evaluated its impact through a controlled experiment. The findings show that storytelling dashboards significantly improve data interpretation, reduce misinterpretations, and enhance the overall user experience compared to traditional dashboards.
Citations: 0
Ensuring safety in digital spaces: Detecting code-mixed hate speech in social media posts
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-01-18 · DOI: 10.1016/j.datak.2025.102409 · Volume 156, Article 102409
Pradeep Kumar Roy , Abhinav Kumar
Social networks strive to offer positive content to users, yet a considerable amount of inappropriate material, such as rumors, fake news, and hate speech, persists. Despite significant efforts to detect and prevent hate speech early, it remains widespread due to issues like misspellings and mixed language in posts. To address these challenges, this research utilizes advanced algorithms like CNN, LSTM, and BERT to develop an automated system for detecting hate speech in Telugu-English code-mixed posts. Additionally, we evaluate the effectiveness of data translation and transliteration approaches for detecting hate in mixed-language text. Results indicate that the transliteration approach achieves the highest accuracy, at 75%, surpassing raw and translated data by 1% and 3%, respectively. The proposed system may effectively minimize hate speech and offensive content on social media platforms, resulting in an enhanced user experience. From a managerial perspective, this research presents numerous benefits, such as improved content moderation, optimized resource allocation, data-driven decision-making, enhanced user satisfaction, strengthened reputation management, and greater scalability. These advancements underscore the potential of utilizing advanced technologies to address complex challenges in social media management.
Citations: 0
A survey on big data classification
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-01-11 · DOI: 10.1016/j.datak.2025.102408 · Volume 156, Article 102408
Keerthana G , Sherly Puspha Annabel L
Big data refers to vast volumes of structured and unstructured data that are too large or complex for traditional data-processing methods to handle efficiently. The importance of big data lies in its ability to provide actionable insights and drive decision-making across various industries, such as healthcare, finance, marketing, and government, by enabling more accurate predictions and personalized services. Traditional big data classification approaches, however, often struggle with big data's complexity: they fail to manage high dimensionality, deal with non-linearity, or process data in real time. Effective big data classification requires robust computing infrastructure, scalable storage solutions, and advanced algorithms. This survey provides a thorough assessment of 50 research papers on big data classification, identifying the difficulties current techniques face in processing and classifying data efficiently without substantial computational resources. The analysis covers a variety of scenarios and key points, and the surveyed techniques are categorized as rule-based, deep learning-based, optimization-based, machine learning-based, and so on. Furthermore, the techniques, tools used, year of publication, software employed, and performance metrics are considered in the analysis. Finally, the research gaps and technical problems of these techniques are discussed, motivating the creation of an efficient model for big data classification.
Citations: 0
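The survey does not single out one algorithm, but the scalability requirement it highlights — classifying data too large to hold in memory — can be illustrated with a minimal, self-contained sketch: a linear classifier trained incrementally on mini-batches drawn from a stream. The class, helper, and toy data below are illustrative assumptions, not taken from the survey.

```python
import random

def minibatches(stream, size):
    """Yield fixed-size chunks from a (potentially unbounded) data stream."""
    batch = []
    for row in stream:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

class StreamingPerceptron:
    """Linear classifier updated one mini-batch at a time, so the full
    dataset never has to fit in memory (out-of-core training)."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def partial_fit(self, batch):
        for x, y in batch:                      # y is -1 or +1
            score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
            if y * score <= 0:                  # misclassified: nudge the boundary
                self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
                self.b += self.lr * y

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score > 0 else -1

# Toy stream of 2000 labelled points; the label is the sign of the first feature.
rng = random.Random(0)
stream = (([x1, x2], 1 if x1 > 0 else -1)
          for x1, x2 in ((rng.uniform(-1, 1), rng.uniform(-1, 1))
                         for _ in range(2000)))

clf = StreamingPerceptron(n_features=2)
for batch in minibatches(stream, size=100):     # one chunk in memory at a time
    clf.partial_fit(batch)
```

The same chunked-update pattern underlies the scalable machine-learning-based techniques the survey categorizes; production systems would swap the toy perceptron for a library estimator with an incremental-fit API.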
Textual data augmentation using generative approaches - Impact on named entity recognition tasks
IF 2.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2025-01-10 DOI: 10.1016/j.datak.2024.102403
Danrun Cao , Nicolas Béchet , Pierre-François Marteau , Oussama Ahmia
Industrial applications of Named Entity Recognition (NER) are usually confronted with small and imbalanced corpora. This could harm the performance of trained and finetuned recognition models, especially when they encounter unknown data. In this study we develop three generation-based data enrichment approaches, in order to increase the number of examples of underrepresented entities. We compare the impact of enriched corpora on NER models, using both non-contextual (fastText) and contextual (Bert-like) embedding models to provide discriminant features to a biLSTM-CRF used as an entity classifier. The approach is evaluated on a contract renewal detection task applied to a corpus of calls for tenders. The results show that the proposed data enrichment procedure effectively improves the NER model’s effectiveness when applied on both known and unknown data.
{"title":"Textual data augmentation using generative approaches - Impact on named entity recognition tasks","authors":"Danrun Cao ,&nbsp;Nicolas Béchet ,&nbsp;Pierre-François Marteau ,&nbsp;Oussama Ahmia","doi":"10.1016/j.datak.2024.102403","DOIUrl":"10.1016/j.datak.2024.102403","url":null,"abstract":"<div><div>Industrial applications of Named Entity Recognition (NER) are usually confronted with small and imbalanced corpora. This could harm the performance of trained and finetuned recognition models, especially when they encounter unknown data. In this study we develop three generation-based data enrichment approaches, in order to increase the number of examples of underrepresented entities. We compare the impact of enriched corpora on NER models, using both non-contextual (fastText) and contextual (Bert-like) embedding models to provide discriminant features to a biLSTM-CRF used as an entity classifier. The approach is evaluated on a contract renewal detection task applied to a corpus of calls for tenders. The results show that the proposed data enrichment procedure effectively improves the NER model’s effectiveness when applied on both known and unknown data.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102403"},"PeriodicalIF":2.7,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
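The abstract does not detail the paper's three generation-based enrichment approaches, but the general idea — creating extra training examples for underrepresented entity types — can be sketched with a simple mention-substitution augmenter over a BIO-tagged corpus. The toy corpus, tag scheme, and helper names below are illustrative assumptions, not the authors' method.

```python
import random

def entity_spans(tokens, tags):
    """Extract (start, end, type) spans from a BIO-tagged sentence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):      # sentinel "O" flushes the last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((start, i, etype))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

def augment(corpus, target_type, n_new, seed=0):
    """Make new sentences by swapping in alternative mentions of target_type."""
    rng = random.Random(seed)
    mentions = [tokens[s:e]
                for tokens, tags in corpus
                for s, e, t in entity_spans(tokens, tags) if t == target_type]
    candidates = [(tokens, tags, span)
                  for tokens, tags in corpus
                  for span in entity_spans(tokens, tags) if span[2] == target_type]
    new_sentences = []
    for _ in range(n_new):
        tokens, tags, (s, e, _) = rng.choice(candidates)
        repl = rng.choice(mentions)
        new_tokens = tokens[:s] + repl + tokens[e:]
        new_tags = (tags[:s] + ["B-" + target_type]
                    + ["I-" + target_type] * (len(repl) - 1) + tags[e:])
        new_sentences.append((new_tokens, new_tags))
    return new_sentences

# Tiny illustrative corpus with an underrepresented ORG type.
corpus = [
    (["Acme", "signed", "in", "Paris"], ["B-ORG", "O", "O", "B-LOC"]),
    (["Globex", "Corp", "expanded"], ["B-ORG", "I-ORG", "O"]),
]
extra = augment(corpus, "ORG", n_new=4)
```

In the paper's setting the substitution step would be replaced by generative models; the augmented sentences then feed the biLSTM-CRF classifier alongside the originals.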
Automated mapping between SDG indicators and open data: An LLM-augmented knowledge graph approach
IF 2.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2025-01-03 DOI: 10.1016/j.datak.2024.102405
Wissal Benjira , Faten Atigui , Bénédicte Bucher , Malika Grim-Yefsah , Nicolas Travers
Meeting the Sustainable Development Goals (SDGs) presents a large-scale challenge for all countries. SDGs established by the United Nations provide a comprehensive framework for addressing global issues. To monitor progress towards these goals, we need to develop key performance indicators and integrate and analyze heterogeneous datasets. The definition of these indicators requires the use of existing data and metadata. However, the diversity of data sources and formats raises major issues in terms of structuring and integration. Despite the abundance of open data and metadata, its exploitation remains limited, leaving untapped potential for guiding urban policies towards sustainability. Thus, this paper introduces a novel approach for SDG indicator computation, leveraging the capabilities of Large Language Models (LLMs) and Knowledge Graphs (KGs). We propose a method that combines rule-based filtering with LLM-powered schema mapping to establish semantic correspondences between diverse data sources and SDG indicators, including disaggregation. Our approach integrates these mappings into a KG, which enables indicator computation by querying the graph’s topology. We evaluate our method through a case study focusing on the SDG Indicator 11.7.1 about accessibility of public open spaces. Our experimental results show significant improvements in accuracy, precision, recall, and F1-score compared to traditional schema mapping techniques.
{"title":"Automated mapping between SDG indicators and open data: An LLM-augmented knowledge graph approach","authors":"Wissal Benjira ,&nbsp;Faten Atigui ,&nbsp;Bénédicte Bucher ,&nbsp;Malika Grim-Yefsah ,&nbsp;Nicolas Travers","doi":"10.1016/j.datak.2024.102405","DOIUrl":"10.1016/j.datak.2024.102405","url":null,"abstract":"<div><div>Meeting the Sustainable Development Goals (SDGs) presents a large-scale challenge for all countries. SDGs established by the United Nations provide a comprehensive framework for addressing global issues. To monitor progress towards these goals, we need to develop key performance indicators and integrate and analyze heterogeneous datasets. The definition of these indicators requires the use of existing data and metadata. However, the diversity of data sources and formats raises major issues in terms of structuring and integration. Despite the abundance of open data and metadata, its exploitation remains limited, leaving untapped potential for guiding urban policies towards sustainability. Thus, this paper introduces a novel approach for SDG indicator computation, leveraging the capabilities of Large Language Models (LLMs) and Knowledge Graphs (KGs). We propose a method that combines rule-based filtering with LLM-powered schema mapping to establish semantic correspondences between diverse data sources and SDG indicators, including disaggregation. Our approach integrates these mappings into a KG, which enables indicator computation by querying the graph’s topology. We evaluate our method through a case study focusing on the SDG Indicator 11.7.1 about accessibility of public open spaces. Our experimental results show significant improvements in accuracy, precision, recall, and F1-score compared to traditional schema mapping techniques.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102405"},"PeriodicalIF":2.7,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
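As a rough sketch of the pipeline this abstract describes — schema mapping between source data and indicator concepts, then indicator computation over a graph — the toy code below uses keyword rules as a stand-in for the paper's LLM-powered mapping and a plain dict as the knowledge graph. The column names, keywords, and simplified area-share formula are illustrative assumptions, not the official Indicator 11.7.1 methodology.

```python
def map_columns(columns, concept_keywords):
    """Match source columns to indicator concepts by keyword rules
    (a stand-in here for LLM-powered schema mapping)."""
    mapping = {}
    for col in columns:
        for concept, keywords in concept_keywords.items():
            if any(kw in col.lower() for kw in keywords):
                mapping[col] = concept
    return mapping

# Hypothetical concepts and source columns (not from the paper).
concept_keywords = {
    "open_space_area": ["park", "green", "open_space"],
    "built_up_area":   ["built", "urban_area"],
}
columns = ["park_area_km2", "built_up_km2", "population"]
mapping = map_columns(columns, concept_keywords)

# Load mapped values into a tiny knowledge graph (city -> concept -> value).
kg = {"CityA": {"open_space_area": 12.0, "built_up_area": 120.0}}

def indicator_share_open_space(kg, city):
    """Compute a simplified open-space share by querying the graph."""
    node = kg[city]
    return node["open_space_area"] / (node["open_space_area"] + node["built_up_area"])
```

A real deployment would replace the keyword rules with LLM calls, store the mappings in an RDF or property-graph KG, and express the indicator as a graph query rather than a hard-coded traversal.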