“Detectors Lead, LLMs Follow”: Integrating LLMs and traditional models on implicit hate speech detection to generate faithful and plausible explanations
Pub Date: 2025-11-24 | DOI: 10.1016/j.datak.2025.102535
Greta Damo, Nicolás Benjamín Ocampo, Elena Cabrio, Serena Villata
Social media platforms face a growing challenge in addressing abusive content and hate speech, particularly as traditional natural language processing methods often struggle with detecting nuanced and implicit instances. To tackle this issue, our study enhances Large Language Models (LLMs) in the detection and explanation of implicit hate speech, outperforming classical approaches. We focus on two key objectives: (1) determining whether jointly predicting and generating explanations for why a message is hateful improves LLMs’ accuracy, especially for implicit cases, and (2) evaluating whether incorporating information from BERT-based models can further boost detection and explanation performance. Our method evaluates and enhances LLMs’ ability to detect hate speech and explain their predictions. By combining binary classification (Hate Speech vs. Non-Hate Speech) with natural language explanations, our approach provides clearer insights into why a message is considered hateful, advancing the accuracy and interpretability of hate speech detection.
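The abstract gives no implementation details, but the core "detector leads, LLM follows" idea can be sketched as follows: a fine-tuned classifier's prediction is injected into the prompt of an LLM that is then asked for a final label and an explanation. This is a minimal sketch assuming the Hugging Face transformers pipeline; the stand-in model name, prompt wording, and label names are illustrative assumptions, not the authors' actual setup.

```python
# Illustrative sketch (not the authors' code): a fine-tuned detector produces a label
# and confidence, which are injected into the prompt of an LLM that is asked to
# return a final label plus a natural language explanation.
from transformers import pipeline

# Stand-in classifier; in the paper's setting this would be a BERT-based hate speech detector.
detector = pipeline("text-classification",
                    model="distilbert-base-uncased-finetuned-sst-2-english")

def build_prompt(message: str) -> str:
    pred = detector(message)[0]  # e.g. {"label": "NEGATIVE", "score": 0.93}
    return (
        "You are a content moderation assistant.\n"
        f"Message: {message}\n"
        f"A fine-tuned detector predicted '{pred['label']}' with confidence {pred['score']:.2f}.\n"
        "Taking this prediction into account, answer with:\n"
        "1. A final label: Hate Speech or Non-Hate Speech.\n"
        "2. A short explanation of why the message is or is not hateful.\n"
    )

print(build_prompt("Example message to classify"))
```

The generated prompt would then be sent to whichever LLM performs the joint detection-and-explanation step.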
{"title":"“Detectors Lead, LLMs Follow”: Integrating LLMs and traditional models on implicit hate speech detection to generate faithful and plausible explanations","authors":"Greta Damo , Nicolás Benjamín Ocampo , Elena Cabrio, Serena Villata","doi":"10.1016/j.datak.2025.102535","DOIUrl":"10.1016/j.datak.2025.102535","url":null,"abstract":"<div><div>Social media platforms face a growing challenge in addressing abusive content and hate speech, particularly as traditional natural language processing methods often struggle with detecting nuanced and implicit instances. To tackle this issue, our study enhances Large Language Models (LLMs) in the detection and explanation of implicit hate speech, outperforming classical approaches. We focus on two key objectives: (1) determining whether jointly predicting and generating explanations for why a message is hateful improves LLMs’ accuracy, especially for implicit cases, and (2) evaluating whether incorporating information from BERT-based models can further boost detection and explanation performance. Our method evaluates and enhances LLMs’ ability to detect hate speech and explain their predictions. By combining binary classification (Hate Speech vs. Non-Hate Speech) with natural language explanations, our approach provides clearer insights into why a message is considered hateful, advancing the accuracy and interpretability of hate speech detection.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"162 ","pages":"Article 102535"},"PeriodicalIF":2.7,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145617829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Human vs. Automated data annotation: Labeling the data set for an ML-driven support ticket classifier
Pub Date: 2025-11-23 | DOI: 10.1016/j.datak.2025.102534
Simon Fuchs, Janik Schnellbach, Holger Wittges, Helmut Krcmar
In general, supervised Machine Learning approaches using labeled training data currently promise the best classification accuracy. Data annotation is therefore a key component of most Machine Learning projects. However, creating labels for a training data set is often an elaborate undertaking involving arduous and repetitive work, which is why data scientists often try to minimize the annotation effort by automating the annotation process itself. In this paper, we present and compare two data annotation projects carried out on the same data set of support tickets: one using human annotators and the other using algorithmic Learning Functions in a combination of Active Learning and Weak Supervision. We achieved a weighted confidence score of over 94% for the human-created labels, and up to 92% agreement between the labels of our automated project and those created by human annotators, while the automated approach required only 10% of the data to be human-annotated as its starting input. Additionally, we reproduced the 85% initial human classification accuracy in support ticket distribution reported in previous papers. We close with a reflection on the value of business understanding in data annotation projects and on the problem of ticket ambiguity, together with proposed solutions.
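As a minimal illustration of the label-agreement comparison described above (not the paper's actual pipeline), automated labels can be checked against human labels with raw agreement and a chance-corrected score; the ticket categories and labels below are invented for the example.

```python
# Minimal sketch: comparing automated labels against human labels for the same tickets.
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical labels for ten support tickets (e.g. target support teams).
human_labels     = ["network", "network", "erp", "hardware", "erp",
                    "network", "hardware", "erp", "erp", "network"]
automated_labels = ["network", "erp",     "erp", "hardware", "erp",
                    "network", "hardware", "erp", "erp", "network"]

agreement = accuracy_score(human_labels, automated_labels)   # raw agreement, cf. the 92% above
kappa = cohen_kappa_score(human_labels, automated_labels)    # agreement corrected for chance

print(f"raw agreement: {agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```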
{"title":"Human vs. Automated data annotation: Labeling the data set for an ML-driven support ticket classifier","authors":"Simon Fuchs, Janik Schnellbach, Holger Wittges, Helmut Krcmar","doi":"10.1016/j.datak.2025.102534","DOIUrl":"10.1016/j.datak.2025.102534","url":null,"abstract":"<div><div>In general, supervised Machine Learning approaches using labeled training data currently promise the best classification results with respect to classification accuracy. Therefore, data annotation is a key component of most Machine Learning projects implemented. However, creating labels for a training data set is often an elaborate project involving arduous and repetitive work, which is why data scientists often try to minimize the effort for data annotation by automating the data annotation process itself. In this paper, we present a case study of two data annotation projects on the same data set of support tickets and compare these: one using human annotators and the other using algorithmic Learning Functions in a combination of Active Learning and Weak Supervision. Here, we achieved a weighted confidence score of >94 % for the human-created labels, while also achieving up to 92 % agreement between the labels of our automated project and the labels created by human annotators, with the need for only 10 % of human annotation as the starting input of the automated approach. Additionally, we were able to reproduce the value of 85 % for initial human classification accuracy in support ticket distribution from previous papers. We close with a reflection about the worth of business understanding in data annotation projects and the problem and proposed solutions to ticket ambiguity.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"162 ","pages":"Article 102534"},"PeriodicalIF":2.7,"publicationDate":"2025-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145684462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Quality matters: A decadal systematic exploration of data quality in IoT environment
Pub Date: 2025-11-20 | DOI: 10.1016/j.datak.2025.102533
Tarandeep Kaur, Pankaj Deep Kaur
The proliferation of the Internet of Things (IoT) has led to an unprecedented surge in data generation, making the enhancement of data quality indispensable for unlocking IoT's full potential and enabling intelligent, data-driven decision-making. This systematic literature review (SLR) examines scholarly research from the past decade to unravel the complexities of data quality in IoT. Seven research questions were formulated, an extensive search of relevant academic databases was conducted, and criteria for inclusion and exclusion were defined. Key insights from the selected studies address the research questions, analyzing the multi-faceted landscape of IoT data and its quality. The study explores the data quality dimensions that play a pivotal role in assessing overall data quality, identifies critical gaps and limitations, and offers a roadmap for future research. The comprehensive overview provides a nuanced understanding of the factors influencing data quality and highlights the contributions of various researchers to ensuring it. The study consolidates perspectives on the significance of data quality as perceived by users, while also emphasizing the paramount importance of security and privacy in managing IoT data. The findings of this SLR will provide valuable insights to researchers and practitioners, advancing efforts to maintain robust data quality across IoT ecosystems.
{"title":"Quality matters: A decadal systematic exploration of data quality in IoT environment","authors":"Tarandeep Kaur, Pankaj Deep Kaur","doi":"10.1016/j.datak.2025.102533","DOIUrl":"10.1016/j.datak.2025.102533","url":null,"abstract":"<div><div>The proliferation of the Internet of Things (IoT) has led to an unprecedented surge in data generation, making the enhancement of data quality indispensable for unlocking IoT's full potential and enabling intelligent, data-driven decision-making. This systematic literature review examines scholarly research from the past decade to unravel the complexities of data quality in IoT. Seven research questions have been formulated, an extensive search of relevant academic databases has been conducted, and criteria for inclusion and exclusion have been defined. Key insights from these selected studies address the research questions, analyzing the multi-faceted landscape of IoT data and its quality. This study explores various data quality dimensions that play a pivotal role in assessing overall data quality, identifying critical gaps and limitations, and offering a roadmap for future research. The comprehensive overview provides a nuanced understanding of the factors influencing data quality and highlights the contributions of various researchers to ensure data quality. The study consolidates perspectives on the significance of data quality as perceived by users, while also emphasizing the paramount importance of security and privacy in managing IoT data. The findings of this SLR will provide valuable insights to researchers and practitioners, advancing efforts to maintain robust data quality across IoT ecosystems.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"162 ","pages":"Article 102533"},"PeriodicalIF":2.7,"publicationDate":"2025-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145684460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimized adaptive depression state prediction and severity estimation from twitter data
Pub Date: 2025-11-19 | DOI: 10.1016/j.datak.2025.102532
Pavani Chirasani, Gatram Rama Mohan Babu
Social media sites like Twitter, which offer an abundance of content supplemented with emojis, can be used to identify and treat depression, a widespread mental health issue. Several methods exist for predicting depression, but their outcomes have been unsatisfactory because of inaccurate predictions. The input Twitter emoji data contains many erroneous features, which increases the complexity of predicting depression; these drawbacks have resulted in poor predictions and low accuracy. The proposed work therefore aims to design a novel Zebra-based Longformer Emoji Analysis (ZLEA) for predicting depression. The Twitter emoji database was initially collected from a standard website and provided to a Python environment as input. First, a pre-processing function is run to remove the noisy features present in the training database. The necessary features are then extracted and the depression condition is predicted from the current emoji. Finally, the severity of the depression state is assessed based on the emoji's grade level, and the performance is confirmed against other conventional research using standard metrics.
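The abstract leaves the feature pipeline underspecified; the following is a generic, hypothetical sketch of treating emojis in tweets as classification features, not the ZLEA method itself. The emoji regular expression, example tweets, and toy labels are all assumptions made for the illustration.

```python
# Generic sketch: extract emojis from tweets and use them as features for a classifier.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Crude matcher over common Unicode emoji ranges (illustrative, not exhaustive).
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def emoji_tokens(text):
    return EMOJI_RE.findall(text)

tweets = ["great day with friends 😄☀️",
          "can't get out of bed again 😔",
          "everything feels pointless 😢😢",
          "lovely walk in the park 🌳😊"]
labels = [0, 1, 1, 0]  # 1 = depressive indication, 0 = none (toy labels)

vectorizer = CountVectorizer(tokenizer=emoji_tokens, token_pattern=None)
features = vectorizer.fit_transform(tweets)
model = LogisticRegression().fit(features, labels)

print(vectorizer.get_feature_names_out())                              # the emojis used as features
print(model.predict(vectorizer.transform(["nothing matters anymore 😔"])))
```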
{"title":"Optimized adaptive depression state prediction and severity estimation from twitter data","authors":"Pavani Chirasani , Gatram Rama Mohan Babu","doi":"10.1016/j.datak.2025.102532","DOIUrl":"10.1016/j.datak.2025.102532","url":null,"abstract":"<div><div>Social media sites like Twitter, which offer an abundance of content supplemented with emojis, can be used to identify and treat depression, a widespread mental health issue. Several prediction methods exist to predict the depression. However, the relevant outcome was not good because of inaccurate prediction. The input Twitter emoji data contains more error features, which increases the complexity of predicting depression. These drawbacks resulted in poor prediction and low Accuracy. So, the proposed work aims to design a novel Zebra-based Long former Emoji Analysis (ZLEA) for predicting depression. The Twitter Emoji database was initially collected from the standard website and provided to the Python environment as input. First, the pre-processing function is run to remove the noisy features that are present in the trained database. Moreover, the necessary features were extracted and the depression condition was predicted using the current emoji. Finally, the depression severe state was assessed based on the emoji's grade level, and the performance was confirmed with other conventional research with metrics.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"162 ","pages":"Article 102532"},"PeriodicalIF":2.7,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145617823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Query-based automatic text summarization using query expansion approach
Pub Date: 2025-11-17 | DOI: 10.1016/j.datak.2025.102531
Hiteshwar Kumar Azad
The amount of information available on the Web has grown dramatically and continues to grow daily. This massive amount of Web data poses significant challenges to the reliability and accuracy of current information retrieval systems. The purpose of information retrieval is to discover, within a huge collection of documents, those whose contents match a user-initiated query. Because most users struggle to formulate well-defined queries, query expansion is critical for retrieving the most relevant information, and presenting relevant results concisely remains a significant challenge in this scenario. Automatic text summarization can condense a lengthy document while retaining its informative content and key concepts, and is thus a potential solution to information overload. This paper proposes a query-based automatic text summarization technique that employs query expansion to improve summarization and provide the relevant information in a concise manner. To produce a relevant summary, the method uses query-based extractive summarization, selecting sentences based on the four best features retrieved from each sentence: words are scored by the expanded query's score, and sentences are scored by four important features, namely sentence terms, position, similarity to the first sentence, and proper nouns. Extensive experiments with different ROUGE variants on various evaluation metrics, including precision, recall, and F-score, were carried out on the DUC 2007 dataset, with gains of approximately 44%, 46%, and 45%, respectively, in the best scenario. The suggested approach outperforms both DUC participating systems and cutting-edge approaches in summary generation.
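A rough sketch of the kind of feature-based sentence scoring described above is given below; the equal feature weights, whitespace tokenization, and capitalization-based proper-noun heuristic are simplifications for illustration, not the paper's exact formulation.

```python
# Rough sketch of query-focused extractive summarization with four sentence features:
# overlap with the (expanded) query, position, similarity to the first sentence, and proper nouns.
def tokens(text):
    return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]

def overlap(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def summarize(sentences, expanded_query, top_k=2):
    query_toks = tokens(expanded_query)
    first_toks = tokens(sentences[0])
    scored = []
    for i, sent in enumerate(sentences):
        toks = tokens(sent)
        f_query = overlap(toks, query_toks)                # sentence terms vs. expanded query
        f_position = 1.0 - i / len(sentences)              # earlier sentences score higher
        f_first = overlap(toks, first_toks)                # similarity to the first sentence
        words = sent.split()
        f_proper = sum(w[0].isupper() for w in words[1:]) / max(len(words) - 1, 1)  # crude proper-noun ratio
        scored.append((f_query + f_position + f_first + f_proper, i, sent))
    best = sorted(scored, reverse=True)[:top_k]
    return " ".join(s for _, _, s in sorted(best, key=lambda t: t[1]))  # keep original order

doc = ["NASA launched a new telescope last year.",
       "The telescope observes distant galaxies in infrared light.",
       "Funding debates delayed the project for a decade.",
       "Scientists at NASA expect discoveries about galaxy formation."]
print(summarize(doc, "telescope galaxies infrared NASA observations"))
```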
{"title":"Query-based automatic text summarization using query expansion approach","authors":"Hiteshwar Kumar Azad","doi":"10.1016/j.datak.2025.102531","DOIUrl":"10.1016/j.datak.2025.102531","url":null,"abstract":"<div><div>The amount of information available on the Web has grown dramatically and continues to grow on a daily basis. The massive amount of Web data poses significant challenges to the reliability and accuracy of current information retrieval systems. The purpose of information retrieval is to discover relevant documents within a huge group of documents whose contents match a user-initiated query. Because most users struggle to formulate well-defined queries, the query expansion technique is critical for retrieving the most relevant information. Obtaining relevant results in a concise manner is a significant challenge in this scenario. Automatic text summarization can condense a lengthy document while retaining its informative content and key concepts. It could be a potential solution to information overload. This paper proposed a query-based automatic text summarization technique that employs query expansion to improve text summarization and provide the relevant information in a concise manner. To produce a relevant text summary, this article employs a query-based extractive text summarization method, which involves selecting sentences based on the four best features retrieved from each sentence. In this process, the words are scored by the expanded query’s score, and the sentences are scored by four important features, including sentence terms, position, similarity to the first sentence, and proper noun. Extensive experiments with different ROUGE variants on various evaluation metrics, including precision, recall, and F-score, were carried out on the DUC 2007 dataset, with gains of approximately 44%, 46%, and 45% respectively, in the best scenario. It is observed that the suggested approach outperforms both DUC participatory systems and cutting-edge approaches in summary generation.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"162 ","pages":"Article 102531"},"PeriodicalIF":2.7,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145532425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Debiasing judgmental decisions by providing individual error pattern feedback
Pub Date: 2025-11-15 | DOI: 10.1016/j.datak.2025.102530
Nathalie Balla, Thomas Setzer
We present a Decision Support System (DSS) that provides experts with feedback on their potential personal bias based on their previous error pattern. Feedback is calculated using a knowledge database containing a library of biases and the typical error patterns that suggest them. An error pattern is any identifiable structure of errors. For instance, an inference engine might detect that an expert's forecasts, submitted via a user interface, are consistently too high, regularly exceeding the actual quantities observed later. The engine might then positively evaluate a rule indicating an overestimation bias and provide feedback on the detected error pattern and/or the presumed bias, potentially including further explanations. Because the feedback stems from an expert's own error pattern, it is intended to enhance their self-reflection and support wise consideration of the feedback. We assume that this allows experts to acquire knowledge about their own flawed judgmental heuristics, and that experts are able to apply the feedback systematically and selectively to different decision tasks and thereby reduce their potential bias and error. To test these assumptions, we conduct experiments with the DSS, in which subjects provide point estimations as well as certainty intervals and subsequently receive machine-generated error feedback based on their previous answers. After the feedback, subjects answer further questions. Results indicate that subjects reflect on their own error pattern, apply the feedback selectively to upcoming estimations, and reduce overall bias and error.
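A minimal sketch of one rule such an inference engine might evaluate is shown below: flagging a potential overestimation bias when an expert's forecasts systematically exceed the values observed later. The thresholds and the forecast history are invented for the example, not taken from the paper.

```python
# Minimal sketch of one bias rule: flag potential overestimation when an expert's
# forecasts systematically exceed the actual values observed later.
def overestimation_feedback(forecasts, actuals, share_threshold=0.8, min_cases=5):
    errors = [f - a for f, a in zip(forecasts, actuals)]        # positive error = forecast too high
    too_high_share = sum(e > 0 for e in errors) / len(errors)
    mean_error = sum(errors) / len(errors)
    if len(errors) >= min_cases and too_high_share >= share_threshold:
        return (f"{too_high_share:.0%} of your last {len(errors)} forecasts were above the actual "
                f"value (mean error {mean_error:+.1f}). This pattern may indicate an overestimation bias.")
    return "No systematic error pattern detected so far."

# Hypothetical forecast history of one expert versus the quantities observed later.
print(overestimation_feedback(forecasts=[120, 95, 110, 130, 105, 98],
                              actuals=[100, 90, 100, 110, 100, 95]))
```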
{"title":"Debiasing judgmental decisions by providing individual error pattern feedback","authors":"Nathalie Balla, Thomas Setzer","doi":"10.1016/j.datak.2025.102530","DOIUrl":"10.1016/j.datak.2025.102530","url":null,"abstract":"<div><div>We present a Decision Support System (DSS) that provides experts with feedback on their personal potential bias based on their previous error pattern. Feedback is calculated using a knowledge database containing a library of biases and typical error patterns that suggest them. An error pattern means any identifiable structure of errors. For instance, an inference engine might detect continuously too high forecasts of an expert submitted via a user interface, regularly exceeding the actual quantities observed later. The engine might then positively evaluate a rule indicating an overestimation bias and provide feedback on the detected error pattern and/or the presumed bias, potentially including further explanations. As the feedback stems from an expert’s own error pattern, it intends to enhance their self-reflection and support wise consideration of the feedback. We assume that this allows experts to acquire knowledge about their own flawed judgmental heuristics, that experts are able to apply the feedback systematically and selectively to different decision tasks and to therefore reduce their potential bias and error. To test these assumptions, we conduct experiments with the DSS. Therein, subjects provide point estimations as well as certainty intervals and subsequently receive error feedback given by a machine based on his or her previous answers. After the feedback, subjects answer further questions. Results indicate that subjects reflect on their own error pattern and apply the feedback selectively to further, upcoming estimations and reduce overall bias and error.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"162 ","pages":"Article 102530"},"PeriodicalIF":2.7,"publicationDate":"2025-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145617828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An integrated approach to GDPR-compliant data sharing employing consent, contracts, and licenses
Pub Date: 2025-10-24 | DOI: 10.1016/j.datak.2025.102510
Amar Tauqeer, Tek Raj Chhetri, Robert David, Albin Ahmeti, Anna Fensel
GDPR defines six legal bases, at least one of which must be followed in order to process (or share) personally identifiable data lawfully. Most research today centers on the legal bases of consent and contracts. This limits the legal bases one can select (or use) for data sharing, especially in circumstances where multiple legal bases are needed. For example, one can consent to share data but may want to place restrictions on how it can be used, which requires a license (an extension or add-on to data sharing contracts) in scenarios where digital asset licensing is involved. Overcoming these limitations and enabling data sharing via multiple legal bases requires combining them. However, incorporating additional (or multiple) legal bases, such as licenses (as an add-on to contracts), in a GDPR-compliant manner remains challenging, because combining multiple legal bases requires an understanding of each individual legal basis (a task that is challenging in itself) and designing a system that is both compliant with regulatory requirements and practically pertinent. Therefore, in this paper, we present our semantic-based approach and tool that enables GDPR-compliant data sharing via multiple legal bases: consent and contracts, with licenses as an add-on. This work extends our previous GDPR Contract Compliance Verification (CCV) tool, which enabled GDPR-compliant data sharing via consent and contracts only. We add licenses as a further add-on to contracts, make our previous work more semantically compliant by using SHACL validation for compliance checking, secure the contract signing process with digital signatures, introduce SHACL repairs to automatically fix data inconsistencies, and evaluate the performance of the tool and the SHACL components. We demonstrate the effectiveness of SHACL and the enhancement of the tool with GDPR-compliant data sharing based on multiple legal bases through performance testing.
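The authors' tool itself is not shown here, but SHACL-based compliance checking of the kind described can be illustrated with the pySHACL library; the tiny data-sharing graph and shape below are invented and do not reflect the paper's actual GDPR vocabulary.

```python
# Illustration of SHACL-based compliance checking with pySHACL (toy shapes and data,
# not the authors' actual GDPR model).
from rdflib import Graph
from pyshacl import validate

data_ttl = """
@prefix ex: <http://example.org/> .
ex:sharing1 a ex:DataSharing ;
    ex:hasLegalBasis ex:consent1 .
"""

shapes_ttl = """
@prefix ex: <http://example.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
ex:DataSharingShape a sh:NodeShape ;
    sh:targetClass ex:DataSharing ;
    sh:property [
        sh:path ex:hasLegalBasis ;
        sh:minCount 1 ;          # at least one legal basis must be documented
    ] ;
    sh:property [
        sh:path ex:hasContract ;
        sh:minCount 1 ;          # this toy rule will fail: no contract is attached
    ] .
"""

data_graph = Graph().parse(data=data_ttl, format="turtle")
shapes_graph = Graph().parse(data=shapes_ttl, format="turtle")

conforms, _report_graph, report_text = validate(data_graph, shacl_graph=shapes_graph, inference="none")
print(conforms)      # False: the sharing record has no contract
print(report_text)   # human-readable validation report
```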
{"title":"An integrated approach to GDPR-compliant data sharing employing consent, contracts, and licenses","authors":"Amar Tauqeer , Tek Raj Chhetri , Robert David , Albin Ahmeti , Anna Fensel","doi":"10.1016/j.datak.2025.102510","DOIUrl":"10.1016/j.datak.2025.102510","url":null,"abstract":"<div><div>GDPR defines six legal bases, at least one of which needs to be followed in order to process (or share) personally identifiable data in a lawful manner. Most of the research today is centered around the legal bases of consent and contracts. This limits the options for legal bases that one can select (or use) for data sharing, especially in circumstances where there is a need to use multiple legal bases. For example, one can consent to share data but may want to place restrictions on how it can be used, which requires a license (an extension/add-on to data sharing contracts) in scenarios, where digital assets licensing is involved. Overcoming these limitations and enabling data sharing via multiple legal bases require combining multiple legal bases. However, incorporating additional (or multiple) legal bases, such as licenses (as an add-on to contracts), in a GDPR-compliant manner remains a challenging task. This is because combining multiple legal bases requires an understanding of each individual legal basis—a task challenging in itself—and designing a system in a manner that is both compliant with regulatory requirements and practically pertinent. Therefore, in this paper, we present our semantic-based approach and tool that enables GDPR-compliant data sharing via multiple legal bases, consent, and contracts (using licenses as an add-on). This work extends our previous work, GDPR Contract Compliance Verification (CCV) tool, which enables GDPR-compliant data sharing via consent and contracts only. We add licenses as a further add-on to contracts, make our previous work more semantically compliant by utilizing SHACL validation for compliance checking, secure the contract signing process with digital signatures, introduce SHACL repairs to automatically fix data inconsistencies, and evaluate the performance of the tool and the SHACL components. We demonstrate the effectiveness of SHACL and the enhancement of the tool with GDPR-complaint data sharing based on multiple legal bases by performance testing.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"162 ","pages":"Article 102510"},"PeriodicalIF":2.7,"publicationDate":"2025-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145684540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FKQG: Few-shot question generation from knowledge graph via large language model in-context learning
Pub Date: 2025-10-24 | DOI: 10.1016/j.datak.2025.102528
Ruishen Liu, Shaorong Xie, Xinzhi Wang, Xiangfeng Luo, Hang Yu
Knowledge Base Question Generation (KBQG) focuses on generating natural language questions from a set of triplets and answers, playing a crucial role in applications such as personalized question-answering systems and educational evaluation tools. Most previous work trains models on large-scale datasets, which leads to degraded performance in low-resource scenarios. To address this issue, we propose a novel framework for KBQG using in-context learning with large language models (LLMs) in few-shot settings (FKQG). The key to in-context learning lies in selecting relevant examples to construct prompts that guide the LLM. We therefore introduce two strategies for example selection: (1) extracting the semantic paths of triplets as seeds to organize the data, and (2) using the relation linking to the answer as an additional seed. Based on these seeds, examples are retrieved from the data and reranked by graph edit distance, optimizing the prompt structure. This approach ensures contextually relevant question generation. We evaluate FKQG through extensive experiments on two benchmark datasets. Our framework outperforms existing KBQG models in few-shot scenarios, achieving up to a 4.73% improvement in ROUGE-L. Additionally, FKQG enhances the performance of knowledge-based question-answering systems, yielding a 1.2% increase in Hit@1.
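Reranking candidate in-context examples by structural similarity can be sketched with networkx's graph edit distance; the toy triplet subgraphs below stand in for the semantic-path seeds and are not taken from the paper's benchmarks.

```python
# Toy sketch: rerank candidate in-context examples by the graph edit distance between
# the query's triplet subgraph and each candidate's subgraph (smaller distance = more similar).
import networkx as nx

def triplets_to_graph(triplets):
    g = nx.DiGraph()
    for head, relation, tail in triplets:
        g.add_edge(head, tail, relation=relation)
    return g

query = triplets_to_graph([("Berlin", "capitalOf", "Germany"),
                           ("Germany", "memberOf", "EU")])

candidates = {
    "example_a": triplets_to_graph([("Paris", "capitalOf", "France"),
                                    ("France", "memberOf", "EU")]),
    "example_b": triplets_to_graph([("Einstein", "bornIn", "Ulm")]),
}

ranked = sorted(candidates.items(),
                key=lambda item: nx.graph_edit_distance(query, item[1]))
print([name for name, _ in ranked])   # structurally closer examples come first
```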
{"title":"FKQG: Few-shot question generation from knowledge graph via large language model in-context learning","authors":"Ruishen Liu , Shaorong Xie , Xinzhi Wang , Xiangfeng Luo , Hang Yu","doi":"10.1016/j.datak.2025.102528","DOIUrl":"10.1016/j.datak.2025.102528","url":null,"abstract":"<div><div>Knowledge Base Question Generation (KBQG) focuses on generating natural language questions from a set of triplets and answers, playing a crucial role in applications such as personalized question-answering systems and educational evaluation tools. Currently, most previous work trains the models based on a large-scale dataset, leading to degraded performance in low-resource scenarios. To address this issue, we propose a novel framework for KBQG using in-context learning with large language models (LLMs) in few-shot settings (FKQG). The key to in-context learning lies in selecting relevant examples to construct prompts that guide the LLM. Therefore, we introduce two strategies for example selection: (1) extracting the semantic paths of triplets as seeds to organize the data, and (2) using the relation linking to the answer as an additional seed. Based on these seeds, examples are retrieved from the data and reranked through graph edit distance, optimizing the prompt structure. This approach ensures contextually relevant question generation. We evaluate FKQG through extensive experiments on two benchmark datasets. Our framework outperforms existing KBQG models in few-shot scenarios, achieving up to a 4.73% improvement in ROUGE-L. Additionally, FKQG enhances the performance of knowledge-based question-answering systems, yielding a 1.2% increase in Hit@1.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"161 ","pages":"Article 102528"},"PeriodicalIF":2.7,"publicationDate":"2025-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145416217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A survey of processing methods for different types of concept drift
Pub Date: 2025-10-22 | DOI: 10.1016/j.datak.2025.102525
Shurong Yang, Meng Han, Shineng Zhu, Wenyan Yang, Zhenlong Dai, Juan Li, Jian Ding
In large, dynamic, and continuously changing data streams, the characteristics and categories of data samples change over time, a phenomenon known as concept drift. Concept drift affects the stability and performance of a model; if it is ignored, predictions become inaccurate. The study of multiple data streams has also been attracting increasing attention, and group concept drift is more complicated than concept drift in a single data stream. This paper studies processing methods for group concept drift and, for the first time, classifies the processing methods along three directions: single type, multiple type, and group type. Within the scope of methods designed to address a single type of concept drift, the paper provides a systematic summary based on the classification of drift types, distinguishing between approaches that target real concept drift and those that address virtual concept drift; methods for real concept drift include those handling abrupt, gradual, recurrent, and local concept drift. Methods for dealing with multiple types of concept drift are introduced according to the combinations they address: gradual and abrupt; abrupt, gradual, and incremental; and abrupt, gradual, and recurrent. The paper further analyzes and summarizes methods for solving the group-type concept drift problem. The methods considered are compared and summarized in terms of processing technique, applicable drift type, comparison methods, and advantages and disadvantages.
{"title":"A survey of processing methods for different types of concept drift","authors":"Shurong Yang, Meng Han, Shineng Zhu, Wenyan Yang, Zhenlong Dai, Juan Li, Jian Ding","doi":"10.1016/j.datak.2025.102525","DOIUrl":"10.1016/j.datak.2025.102525","url":null,"abstract":"<div><div>For large, dynamic and continuously changing data streams, the characteristics and categories of data samples change with the increase of time, which is called concept drift. Concept drift has an impact on the stability and performance of the model, if ignored, the prediction will be inaccurate. More and more attention has been paid to the study of multi-data streams. Group concept drift is more complicated than single data stream concept drift. This paper studies the processing methods of group concept drift, and classifies the processing methods from three directions of single type, multiple type and group type for the first time. Within the scope of methods designed to address a single type of concept drift, this paper provides a systematic summary based on the classification of drift types, distinguishing between approaches that target real concept drift and those that address virtual concept drift. Solving real concept drift includes abrupt, gradual, recurrent and local concept drift processing methods. In this paper, the methods of dealing with the concept drift of multiple type are introduced from the aspects of solving the drift of gradual and abrupt, abrupt, gradual and incremental, abrupt, gradual and recurrent. This paper analyzes and summarizes the methods to solve the group type concept drift problem. The methods considered in this paper are compared and summarized from the aspects of processing technology, applicable drift type, comparison methods, advantages and disadvantages.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"161 ","pages":"Article 102525"},"PeriodicalIF":2.7,"publicationDate":"2025-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145465448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An interpretable knowledge recommendation method for civil dispute mediation
Pub Date: 2025-10-17 | DOI: 10.1016/j.datak.2025.102527
Ning Wang, Shibo Cui, Jing Zhang, Runzhe Wang, Yongping Yu
The demand for efficient and fair civil dispute mediation is driving the development of intelligent knowledge recommendation technologies. However, existing approaches face challenges in interpretability and in reasoning over complex relationships. This study proposes an Interpretable Knowledge Recommendation Method (IKRM) that integrates deep learning and multi-hop reasoning to provide precise and transparent decision support for online mediation platforms. First, to address the extraction of specialized terms and intricate relationships in legal texts, we propose a semi-joint extraction method based on pre-trained models, combined with ontology design, to construct a civil dispute knowledge graph that enables hierarchical semantic modeling of legal concepts. Second, we design a hybrid multi-hop reasoning framework that combines neural logic programming for numerical rule-based latent relation mining with cognitive graphs for multi-path reasoning, dynamically generating traceable explanations during path expansion. In experiments on multi-source Chinese legal datasets, IKRM outperforms mainstream baseline models on all key evaluation indicators and exhibits greater reasoning robustness on difficult queries. This study establishes a modular, interpretable, and effective paradigm for legal knowledge recommendation systems and contributes to broader equitable social governance by offering accurate decision assistance for civil dispute mediation in China.
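The multi-hop, traceable-explanation idea can be illustrated as path search over a toy knowledge graph, where the traversed path doubles as the explanation; the entities and relations below are invented and are not the paper's civil-dispute ontology.

```python
# Toy sketch: multi-hop reasoning as path search over a small knowledge graph,
# where the traversed path doubles as a traceable explanation.
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("tenant", "landlord", relation="has_contract_with")
kg.add_edge("landlord", "security_deposit", relation="withholds")
kg.add_edge("security_deposit", "deposit_refund_rule", relation="governed_by")
kg.add_edge("deposit_refund_rule", "mediation_suggestion", relation="supports")

def explain(start, goal, max_hops=4):
    # Each simple path from the query entity to a recommendation is rendered as a hop-by-hop trace.
    for path in nx.all_simple_paths(kg, start, goal, cutoff=max_hops):
        hops = [f"{u} --{kg[u][v]['relation']}--> {v}" for u, v in zip(path, path[1:])]
        yield " ; ".join(hops)

for explanation in explain("tenant", "mediation_suggestion"):
    print(explanation)
```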
{"title":"An interpretable knowledge recommendation method for civil dispute mediation","authors":"Ning Wang, Shibo Cui, Jing Zhang, Runzhe Wang, Yongping Yu","doi":"10.1016/j.datak.2025.102527","DOIUrl":"10.1016/j.datak.2025.102527","url":null,"abstract":"<div><div>The demand for efficient and fair civil dispute mediation is driving the development of intelligent knowledge recommendation technologies. However, existing approaches face challenges in interpretability and reasoning over complex relationships. This study proposes an Interpretable Knowledge Recommendation Method (IKRM) that integrates deep learning and multi-hop reasoning to provide precise and transparent decision support for online mediation platforms. First, to address the extraction of specialized terms and intricate relationships in legal texts, we propose a pre-trained model-based semi-joint extraction method combined with ontology design, constructing a civil dispute knowledge graph that enables hierarchical semantic modeling of legal concepts. Second, we design a hybrid multi-hop reasoning framework that combines neural logic programming for numerical rule-based latent relation mining and cognitive graphs for multi-path reasoning, dynamically generating traceable explanations during path expansion. IKRM performs better than mainstream baseline models in terms of all key evaluation indicators, according to experiments validated using multi-source Chinese legal datasets. It additionally exhibits greater reasoning robustness for difficult queries. This study creates a new paradigm for legal knowledge recommendation systems that is modular, interpretable, and effective. It also contributes to larger social equitable governance by offering accurate decision assistance for civil dispute mediation in China.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"161 ","pages":"Article 102527"},"PeriodicalIF":2.7,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145319484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}