Knowledge graph question generation based on crucial semantic information
Mingtao Zhou, Juxiang Zhou, Jianhou Gan, Jun Wang, Jiatian Mei
Pub Date : 2025-10-17  DOI: 10.1016/j.datak.2025.102529
Data & Knowledge Engineering, vol. 161, Article 102529

Knowledge graph-based question generation (KGQG) aims to generate an answerable, fluent question from a knowledge graph of triples and a target answer. Existing KGQG models that generate questions from knowledge graph subgraphs and target answers neither effectively capture the critical semantic information among the tokens within subgraph nodes and edges nor make full use of the target answers and answer markers, which leads to disfluent and unanswerable questions. To address these problems, we propose a model called knowledge graph question generation based on crucial semantic information (KGQG-CSI). The model uses a critical semantic information encoding module to dynamically learn the significance of tokens within answer-fused nodes and edges, capturing the critical semantic information needed to remedy disfluency. In addition, the target answers and answer markers are fully integrated with the nodes to make the generated questions answerable: an attention mechanism first lets the nodes interact with the target answers, so that answer-related semantics are expressed more accurately, and the nodes processed by the encoding module are then concatenated with the answer markers to reduce ambiguity. Experimental results on two public datasets show that the proposed model outperforms existing methods.
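The node-answer interaction described in the abstract can be sketched as a plain attention-and-concatenate step. This is an illustrative reconstruction, not the authors' implementation: the token embeddings, the mean-style attention pooling, and the 0/1 answer marker are all assumptions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def answer_aware_node(node_tokens, answer_vec, is_answer_node):
    """Weight each token embedding by its attention score against the
    answer vector, pool the tokens, then append a 0/1 answer marker."""
    scores = softmax([dot(t, answer_vec) for t in node_tokens])
    dim = len(node_tokens[0])
    pooled = [sum(w * t[i] for w, t in zip(scores, node_tokens)) for i in range(dim)]
    return pooled + [1.0 if is_answer_node else 0.0]
```

Tokens that align with the answer vector receive higher weight, so the pooled node representation leans toward answer-related semantics before the marker is appended.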
A framework for purpose-guided event logs generation
Andrea Burattin, Barbara Re, Lorenzo Rossi, Francesco Tiezzi
Pub Date : 2025-10-15  DOI: 10.1016/j.datak.2025.102526
Data & Knowledge Engineering, vol. 161, Article 102526

Process mining is a prominent discipline in business process management. It collects a variety of techniques for gathering information from event logs, each fulfilling a different mining purpose. Event logs are always necessary for assessing and validating mining techniques in relation to specific purposes. Unfortunately, event logs are hard to find and usually contain noise that can influence the validity of the results of a mining technique. In this paper, we propose a framework, named purple, for generating, through business model simulation, event logs tailored to different mining purposes, i.e., discovery, what-if analysis, and conformance checking. It supports the simulation of models specified in different languages by projecting their execution onto a common behavioral model, i.e., a labeled transition system. We present eleven instantiations of the framework, implemented in a software tool that is a by-product of this paper. The framework is validated against reference log generators through experiments on the purposes presented in the paper.
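As a toy illustration of the common behavioral model, the sketch below simulates random walks over a hypothetical labeled transition system to produce an event log of traces. The states, labels, and uniform choice policy are invented for the example and do not reflect purple's actual simulation engine.

```python
import random

# Toy labeled transition system: state -> list of (label, next_state).
# An empty successor list marks a final state.
LTS = {
    "s0": [("register", "s1")],
    "s1": [("check", "s2"), ("skip", "s3")],
    "s2": [("approve", "s3")],
    "s3": [],
}

def simulate_trace(lts, start="s0", rng=random):
    """Walk the LTS from the start state, recording edge labels as events."""
    state, trace = start, []
    while lts[state]:
        label, state = rng.choice(lts[state])
        trace.append(label)
    return trace

def generate_log(lts, n_traces, seed=42):
    rng = random.Random(seed)
    return [simulate_trace(lts, rng=rng) for _ in range(n_traces)]
```

A purpose-guided generator would bias the walk (e.g., to cover all edges for discovery, or to inject controlled deviations for conformance checking) rather than choosing uniformly.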
A fine-grained multi-lingual opinion mining method on social media texts using multi-scale fused features-based adaptive residual convolutional LSTM with attention mechanism
D. Kavitha, Ashwin Kumar S, Divya Priya B A, M V Guru Prasadh, Sri Krishna S, Sidh Parakh, Shriram V, Tabish Rashid
Pub Date : 2025-10-06  DOI: 10.1016/j.datak.2025.102524
Data & Knowledge Engineering, vol. 161, Article 102524

Multilingual opinion mining has grown within Natural Language Processing (NLP), particularly in the setting of social media. Social networks have grown exponentially in recent years because they offer valuable data from user-provided content, enabling people to express their opinions and share their ideas on a variety of subjects. Many Sentiment Analysis (SA) strategies have subsequently been developed to gather emotional information from this feedback, but their primary limitations are long training times and reduced accuracy. To address this, this work introduces a multilingual opinion mining model for analyzing public opinions that is useful for businesses and organizations. First, social media texts are collected from standard data sources. The collected texts are pre-processed to remove unnecessary content, including promotional content and spam. Next, features are extracted from the pre-processed text using Bidirectional Encoder Representations from Transformers (BERT), N-grams, and word2vec. BERT understands the sentiment of a word based on its surrounding words, allowing more accurate sentiment detection, particularly in the complex linguistic structures present in different languages. N-grams extract features from multilingual datasets, where different languages may have unique syntactic and semantic structures, and word2vec effectively captures phrases and idioms that convey specific sentiments, which is particularly beneficial in multilingual contexts where expressions vary significantly between languages. The extracted features from BERT, N-grams, and word2vec are fed to the proposed Multi-scale Fused Features based Adaptive Residual Convolutional Long Short-Term Memory with Attention mechanism (MARCLA) for analyzing public opinion in different languages. The sentiments expressed in the complex and varied text data are accurately interpreted by the Residual Conv-LSTM: the multiscale mechanism captures micro- and macro-level linguistic features, and the residual connections combat vanishing gradients, which aids in effectively training the deep network. In addition, the parameters of MARCLA are optimized using a Modified Random Function-based Parrot Optimizer (MRFPO). The proposed model's performance is compared with conventional techniques to validate its ability. The accuracy of the designed MRFPO-MARCLA framework is 95.12 %, higher than conventional frameworks such as CNN, LSTM, CoNBiLSTM, and MARCLA. The experimental findings demonstrate that the developed multilingual opinion mining approach can effectively help organizations monitor sentiment changes and public reactions across different languages.
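A minimal, library-free sketch of two of the feature extractors named above, N-grams and TF-IDF. This is the generic textbook formulation, not the authors' pipeline, and the unsmoothed idf (terms appearing in every document get weight zero) is an assumption.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """docs: list of token lists -> list of {term: tf-idf weight}."""
    n_docs = len(docs)
    df = Counter()                     # document frequency per term
    for doc in docs:
        df.update(set(doc))
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: (c / len(doc)) * math.log(n_docs / df[t])
                    for t, c in tf.items()})
    return out
```

In the multilingual setting, such sparse features would be computed per language and fused with the dense BERT and word2vec vectors before classification.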
Temporal knowledge graph recommendation with sequence-aware and path reasoning
Yuanming Zhang, Ziyou He, Yongbiao Lou, Haixia Long, Fei Gao
Pub Date : 2025-10-03  DOI: 10.1016/j.datak.2025.102522
Data & Knowledge Engineering, vol. 161, Article 102522

Knowledge graph recommendation (KGRec) models not only alleviate the data sparsity and cold-start problems encountered by traditional models but also enhance interpretability and credibility by providing explicit recommendation rationales. Nonetheless, existing KGRec models predominantly concentrate on extracting static structural features of user preferences from the KG, often neglecting dynamic temporal features such as purchase time and click time. This oversight considerably limits recommendation performance. In response, this paper introduces a novel temporal knowledge graph recommendation model (TKGRec) that fully utilizes both dynamic temporal features and static structural features for better recommendation. We construct a temporal KG that encapsulates both static and dynamic user–item interactions. On top of it, we propose a sequence-aware and path reasoning framework, in which the sequence-aware module employs a dual-attention mechanism to distill temporal features from interactions, whereas the path reasoning module uses reinforcement learning to extract path features. The two modules are seamlessly fused and iteratively refined to capture a more holistic understanding of user preferences. Experimental results on three real-world datasets demonstrate that the proposed model significantly outperforms state-of-the-art baselines.
An optimization enabled Hierarchical Attention-Deep LSTM model for sentiment analysis on cloth products from customer rating
Zhijun Chen, Tsungshun Hsieh, Ze Chen
Pub Date : 2025-09-29  DOI: 10.1016/j.datak.2025.102523
Data & Knowledge Engineering, vol. 161, Article 102523

This study introduces a deep learning approach augmented with optimization techniques for sentiment analysis of apparel products, using customer reviews and ratings as foundational data. A review of a clothing item serves as input and undergoes pre-processing, involving stop-word removal and stemming, to eliminate superfluous information. Critical features are then extracted from the pre-processed data to facilitate effective categorization: Term Frequency-Inverse Document Frequency (TF-IDF), SentiWordNet features, positive and negative sentiment scores, the count of capitalized words, and hashtags. Subsequently, feature fusion is performed using the proposed Trend factor smoothing-Siberian Tiger Optimization (TS-STO), designed by integrating trend factor smoothing into the update process of the Siberian Tiger Optimization (STO). Finally, sentiment analysis is carried out by HA-Deep LSTM, which merges a Hierarchical Attention Network with a Deep LSTM. Experimental analysis showed that the presented approach achieved an accuracy of 95.9 %, a sensitivity of 96.1 %, and a specificity of 94.2 %.
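The surface features listed above (sentiment scores, capitalized-word count, hashtag count) can be sketched directly. The tiny lexicons here are toy stand-ins for SentiWordNet scores, and the whitespace tokenization is a simplifying assumption.

```python
def surface_features(text, pos_lexicon, neg_lexicon):
    """Hand-crafted review features: lexicon-based sentiment scores plus
    counts of fully capitalized words and hashtags."""
    tokens = text.split()
    # Normalize for lexicon lookup: strip hashtag marks and punctuation.
    words = [t.strip("#.,!?").lower() for t in tokens]
    return {
        "pos_score": sum(pos_lexicon.get(w, 0.0) for w in words),
        "neg_score": sum(neg_lexicon.get(w, 0.0) for w in words),
        "n_caps": sum(1 for t in tokens if t.isupper() and len(t) > 1),
        "n_hashtags": sum(1 for t in tokens if t.startswith("#")),
    }
```

These scalar features would be concatenated with the TF-IDF vector before the fusion step described in the abstract.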
Secure data storage in multi-cloud environments using lattice-based saber with Diffie-Hellman cryptography and authenticate based on PUF-ECC
R. Iyswarya, R. Anitha
Pub Date : 2025-09-16  DOI: 10.1016/j.datak.2025.102512
Data & Knowledge Engineering, vol. 161, Article 102512

Human life has become highly dependent on data in recent decades, in almost every facet of daily activity, leading to its storage in multi-cloud environments. To ensure data integrity, confidentiality, and privacy, it is essential to protect data from unauthorized access. This paper proposes a novel approach to securing data in multi-cloud environments, covering user authentication and data storage, that uses Lattice-Based Saber Cryptography combined with PUF-ECC and the Enhanced Goose Optimization Algorithm (EGOA). Initial user authentication is achieved through the PUF-ECC digital signature algorithm, which verifies both the user's and the device's identity. Once authenticated, user data is securely transmitted to the cloud server using Lattice-Based Saber post-quantum cryptography combined with the Diffie-Hellman key exchange protocol. The encrypted data is then stored across multiple cloud storage services through a cloud controller using RAM-based chunking. For efficient data retrieval, the EGOA is employed to extract the encrypted data from the clouds. Finally, the data is decrypted with the Lattice-Based Saber decryption algorithm and securely retrieved by the authenticated user. This method enhances both the security and the efficiency of cloud data management and retrieval. Experiments are carried out with the proposed methodologies and compared against existing technologies. The proposed approach achieves an encryption time of 9.68 ms, a key generation time of 4.84 ms, and a block creation time of 1.59 ms, while maintaining a 93.7 % confidentiality rate, a 98 % packet delivery ratio, a transmission delay of 0.026 ms, a throughput of 407.33 MB/s, a jitter of 3.26 ms, and an RTT of 0.17 ms, demonstrating its effectiveness for secure data storage and retrieval in multi-cloud environments.
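The classic Diffie-Hellman exchange underlying the key agreement step can be sketched in a few lines. The prime below is chosen for illustration only (real deployments use standardized groups, e.g. from RFC 3526), and this sketch omits the Saber and PUF-ECC layers entirely.

```python
import secrets

# Illustrative 255-bit prime modulus and generator; not a vetted DH group.
P = 2**255 - 19
G = 2

def dh_keypair():
    """Pick a random private exponent and derive the public value G^priv mod P."""
    priv = secrets.randbelow(P - 2) + 1
    return priv, pow(G, priv, P)

def dh_shared(priv, other_pub):
    """Both parties compute the same secret: (G^a)^b == (G^b)^a (mod P)."""
    return pow(other_pub, priv, P)
```

In the described scheme, the agreed secret would then feed the symmetric layer protecting data in transit to the cloud controller.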
A graph-based model for semantic textual similarity measurement
Van-Tan Bui, Quang-Minh Nguyen, Van-Vinh Nguyen, Duc-Toan Nguyen
Pub Date : 2025-09-12  DOI: 10.1016/j.datak.2025.102509
Data & Knowledge Engineering, vol. 161, Article 102509

Measuring semantic similarity between sentence pairs is a fundamental problem in Natural Language Processing, with applications in various domains, including machine translation, speech recognition, automatic question answering, and text summarization. Despite its significance, accurately assessing semantic similarity remains a challenging task, particularly for underrepresented languages such as Vietnamese. Existing methods have yet to fully leverage the unique linguistic characteristics of Vietnamese for semantic similarity measurement. To address this limitation, we propose GBNet-STS (Graph-Based Network for Semantic Textual Similarity), a novel framework for measuring the semantic similarity of Vietnamese sentence pairs. GBNet-STS integrates lexical-grammatical similarity scores and distributional semantic similarity scores within a multi-layered graph-based model. By capturing different semantic perspectives through multiple interconnected layers, our approach provides a more comprehensive and robust similarity estimation. Experimental results demonstrate that GBNet-STS outperforms traditional methods, achieving state-of-the-art performance in Vietnamese semantic similarity tasks.
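A minimal sketch of blending a lexical score with a distributional score, in the spirit of the fusion GBNet-STS performs. The Jaccard/cosine choice and the fixed mixing weight alpha are assumptions for illustration, not the paper's actual multi-layered graph model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0

def jaccard(a_tokens, b_tokens):
    """Lexical overlap between two token sets."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def fused_similarity(a_tokens, b_tokens, a_vec, b_vec, alpha=0.5):
    """Blend a lexical-grammatical score with a distributional score."""
    return alpha * jaccard(a_tokens, b_tokens) + (1 - alpha) * cosine(a_vec, b_vec)
```

A graph-based model would instead propagate such pairwise scores through interconnected layers rather than averaging them once.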
Pub Date : 2025-09-11DOI: 10.1016/j.datak.2025.102511
MVPT Lakshika, HA Caldera
One of the most important tools for knowledge management is the Knowledge Graph (KG), a multi-relational graph that depicts rich factual information across entities. A KG represents entities as nodes and relations as edges, with each edge represented by a triplet: (head entity, relation, tail entity). The Scoring Function (SF) in a KG quantifies the plausibility of these triplets and is often derived from KG embeddings. However, due to the distinct relational patterns across KGs, an SF that performs well on one KG might fail on another, making the design of optimal SFs a challenging task. This study introduces the concept of an Associative Scoring Function (ASF), which leverages Association Rule Mining (ARM) to discover and incorporate patterns and characteristics of symmetric, asymmetric, inverse, and other relational types within embedded KGs. The ARM technique in ASF uses the FP-Growth algorithm to extract meaningful associations, which is enhanced further through hyperparameter tuning. Extensive experiments on benchmark datasets demonstrate that ASF is KG-independent and performs better than state-of-the-art SFs. These results highlight ASF's potential to generalize across diverse KGs, offering a significant advancement in the KG link prediction task.
{"title":"ASF: A novel associative scoring function for embedded knowledge graph reasoning","authors":"MVPT Lakshika, HA Caldera","doi":"10.1016/j.datak.2025.102511","DOIUrl":"10.1016/j.datak.2025.102511","url":null,"abstract":"<div><div>One of the most important tools for knowledge management is the Knowledge Graph (KG), a multi-relational graph that depicts rich factual information across entities. A KG represents entities as nodes and relations as edges, with each edge represented by a triplet: (head entity, relation, tail entity). The Scoring Function (SF) in a KG quantifies the plausibility of these triplets and is often derived from KG embeddings. However, due to the distinct relational patterns across KGs, an SF that performs well on one KG might fail on another, making the design of optimal SFs a challenging task. This study introduces the concept of an Associative Scoring Function (ASF), which leverages Association Rule Mining (ARM) to discover and incorporate patterns and characteristics of symmetric, asymmetric, inverse, and other relational types within embedded KGs. The ARM technique in ASF uses the FP-Growth algorithm to extract meaningful associations, which is enhanced further through hyperparameter tuning. Extensive experiments on benchmark datasets demonstrate that ASF is KG-independent and performs better than state-of-the-art SFs. These results highlight ASF's potential to generalize across diverse KGs, offering a significant advancement in the KG link prediction task.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"161 ","pages":"Article 102511"},"PeriodicalIF":2.7,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145094813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
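To make the scoring functions the abstract builds on concrete, here is a minimal sketch of a classic embedding-based SF in the translational (TransE) style: a triple (head, relation, tail) is scored by how close head + relation lands to tail in embedding space. The entity names, embedding dimension, and random vectors below are invented for illustration; the paper's ASF replaces this fixed functional form with associations mined by FP-Growth, which is not reproduced here.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE-style scoring function: a triple (h, r, t) is plausible
    when the tail embedding sits near head + relation, so a smaller
    translation distance yields a higher (less negative) score."""
    return -np.linalg.norm(h + r - t)

# Toy embeddings -- names and dimension are illustrative assumptions.
rng = np.random.default_rng(0)
dim = 8
emb = {name: rng.normal(size=dim) for name in ("paris", "capital_of", "france")}

# A tail that exactly equals head + relation receives the best score, 0;
# an unrelated random tail scores strictly lower.
ideal_tail = emb["paris"] + emb["capital_of"]
print(transe_score(emb["paris"], emb["capital_of"], ideal_tail))
print(transe_score(emb["paris"], emb["capital_of"], emb["france"]))
```

The KG-dependence the abstract criticizes is visible here: the additive form hard-codes one relational pattern (translation), which is exactly what an association-rule-driven SF is meant to relax.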
Pub Date : 2025-09-06DOI: 10.1016/j.datak.2025.102493
Juan Trujillo , Ana Lavalle , Alejandro Reina-Reina , Jorge García-Carrasco , Alejandro Maté , Wolfgang Maaß
To this day, the requirements of data warehouses, user visualizations and ML projects have been tackled in an independent manner, ignoring the possible cross-requirements, collective constraints and dependencies between the outputs of the different systems that should be taken into account to ensure a successful analytical project. In this work, we take a holistic approach and propose a methodology that supports modeling and subsequent analysis while taking into account these three aspects. This methodology has several advantages, mainly that (i) it enables us to identify possible conflicts between actors on different tasks that are overlooked if the systems are treated in an isolated manner and (ii) this holistic view enables modeling multi-company systems, where the information or even the analytical results can be provided by third-parties, identifying key participants in federated environments. After presenting the required formalism to carry out this kind of analysis, we showcase it on a real-world running example of the tourism sector.
{"title":"An integrated requirements framework for analytical and AI projects","authors":"Juan Trujillo , Ana Lavalle , Alejandro Reina-Reina , Jorge García-Carrasco , Alejandro Maté , Wolfgang Maaß","doi":"10.1016/j.datak.2025.102493","DOIUrl":"10.1016/j.datak.2025.102493","url":null,"abstract":"<div><div>To this day, the requirements of data warehouses, user visualizations and ML projects have been tackled in an independent manner, ignoring the possible cross-requirements, collective constraints and dependencies between the outputs of the different systems that should be taken into account to ensure a successful analytical project. In this work, we take a holistic approach and propose a methodology that supports modeling and subsequent analysis while taking into account these three aspects. This methodology has several advantages, mainly that (i) it enables us to identify possible conflicts between actors on different tasks that are overlooked if the systems are treated in an isolated manner and (ii) this holistic view enables modeling multi-company systems, where the information or even the analytical results can be provided by third-parties, identifying key participants in federated environments. After presenting the required formalism to carry out this kind of analysis, we showcase it on a real-world running example of the tourism sector.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"161 ","pages":"Article 102493"},"PeriodicalIF":2.7,"publicationDate":"2025-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145094812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
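One way to picture advantage (i) — detecting conflicts between actors that isolated treatment of each system would miss — is to index every actor's constraints by the shared output they target and flag disagreements. This is purely an illustrative sketch, not the paper's formalism; the actor names, outputs, and constraint syntax are invented.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical cross-system requirements: (actor, shared output, constraint).
requirements = [
    ("DW team", "sales_cube", "refresh=daily"),
    ("ML team", "sales_cube", "refresh=hourly"),
    ("Viz team", "dashboard", "latency=low"),
]

def find_conflicts(reqs):
    """Flag pairs of actors that assign different values to the same
    constraint key of the same shared output -- a conflict that only
    surfaces when the systems' requirements are modeled together."""
    by_output = defaultdict(list)
    for actor, output, constraint in reqs:
        key, _, value = constraint.partition("=")
        by_output[(output, key)].append((actor, value))
    conflicts = []
    for (output, key), entries in by_output.items():
        for (a1, v1), (a2, v2) in combinations(entries, 2):
            if v1 != v2:
                conflicts.append((output, key, a1, a2))
    return conflicts

print(find_conflicts(requirements))
```

In a federated, multi-company setting the same index also reveals which participant supplies each contested output, which is the second advantage the abstract names.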
Pub Date : 2025-09-02DOI: 10.1016/j.datak.2025.102508
Ali Norouzifar , Marcus Dees , Wil van der Aalst
Event data extracted from information systems serves as the foundation for process mining, enabling the extraction of insights and identification of improvements. Process discovery focuses on deriving descriptive process models from event logs, which form the basis for conformance checking, performance analysis, and other applications. Traditional process discovery techniques predominantly rely on event logs, often overlooking supplementary information such as domain knowledge and process rules. These rules, which define relationships between activities, can be obtained through automated techniques like declarative process discovery or provided by domain experts based on process specifications. When used as an additional input alongside event logs, such rules have significant potential to guide process discovery. However, leveraging rules to discover high-quality imperative process models, such as BPMN models and Petri nets, remains an underexplored area in the literature. To address this gap, we propose an enhanced framework, IMr, which integrates discovered or user-defined rules into the process discovery workflow via a novel recursive approach. The IMr framework employs a divide-and-conquer strategy, using rules to guide the selection of process structures at each recursion step in combination with the input event log. We evaluate our approach on several real-world event logs and demonstrate that the discovered models better align with the provided rules without compromising their conformance to the event log. Additionally, we show that high-quality rules can improve model quality across well-known conformance metrics. This work highlights the importance of integrating domain knowledge into process discovery, enhancing the quality, interpretability, and applicability of the resulting process models.
{"title":"Rule-guided process discovery","authors":"Ali Norouzifar , Marcus Dees , Wil van der Aalst","doi":"10.1016/j.datak.2025.102508","DOIUrl":"10.1016/j.datak.2025.102508","url":null,"abstract":"<div><div>Event data extracted from information systems serves as the foundation for process mining, enabling the extraction of insights and identification of improvements. Process discovery focuses on deriving descriptive process models from event logs, which form the basis for conformance checking, performance analysis, and other applications. Traditional process discovery techniques predominantly rely on event logs, often overlooking supplementary information such as domain knowledge and process rules. These rules, which define relationships between activities, can be obtained through automated techniques like declarative process discovery or provided by domain experts based on process specifications. When used as an additional input alongside event logs, such rules have significant potential to guide process discovery. However, leveraging rules to discover high-quality imperative process models, such as BPMN models and Petri nets, remains an underexplored area in the literature. To address this gap, we propose an enhanced framework, IMr, which integrates discovered or user-defined rules into the process discovery workflow via a novel recursive approach. The IMr framework employs a divide-and-conquer strategy, using rules to guide the selection of process structures at each recursion step in combination with the input event log. We evaluate our approach on several real-world event logs and demonstrate that the discovered models better align with the provided rules without compromising their conformance to the event log. Additionally, we show that high-quality rules can improve model quality across well-known conformance metrics. This work highlights the importance of integrating domain knowledge into process discovery, enhancing the quality, interpretability, and applicability of the resulting process models.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"161 ","pages":"Article 102508"},"PeriodicalIF":2.7,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145048987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
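The declarative rules the abstract refers to can be checked against an event log in a few lines. The sketch below evaluates a Declare-style response(a, b) constraint ("every occurrence of a is eventually followed by b") over a toy log; it illustrates the kind of rule input IMr consumes, not the IMr algorithm itself, and the log and activity names are invented.

```python
# Hypothetical event log: each trace is the ordered activity sequence of one case.
log = [
    ["register", "check", "approve", "notify"],
    ["register", "check", "reject"],
    ["register", "approve"],
]

def holds_response(trace, a, b):
    """Declare-style response(a, b): after each occurrence of a,
    an occurrence of b must eventually follow in the same trace."""
    expecting = False
    for act in trace:
        if act == a:
            expecting = True
        elif act == b:
            expecting = False
    return not expecting

def rule_support(log, a, b):
    """Fraction of traces satisfying response(a, b) -- a simple basis
    for deciding whether a candidate rule should guide discovery."""
    return sum(holds_response(t, a, b) for t in log) / len(log)

print(rule_support(log, "approve", "notify"))
```

In a divide-and-conquer discovery setting, such rule evaluations can be repeated on each sub-log to steer which process structure (sequence, choice, parallelism, loop) is selected at a recursion step.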