Automating localized learning for cardinality estimation based on XGBoost
Pub Date: 2024-06-01 | DOI: 10.1007/s10115-024-02142-2
Jieming Feng, Zhanhuai Li, Qun Chen, Hailong Liu
For cardinality estimation in a DBMS, building multiple local models instead of one global model can usually improve estimation accuracy and reduce the effort of labeling large amounts of training data. Unfortunately, the existing approach to localized learning requires users to explicitly specify which query patterns a local model can handle. Making these decisions is arduous and error-prone for users; worse still, it limits the usability of local models. In this paper, we propose a localized learning solution for cardinality estimation based on XGBoost, which can automatically build an optimal combination of local models given a query workload. It consists of two phases: 1) model initialization; 2) model evolution. In the first phase, it clusters training data into a set of coarse-grained query pattern groups based on pattern similarity and constructs a separate local model for each group. In the second phase, it iteratively merges and splits clusters, reconstructing local models to identify an optimal combination. We formulate the identification of the optimal combination of local models as a combinatorial optimization problem; since solving it exactly has exponential complexity, we present an efficient heuristic algorithm named MMS (Models Merging and Splitting). Finally, we validate its performance superiority over existing learning alternatives through extensive experiments on real datasets.
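The abstract names the two phases but does not spell out the MMS heuristic; the following minimal sketch is therefore only an illustration under stated assumptions: a greedy pairwise-merge loop driven by a q-error metric, with toy features and hypothetical identifiers throughout (splitting would be handled analogously).

```python
# A minimal sketch, not the paper's algorithm: the merge criterion, the
# q-error metric, and all data here are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from xgboost import XGBRegressor

def fit_local(X, y):
    """Train one local model; return its training q-error (a real system
    would measure error on a held-out validation split)."""
    model = XGBRegressor(n_estimators=50, max_depth=4)
    model.fit(X, y)
    pred = np.maximum(model.predict(X), 1.0)
    true = np.maximum(y, 1.0)
    return np.mean(np.maximum(pred / true, true / pred))

# Phase 1 (model initialization): cluster encoded query patterns coarsely.
X = np.random.rand(1000, 8)                           # toy query-pattern features
y = np.random.randint(1, 10_000, 1000).astype(float)  # toy true cardinalities
labels = KMeans(n_clusters=6, n_init=10).fit_predict(X)
clusters = {c: np.where(labels == c)[0] for c in np.unique(labels)}

# Phase 2 (model evolution): greedily merge two clusters whenever one model
# over their union beats both separate models.
improved = True
while improved and len(clusters) > 1:
    improved = False
    keys = list(clusters)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            merged = np.concatenate([clusters[a], clusters[b]])
            if fit_local(X[merged], y[merged]) < min(
                    fit_local(X[clusters[a]], y[clusters[a]]),
                    fit_local(X[clusters[b]], y[clusters[b]])):
                clusters[a] = merged
                del clusters[b]
                improved = True
                break
        if improved:
            break
print(f"{len(clusters)} local models remain after merging")
```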
{"title":"Automating localized learning for cardinality estimation based on XGBoost","authors":"Jieming Feng, Zhanhuai Li, Qun Chen, Hailong Liu","doi":"10.1007/s10115-024-02142-2","DOIUrl":"https://doi.org/10.1007/s10115-024-02142-2","url":null,"abstract":"<p>For cardinality estimation in DBMS, building multiple local models instead of one global model can usually improve estimation accuracy as well as reducing the effort to label large amounts of training data. Unfortunately, the existing approach of localized learning requires users to explicitly specify which query patterns a local model can handle. Making these decisions is very arduous and error-prone for users; to make things worse, it limits the usability of local models. In this paper, we propose a localized learning solution for cardinality estimation based on XGBoost, which can automatically build an optimal combination of local models given a query workload. It consists of two phases: 1) model initialization; 2) model evolution. In the first phase, it clusters training data into a set of coarse-grained query pattern groups based on pattern similarity and constructs a separate local model for each group. In the second phase, it iteratively merges and splits clusters to identify an optimal combination by reconstructing local models. We formulate the problem of identifying the optimal combination of local models as a combinatorial optimization problem and present an efficient heuristic algorithm, named <b>MMS</b> (<b>M</b>odels <b>M</b>erging and <b>S</b>plitting), for its solution due to its exponential complexity. Finally, we validate its performance superiority over the existing learning alternatives by extensive experiments on real datasets.</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141193279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The analysis of diversification properties of stablecoins through the Shannon entropy measure
Pub Date: 2024-05-30 | DOI: 10.1007/s10115-024-02133-3
Mohavia Ben Amid Sinon, Jules Clement Mba
The common goal of investors is to minimise the risk and maximise the returns on their investments. This is often achieved through diversification, where investors spread their investments across various assets. This study uses the MAD-entropy model to minimise the absolute deviation, maximise the mean return, and maximise the Shannon entropy of the portfolio. The MAD model is used because it is a linear programming model, allowing it to handle large-scale problems and non-normally distributed data. Entropy is added to the MAD model because it better diversifies the asset weights in the portfolios. The analysed portfolios consist of cryptocurrencies, stablecoins, and selected world indices such as the S&P 500 and FTSE obtained from Yahoo Finance. The models found that stablecoins pegged to the US dollar, followed by stablecoins pegged to gold, are better diversifiers for traditional cryptocurrencies and stocks. These results are probably due to their low volatility compared to the other assets. Findings from this study may assist investors, since the MAD-entropy model outperforms the MAD model by providing higher portfolio mean returns with minimal risk. Therefore, crypto investors can design a well-diversified portfolio using MAD-entropy to reduce unsystematic risk. Further research integrating MAD-entropy with machine learning techniques may improve accuracy and risk management.
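A minimal sketch of what a MAD-entropy optimization can look like, assuming the three objectives are combined into one weighted scalarization; the trade-off coefficients lam_ret and lam_ent and the synthetic returns are hypothetical, not the authors' calibration.

```python
# Scalarized MAD-entropy portfolio: minimise MAD, maximise mean return and
# weight entropy, subject to a long-only budget constraint.
import cvxpy as cp
import numpy as np

T, n = 250, 5                                     # observations, assets
R = np.random.normal(0.0005, 0.01, size=(T, n))   # synthetic daily returns
mu = R.mean(axis=0)

w = cp.Variable(n, nonneg=True)                   # long-only weights
mad = cp.sum(cp.abs(R @ w - mu @ w)) / T          # mean absolute deviation
ret = mu @ w                                      # mean portfolio return
ent = cp.sum(cp.entr(w))                          # Shannon entropy of weights

lam_ret, lam_ent = 1.0, 0.01                      # hypothetical trade-offs
problem = cp.Problem(cp.Minimize(mad - lam_ret * ret - lam_ent * ent),
                     [cp.sum(w) == 1])
problem.solve()
print(np.round(w.value, 3))                       # diversified weight vector
```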
{"title":"The analysis of diversification properties of stablecoins through the Shannon entropy measure","authors":"Mohavia Ben Amid Sinon, Jules Clement Mba","doi":"10.1007/s10115-024-02133-3","DOIUrl":"https://doi.org/10.1007/s10115-024-02133-3","url":null,"abstract":"<p>The common goal for investors is to minimise the risk and maximise the returns on their investments. This is often achieved through diversification, where investors spread their investments across various assets. This study aims to use the MAD-entropy model to minimise the absolute deviation, maximise the mean return, and maximise the Shannon entropy of the portfolio. The MAD model is used because it is a linear programming model, allowing it to resolve large-scale problems and nonnormally distributed data. Entropy is added to the MAD model because it can better diversify the weight of assets in the portfolios. The analysed portfolios consist of cryptocurrencies, stablecoins, and selected world indices such as the SP500 and FTSE obtained from Yahoo Finance. The models found that stablecoins pegged to the US dollar, followed by stablecoins pegged to gold, are better diversifiers for traditional cryptocurrencies and stocks. These results are probably due to their low volatility compared to the other assets. Findings from this study may assist investors since the MAD-Entropy model outperforms the MAD model by providing more significant portfolio mean returns with minimal risk. Therefore, crypto investors can design a well-diversified portfolio using MAD entropy to reduce unsystematic risk. Further research integrating mad entropy with machine learning techniques may improve accuracy and risk management.</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141193162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Methods for concept analysis and multi-relational data mining: a systematic literature review
Pub Date: 2024-05-30 | DOI: 10.1007/s10115-024-02139-x
Nicolás Leutwyler, Mario Lezoche, Chiara Franciosi, Hervé Panetto, Laurent Teste, Diego Torres
The massive adoption of the Internet of Things in many industrial areas, together with the requirements of modern services, is posing huge challenges to the field of data mining. Moreover, the semantic interoperability of systems and enterprises requires operating across many different formats, such as ontologies, knowledge graphs, or relational databases, as well as different contexts, such as static, dynamic, or real time. Consequently, supporting this semantic interoperability requires a wide range of knowledge discovery methods with different capabilities suited to the context of distributed architectures (DA). However, to the best of our knowledge, there has been no recent general review of the state of the art of Concept Analysis (CA) and multi-relational data mining (MRDM) methods for knowledge discovery in DA with respect to semantic interoperability. In this work, a systematic literature review on CA and MRDM is conducted, providing a discussion of the characteristics reported in the reviewed papers, supported by a clustering technique based on association rules. Moreover, the review allowed the identification of three research gaps toward a more scalable set of methods in the context of DA and heterogeneous sources.
{"title":"Methods for concept analysis and multi-relational data mining: a systematic literature review","authors":"Nicolás Leutwyler, Mario Lezoche, Chiara Franciosi, Hervé Panetto, Laurent Teste, Diego Torres","doi":"10.1007/s10115-024-02139-x","DOIUrl":"https://doi.org/10.1007/s10115-024-02139-x","url":null,"abstract":"<p>The Internet of Things massive adoption in many industrial areas in addition to the requirement of modern services is posing huge challenges to the field of data mining. Moreover, the semantic interoperability of systems and enterprises requires to operate between many different formats such as ontologies, knowledge graphs, or relational databases, as well as different contexts such as static, dynamic, or real time. Consequently, supporting this semantic interoperability requires a wide range of knowledge discovery methods with different capabilities that answer to the context of <i>distributed architectures</i> (DA). However, to the best of our knowledge there is no general review in recent time about the state of the art of Concept Analysis (CA) and multi-relational data mining (MRDM) methods regarding knowledge discovery in DA considering semantic interoperability. In this work, a systematic literature review on CA and MRDM is conducted, providing a discussion on the characteristics they have according to the papers reviewed, supported by a clusterization technique based on association rules. Moreover, the review allowed the identification of three research gaps toward a more scalable set of methods in the context of DA and heterogeneous sources.</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141193750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Twain-GCN: twain-syntax graph convolutional networks for aspect-based sentiment analysis
Pub Date: 2024-05-30 | DOI: 10.1007/s10115-024-02135-1
Ying Hou, Fang’ai Liu, Xuqiang Zhuang, Yuling Zhang
The goal of aspect-based sentiment analysis is to recognize the aspect information in a text and the corresponding sentiment polarity. A variety of robust methods, including attention mechanisms and convolutional neural networks, have been extensively utilized to tackle this complex task. Previous studies obtained better experimental results by using graph convolutional networks (GCN) based on semantic dependency trees, so many methods have begun to use sentence structure information for this task. However, because sentences may contain complex relations, some approaches capture only loose connections between aspect words and their contexts. To solve this problem, the Twain-Syntax graph convolutional network model is proposed, which can utilize multiple kinds of syntactic structure information simultaneously. Guided by the constituent tree and the dependency tree, rich syntactic information is fully used in the model to build a sentiment-aware context for each aspect. In particular, a multilayer attention mechanism and GCNs are employed to learn to capture the correlations between words. By integrating syntactic information, this approach significantly refines the model's performance. Extensive testing on four benchmark datasets shows that the model delineated in this paper exhibits high levels of efficiency, comparable to several cutting-edge models.
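A toy sketch of the core idea of fusing two syntactic views with plain GCN layers; the sum-based fusion, the dimensions, and the self-loop-only adjacency matrices are assumptions, not the exact Twain-Syntax architecture.

```python
# One GCN per syntactic view (dependency tree, constituent tree), fused by
# summation to build an aspect-aware token representation.
import torch
import torch.nn as nn

class SimpleGCN(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, x, adj):
        deg = adj.sum(-1, keepdim=True).clamp(min=1)   # mean aggregation
        return torch.relu(self.lin(adj @ x / deg))

n_tokens, dim = 7, 32
x = torch.randn(n_tokens, dim)       # contextual word representations
adj_dep = torch.eye(n_tokens)        # dependency-tree edges (toy: self-loops)
adj_con = torch.eye(n_tokens)        # constituent-tree edges (toy)

gcn_dep, gcn_con = SimpleGCN(dim), SimpleGCN(dim)
h = gcn_dep(x, adj_dep) + gcn_con(x, adj_con)   # fuse the two syntax views
aspect_repr = h[2]                   # sentiment-aware vector of an aspect token
print(aspect_repr.shape)
```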
{"title":"Twain-GCN: twain-syntax graph convolutional networks for aspect-based sentiment analysis","authors":"Ying Hou, Fang’ai Liu, Xuqiang Zhuang, Yuling Zhang","doi":"10.1007/s10115-024-02135-1","DOIUrl":"https://doi.org/10.1007/s10115-024-02135-1","url":null,"abstract":"<p>The goal of aspect-based sentiment analysis is to recognize the aspect information in the text and the corresponding sentiment polarity. A variety of robust methods, including attention mechanisms and convolutional neural networks, have been extensively utilized to tackle this complex task. Better experimental results are obtained by using graph convolutional networks (GCN) based on semantic dependency trees in previous studies. Therefore, abundant methods begin to use sentence structure information to complete this task. However, only the loose connection between aspect words and contexts is realized in some practices due to sentences may contain complex relations. To solve this problem, Twain-Syntax graph convolutional network model is proposed, which can utilize multiple syntactic structure information simultaneously. Guided by the constituent tree and dependency tree, rich syntactic information is fully used in the model to build the sentiment-aware context for each aspect. In special, the multilayer attention mechanism and GCN are employed for learning to capture the correlation between words. By integrating syntactic information, this approach significantly refines the model’s technical performance. Extensive testing on four benchmark datasets shows that the model delineated in this paper exhibits high levels of efficiency, comparable to several cutting-edge models.</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141193171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PatchMix: patch-level mixup for data augmentation in convolutional neural networks
Pub Date: 2024-05-30 | DOI: 10.1007/s10115-024-02141-3
Yichao Hong, Yuanyuan Chen
Convolutional neural networks (CNNs) have demonstrated impressive performance in fitting data distributions. However, due to the complexity of learning intricate features from data, networks usually experience overfitting during training. To address this issue, many data augmentation techniques have been proposed to expand the representation of the training data, thereby improving the generalization ability of CNNs. Inspired by jigsaw puzzles, we propose PatchMix, a novel mixup-based augmentation method that applies mixup to patches within an image to extract abundant and varied information from it. At the input level of CNNs, PatchMix can generate a multitude of reliable training samples through an integrated and controllable approach that encompasses cropping, combining, blurring, and more. Additionally, we propose PatchMix-R, which enhances the robustness of the model against perturbations by processing adjacent pixels. Easy to implement, our methods can be integrated with most CNN-based classification models and combined with various data augmentation techniques. The experiments show that PatchMix and PatchMix-R consistently outperform other state-of-the-art methods in terms of accuracy and robustness. Class activation mappings of the trained model are also investigated to visualize the effectiveness of our approach.
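A minimal sketch of a patch-level mixup of the kind the abstract describes: grid cells of one image are swapped into another and the labels are mixed by the retained area. The grid size, swap probability, and label rule are assumptions, not the authors' exact procedure.

```python
# Patch-level mixup: per grid cell, take the patch from image 2 with
# probability p; the soft label follows the area actually kept from image 1.
import numpy as np

def patchmix(img1, img2, y1, y2, grid=4, p=0.5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    h, w = img1.shape[:2]
    ph, pw = h // grid, w // grid
    out = img1.copy()
    mask = rng.random((grid, grid)) < p           # cells taken from img2
    for i in range(grid):
        for j in range(grid):
            if mask[i, j]:
                out[i*ph:(i+1)*ph, j*pw:(j+1)*pw] = \
                    img2[i*ph:(i+1)*ph, j*pw:(j+1)*pw]
    lam = 1.0 - mask.mean()                       # fraction of img1 kept
    return out, lam * y1 + (1 - lam) * y2         # mixup-style soft label

a, b = np.zeros((32, 32, 3)), np.ones((32, 32, 3))
mixed, label = patchmix(a, b, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(label)                                      # e.g. [0.56 0.44]
```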
{"title":"PatchMix: patch-level mixup for data augmentation in convolutional neural networks","authors":"Yichao Hong, Yuanyuan Chen","doi":"10.1007/s10115-024-02141-3","DOIUrl":"https://doi.org/10.1007/s10115-024-02141-3","url":null,"abstract":"<p>Convolutional neural networks (CNNs) have demonstrated impressive performance in fitting data distribution. However, due to the complexity in learning intricate features from data, networks usually experience overfitting during the training. To address this issue, many data augmentation techniques have been proposed to expand the representation of the training data, thereby improving the generalization ability of CNNs. Inspired by jigsaw puzzles, we propose PatchMix, a novel mixup-based augmentation method that applies mixup to patches within an image to extract abundant and varied information from it. At the input level of CNNs, PatchMix can generate a multitude of reliable training samples through an integrated and controllable approach that encompasses cropping, combining, blurring, and more. Additionally, we propose PatchMix-R to enhance the robustness of the model against perturbations by processing adjacent pixels. Easy to implement, our methods can be integrated with most CNN-based classification models and combined with varying data augmentation techniques. The experiments show that PatchMix and PatchMix-R consistently outperform other state-of-the-art methods in terms of accuracy and robustness. Class activation mappings of the trained model are also investigated to visualize the effectiveness of our approach.\u0000</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141193289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large-scale knowledge graph representation learning
Pub Date: 2024-05-29 | DOI: 10.1007/s10115-024-02131-5
Marwa Badrouni, Chaker Katar, Wissem Inoubli
Knowledge graphs have emerged as powerful data structures that provide a deep representation and understanding of the knowledge present in networks. In representation learning of knowledge graphs, entities and relationships undergo an embedding process in which they are mapped onto a vector space of reduced dimension. These embeddings are increasingly used to feed a multitude of machine learning tasks. Nevertheless, the growth of knowledge graph data has introduced a challenge: knowledge graph embeddings now encompass millions of nodes and billions of edges, surpassing the capacities of existing knowledge representation learning systems. In response to this challenge, this paper presents DistKGE, a distributed learning approach to knowledge graph embedding based on a new partitioning technique. In our experimental evaluation, we show that, in terms of runtime on the link prediction task (identifying new links between entities within the knowledge graph), the proposed approach improves the scalability of distributed knowledge graph learning with respect to graph size compared to existing methods.
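The abstract does not describe DistKGE's new partitioning technique, so the sketch below only illustrates the general setting, with a generic hash-by-head-entity shard assignment as a stand-in for where partitioning sits in distributed embedding training.

```python
# Generic triple partitioner: assign each (head, relation, tail) triple to a
# worker shard before the workers train embeddings on their shards.
from collections import defaultdict

triples = [("alice", "knows", "bob"), ("bob", "worksAt", "acme"),
           ("carol", "knows", "alice"), ("acme", "locatedIn", "paris")]

def partition(triples, n_workers):
    shards = defaultdict(list)
    for h, r, t in triples:
        # a real system would use a stable hash; Python's hash() varies per run
        shards[hash(h) % n_workers].append((h, r, t))
    return shards

for worker, shard in sorted(partition(triples, 2).items()):
    print(f"worker {worker} trains embeddings on {shard}")
```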
{"title":"Large-scale knowledge graph representation learning","authors":"Marwa Badrouni, Chaker Katar, Wissem Inoubli","doi":"10.1007/s10115-024-02131-5","DOIUrl":"https://doi.org/10.1007/s10115-024-02131-5","url":null,"abstract":"<p>The knowledge graph emerges as powerful data structures that provide a deep representation and understanding of the knowledge presented in networks. In the pursuit of representation learning of the knowledge graph, entities and relationships undergo an embedding process, where they are mapped onto a vector space with reduced dimensions. These embeddings are progressively used to extract their information for a multitude of tasks in machine learning. Nevertheless, the increase data in knowledge graph has introduced a challenge, especially as knowledge graph embedding now encompass millions of nodes and billions of edges, surpassing the capacities of existing knowledge representation learning systems. In response to these challenge, this paper presents DistKGE, a distributed learning approach of knowledge graph embedding based on a new partitioning technique. In our experimental evaluation, we illustrate that the proposed approach improves the scalability of distributed knowledge graph learning with respect to graph size compared to existing methods in terms of runtime performances in the link prediction task aimed at identifying new links between entities within the knowledge graph.\u0000</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141193176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Markov enhanced graph attention network for spammer detection in online social network
Pub Date: 2024-05-29 | DOI: 10.1007/s10115-024-02137-z
Ashutosh Tripathi, Mohona Ghosh, Kusum Kumari Bharti
Online social networks (OSNs) are an indispensable part of social communication where people connect and share information. Spammers and other malicious actors exploit the OSN's reach to propagate spam content. In an OSN with mutual relations between nodes, two kinds of spammer detection methods can be employed: feature based and propagation based. However, each is incomplete by itself: feature-based methods cannot exploit the mutual connections between nodes, while propagation-based methods cannot utilize the rich, discriminating node features. We propose MEGAT (Markov enhanced graph attention network), a hybrid model using graph attention networks (GAT) and pairwise Markov random fields (pMRF) for the spammer detection task. It efficiently utilizes node features as well as propagation information. We equip our GAT model with the smoother Swish activation function, which has a non-monotonic derivative, instead of the leakyReLU function. Experiments performed on a real-world Twitter Social Honeypot (TwitterSH) benchmark dataset and a subsequent comparative analysis reveal that our proposed MEGAT model outperforms state-of-the-art models in accuracy, precision-recall area under curve (PRAUC), and F1-score.
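The concrete change the abstract highlights is replacing leakyReLU with the smoother Swish activation inside the attention computation; a quick sketch of both (Swish with beta = 1 coincides with SiLU):

```python
# Swish is smooth everywhere and its derivative is non-monotonic, unlike
# leakyReLU, which is kinked at zero.
import torch

def swish(x, beta=1.0):
    return x * torch.sigmoid(beta * x)

x = torch.linspace(-3, 3, 7)
print(torch.nn.functional.leaky_relu(x, 0.2))   # piecewise linear
print(swish(x))                                 # equals torch.nn.functional.silu(x)
```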
{"title":"Markov enhanced graph attention network for spammer detection in online social network","authors":"Ashutosh Tripathi, Mohona Ghosh, Kusum Kumari Bharti","doi":"10.1007/s10115-024-02137-z","DOIUrl":"https://doi.org/10.1007/s10115-024-02137-z","url":null,"abstract":"<p>Online social networks (OSNs) are an indispensable part of social communication where people connect and share information. Spammers and other malicious actors use the OSN’s power to propagate spam content. In an OSN with mutual relations between nodes, two kinds of spammer detection methods can be employed: feature based and propagation based. However, both of these are incomplete in themselves. The feature-based methods cannot exploit mutual connections between nodes, and propagation-based methods cannot utilize the rich discriminating node features. We propose a hybrid model—Markov enhanced graph attention network (MEGAT)—using graph attention networks (GAT) and pairwise Markov random fields (pMRF) for the spammer detection task. It efficiently utilizes node features as well as propagation information. We experiment our GAT model with a smoother <i>Swish</i> activation function having non-monotonic derivatives, instead of the <i>leakyReLU</i> function. The experiments performed on a real-world Twitter Social Honeypot (TwitterSH) benchmark dataset and subsequent comparative analysis reveal that our proposed MEGAT model outperforms the state-of-the-art models in accuracy, precision–recall area under curve (PRAUC), and F1-score performance measures.\u0000</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141167350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Constraining acyclicity of differentiable Bayesian structure learning with topological ordering
Pub Date: 2024-05-29 | DOI: 10.1007/s10115-024-02140-4
Quang-Duy Tran, Phuoc Nguyen, Bao Duong, Thin Nguyen
In Bayesian approaches to structure learning, distributional estimates have advantages over point estimates when handling epistemic uncertainty. Differentiable methods for Bayesian structure learning have been developed to enhance the scalability of the inference process and are achieving promising outcomes. However, in the differentiable continuous setting, constraining the acyclicity of learned graphs emerges as another challenge. Various works utilize post-hoc penalization scores to impose this constraint, which cannot assure acyclicity. The topological ordering of the variables is one type of prior knowledge that contains valuable information about the acyclicity of a directed graph. In this work, we propose a framework that guarantees the acyclicity of inferred graphs by integrating the information from the topological ordering into the inference process. Our integration framework does not interfere with the differentiable inference process, while strictly assuring the acyclicity of learned graphs and reducing the inference complexity. Extensive empirical experiments on both synthetic and real data demonstrate the effectiveness of our approach, with preferable results compared to related Bayesian approaches.
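The guarantee rests on a simple fact: if edges may only point from earlier to later variables in a topological ordering, the adjacency matrix is strictly upper triangular under that permutation, so no cycle can form. A minimal sketch with toy scores and a fixed ordering (the scores and threshold are hypothetical):

```python
# Masking edges to respect a topological ordering makes the learned graph
# acyclic by construction.
import numpy as np

order = [2, 0, 3, 1]                   # given topological ordering
pos = np.argsort(order)                # pos[v] = position of variable v
scores = np.random.rand(4, 4)          # learned edge scores (toy)

mask = pos[:, None] < pos[None, :]     # allow i -> j only if i precedes j
adj = (scores * mask) > 0.5            # thresholded, acyclic by construction

perm = adj[np.ix_(order, order)]       # reorder rows/cols by the ordering
assert np.array_equal(perm, np.triu(perm, k=1))   # strictly upper triangular
```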
{"title":"Constraining acyclicity of differentiable Bayesian structure learning with topological ordering","authors":"Quang-Duy Tran, Phuoc Nguyen, Bao Duong, Thin Nguyen","doi":"10.1007/s10115-024-02140-4","DOIUrl":"https://doi.org/10.1007/s10115-024-02140-4","url":null,"abstract":"<p>Distributional estimates in Bayesian approaches in structure learning have advantages compared to the ones performing point estimates when handling epistemic uncertainty. Differentiable methods for Bayesian structure learning have been developed to enhance the scalability of the inference process and are achieving optimistic outcomes. However, in the differentiable continuous setting, constraining the acyclicity of learned graphs emerges as another challenge. Various works utilize post-hoc penalization scores to impose this constraint which cannot assure acyclicity. The topological ordering of the variables is one type of prior knowledge that contains valuable information about the acyclicity of a directed graph. In this work, we propose a framework to guarantee the acyclicity of inferred graphs by integrating the information from the topological ordering into the inference process. Our integration framework does not interfere with the differentiable inference process while being able to strictly assure the acyclicity of learned graphs and reduce the inference complexity. Our extensive empirical experiments on both synthetic and real data have demonstrated the effectiveness of our approach with preferable results compared to related Bayesian approaches.</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141167553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ensemble multi-view feature set partitioning method for effective multi-view learning
Pub Date: 2024-05-27 | DOI: 10.1007/s10115-024-02114-6
Ritika Singh, Vipin Kumar
Multi-view learning consistently outperforms traditional single-view learning by leveraging multiple perspectives of the data. However, the effectiveness of multi-view learning heavily relies on how the data are partitioned into feature sets. In many cases, different datasets may require different partitioning methods to capture their unique characteristics, making any single partitioning method insufficient. Finding an optimal feature set partitioning (FSP) for each dataset may be time-consuming, and the optimal FSP may still not suffice for all types of datasets. Therefore, this paper presents a novel approach called ensemble multi-view feature set partitioning (EMvFSP) to improve the performance of multi-view learning, a technique that uses multiple data sources to make predictions. The proposed EMvFSP method combines the different views produced by multiple partitioning methods to achieve better classification performance than any single partitioning method alone. Experiments were conducted on 15 structured datasets with varying ratios of samples, features, and labels, and the results show that the proposed EMvFSP method effectively improves classification performance. The paper also includes statistical analyses using Friedman ranking and Holm's procedure to demonstrate the effectiveness of the proposed method. This approach provides a robust solution for multi-view learning that can adapt to different types of datasets and partitioning methods, making it suitable for a wide range of applications.
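A small sketch of the ensemble idea under stated assumptions: two hypothetical partitionings of the feature space, one classifier per view, and soft-voting fusion; the paper's concrete partitioners and combination rule may differ.

```python
# Train one model per view under several feature-set partitionings, then
# average the predicted probabilities across all views.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

partitionings = [
    [list(range(0, 15)), list(range(15, 30))],       # contiguous halves
    [list(range(0, 30, 2)), list(range(1, 30, 2))],  # interleaved features
]

probs = []
for partitioning in partitionings:
    for view in partitioning:                        # one model per view
        clf = LogisticRegression(max_iter=5000).fit(Xtr[:, view], ytr)
        probs.append(clf.predict_proba(Xte[:, view]))

pred = np.mean(probs, axis=0).argmax(axis=1)         # soft-vote across views
print("ensemble accuracy:", (pred == yte).mean())
```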
{"title":"Ensemble multi-view feature set partitioning method for effective multi-view learning","authors":"Ritika Singh, Vipin Kumar","doi":"10.1007/s10115-024-02114-6","DOIUrl":"https://doi.org/10.1007/s10115-024-02114-6","url":null,"abstract":"<p>Multi-view learning consistently outperforms traditional single-view learning by leveraging multiple perspectives of data. However, the effectiveness of multi-view learning heavily relies on how the data are partitioned into feature sets. In many cases, different datasets may require different partitioning methods to capture their unique characteristics, making a single partitioning method insufficient. Finding an optimal feature set partitioning (FSP) for each dataset may be a time-consuming process, and the optimal FSP may still not be sufficient for all types of datasets. Therefore, the paper presents a novel approach called ensemble multi-view feature set partitioning (EMvFSP) to improve the performance of multi-view learning, a technique that uses multiple data sources to make predictions. The proposed EMvFSP method combines the different views produced by multiple partitioning methods to achieve better classification performance than any single partitioning method alone. The experiments were conducted on 15 structured datasets with varying ratios of samples, features, and labels, and the results showed that the proposed EMvFSP method effectively improved classification performance. The paper also includes statistical analyses using Friedman ranking and Holms procedure to demonstrate the effectiveness of the proposed method. This approach provides a robust solution for multi-view learning that can adapt to different types of datasets and partitioning methods, making it suitable for a wide range of applications.</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141167357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How to personalize and whether to personalize? Candidate documents decide
Pub Date: 2024-05-27 | DOI: 10.1007/s10115-024-02138-y
Wenhan Liu, Yujia Zhou, Yutao Zhu, Zhicheng Dou
Personalized search plays an important role in satisfying users' information needs owing to its ability to build user profiles based on users' search histories. Most existing personalized methods build dynamic user profiles by emphasizing query-related historical behaviors rather than treating each historical behavior equally. Sometimes, the ambiguity and shortness of the query make it difficult to understand the underlying query intent exactly, and the query-centric user profiles built in these cases will be biased and inaccurate. In this work, we propose to leverage candidate documents, which contain richer information than the short query text, to help understand the query intent more accurately and thereby improve the quality of user profiles. Specifically, we use candidate documents to better understand the query intent, so that more relevant user behaviors can be selected from the history to build more accurate user profiles. Moreover, by analyzing the differences between candidate documents, we can better control the degree of personalization in the ranking of results. This controlled personalization is also expected to further improve the stability of personalized search, as blind personalization may harm the ranking results. We conduct extensive experiments on two datasets, and the results show that our model significantly outperforms competitive baselines, which confirms the benefit of utilizing candidate documents for personalized web search.
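A toy stand-in for the central idea: rank history behaviors by their similarity to the candidate documents rather than to the short query alone, and keep the most relevant ones; TF-IDF and top-k selection substitute for the paper's learned components.

```python
# Select history behaviors whose text is closest to any candidate document.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

history = ["python list sort", "apple pie recipe", "java sort array"]
candidates = ["How to sort a list in Python using sorted()",
              "Sorting arrays in Java with Arrays.sort"]

vec = TfidfVectorizer().fit(history + candidates)
sims = cosine_similarity(vec.transform(history),
                         vec.transform(candidates)).max(axis=1)
top_history = [h for _, h in sorted(zip(sims, history), reverse=True)[:2]]
print(top_history)    # the behaviors most relevant to the inferred intent
```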
{"title":"How to personalize and whether to personalize? Candidate documents decide","authors":"Wenhan Liu, Yujia Zhou, Yutao Zhu, Zhicheng Dou","doi":"10.1007/s10115-024-02138-y","DOIUrl":"https://doi.org/10.1007/s10115-024-02138-y","url":null,"abstract":"<p>Personalized search plays an important role in satisfying users’ information needs owing to its ability to build user profiles based on users’ search histories. Most of the existing personalized methods built dynamic user profiles by emphasizing query-related historical behaviors rather than treating each historical behavior equally. Sometimes, the ambiguity and short nature of the query make it difficult to understand the potential query intent exactly, and the query-centric user profiles built in these cases will be biased and inaccurate. In this work, we propose to leverage candidate documents, which contain richer information than the short query text, to help understand the query intent more accurately and improve the quality of user profiles afterward. Specifically, we intend to better understand the query intent through candidate documents, so that more relevant user behaviors from history can be selected to build more accurate user profiles. Moreover, by analyzing the differences between candidate documents, we can better control the degree of personalization on the ranking of results. This controlled personalization approach is also expected to further improve the stability of personalized search as blind personalization may harm the ranking results. We conduct extensive experiments on two datasets, and the results show that our model significantly outperforms competitive baselines, which confirms the benefit of utilizing candidate documents for personalized web search.</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141167361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}