
Latest Publications in Knowledge and Information Systems

Automating localized learning for cardinality estimation based on XGBoost
IF 2.7 | CAS Tier 4, Computer Science | Q1 Computer Science | Pub Date: 2024-06-01 | DOI: 10.1007/s10115-024-02142-2
Jieming Feng, Zhanhuai Li, Qun Chen, Hailong Liu

For cardinality estimation in DBMS, building multiple local models instead of one global model can usually improve estimation accuracy as well as reduce the effort of labeling large amounts of training data. Unfortunately, the existing approach to localized learning requires users to explicitly specify which query patterns a local model can handle. Making these decisions is arduous and error-prone for users; worse still, it limits the usability of local models. In this paper, we propose a localized learning solution for cardinality estimation based on XGBoost, which can automatically build an optimal combination of local models given a query workload. It consists of two phases: (1) model initialization and (2) model evolution. In the first phase, it clusters training data into a set of coarse-grained query pattern groups based on pattern similarity and constructs a separate local model for each group. In the second phase, it iteratively merges and splits clusters to identify an optimal combination by reconstructing local models. We formulate identifying the optimal combination of local models as a combinatorial optimization problem and, because of its exponential complexity, present an efficient heuristic algorithm, named MMS (Models Merging and Splitting), for its solution. Finally, we validate its performance superiority over existing learning alternatives through extensive experiments on real datasets.
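To make the two-phase design concrete, below is a minimal sketch of phase 1 (model initialization): encoded query patterns are clustered into coarse groups and one XGBoost regressor is trained per group, with estimation routed through the matching group. The feature encoding, cluster count, and hyperparameters are assumptions for illustration; this is not the paper's MMS algorithm, which additionally merges and splits the groups in phase 2.

```python
# Sketch of localized learning for cardinality estimation (assumed setup:
# queries are pre-encoded as numeric feature vectors, labels are
# log-cardinalities; this is NOT the paper's exact pipeline).
import numpy as np
from sklearn.cluster import KMeans
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.random((1000, 16))   # encoded query patterns (placeholder data)
y = rng.random(1000)         # log-scale cardinality labels (placeholder)

# Phase 1: cluster queries into coarse pattern groups by similarity,
# then fit one local model per group.
k = 4
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
local_models = []
for g in range(k):
    member = clusters.labels_ == g
    local_models.append(
        XGBRegressor(n_estimators=100, max_depth=6).fit(X[member], y[member])
    )

def estimate(query_vec):
    """Route a query to its pattern group's local model."""
    g = clusters.predict(query_vec.reshape(1, -1))[0]
    return local_models[g].predict(query_vec.reshape(1, -1))[0]

print(estimate(X[0]))
```

Phase 2 would then search over merges and splits of these groups, retraining the local models and keeping the combination with the lowest validation error — the search that MMS approximates heuristically.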

Citations: 0
The analysis of diversification properties of stablecoins through the Shannon entropy measure
IF 2.7 | CAS Tier 4, Computer Science | Q1 Computer Science | Pub Date: 2024-05-30 | DOI: 10.1007/s10115-024-02133-3
Mohavia Ben Amid Sinon, Jules Clement Mba

The common goal for investors is to minimise the risk and maximise the returns on their investments. This is often achieved through diversification, where investors spread their investments across various assets. This study uses the MAD-entropy model to minimise the absolute deviation, maximise the mean return, and maximise the Shannon entropy of the portfolio. The MAD model is used because it is a linear programming model, allowing it to handle large-scale problems and non-normally distributed data. Entropy is added to the MAD model because it better diversifies the asset weights in the portfolios. The analysed portfolios consist of cryptocurrencies, stablecoins, and selected world indices, such as the S&P 500 and FTSE, obtained from Yahoo Finance. The models found that stablecoins pegged to the US dollar, followed by stablecoins pegged to gold, are better diversifiers for traditional cryptocurrencies and stocks, probably owing to their low volatility compared with the other assets. Findings from this study may assist investors, since the MAD-entropy model outperforms the MAD model by providing higher portfolio mean returns at minimal risk. Therefore, crypto investors can design a well-diversified portfolio using MAD entropy to reduce unsystematic risk. Further research integrating MAD entropy with machine learning techniques may improve accuracy and risk management.
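As a rough illustration of the objective described above, the sketch below minimises the portfolio's mean absolute deviation while rewarding mean return and the Shannon entropy of the weights; the trade-off coefficients, the synthetic return matrix, and the long-only constraint are assumptions, not the paper's calibration.

```python
# Illustrative MAD-entropy portfolio optimisation (assumed trade-offs).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
R = rng.normal(0.001, 0.02, size=(250, 5))   # daily returns, 5 assets
mu = R.mean(axis=0)

def objective(w, lam_ret=1.0, lam_ent=0.1):
    port = R @ w
    mad = np.mean(np.abs(port - port.mean()))   # mean absolute deviation
    wc = np.clip(w, 1e-12, 1.0)                 # guard the log
    entropy = -np.sum(wc * np.log(wc))          # Shannon entropy of weights
    return mad - lam_ret * (mu @ w) - lam_ent * entropy

n = R.shape[1]
res = minimize(
    objective,
    x0=np.full(n, 1.0 / n),                     # start from equal weights
    bounds=[(0.0, 1.0)] * n,                    # long-only portfolio
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
    method="SLSQP",
)
print(res.x.round(3))   # entropy term pushes weights toward diversification
```

Because MAD is piecewise linear, the pure MAD model can be posed as a linear program; the entropy term above makes the objective nonlinear, hence the general-purpose solver in this sketch.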

Citations: 0
Methods for concept analysis and multi-relational data mining: a systematic literature review
IF 2.7 | CAS Tier 4, Computer Science | Q1 Computer Science | Pub Date: 2024-05-30 | DOI: 10.1007/s10115-024-02139-x
Nicolás Leutwyler, Mario Lezoche, Chiara Franciosi, Hervé Panetto, Laurent Teste, Diego Torres

The massive adoption of the Internet of Things in many industrial areas, together with the requirements of modern services, is posing huge challenges to the field of data mining. Moreover, the semantic interoperability of systems and enterprises requires operating across many different formats, such as ontologies, knowledge graphs, or relational databases, as well as different contexts, such as static, dynamic, or real time. Consequently, supporting this semantic interoperability requires a wide range of knowledge discovery methods with different capabilities suited to the context of distributed architectures (DA). However, to the best of our knowledge, there has been no recent general review of the state of the art in Concept Analysis (CA) and multi-relational data mining (MRDM) methods for knowledge discovery in DA that considers semantic interoperability. In this work, a systematic literature review on CA and MRDM is conducted, discussing the characteristics these methods exhibit according to the papers reviewed, supported by a clustering technique based on association rules. Moreover, the review allowed the identification of three research gaps on the way toward a more scalable set of methods in the context of DA and heterogeneous sources.

Citations: 0
Twain-GCN: twain-syntax graph convolutional networks for aspect-based sentiment analysis
IF 2.7 | CAS Tier 4, Computer Science | Q1 Computer Science | Pub Date: 2024-05-30 | DOI: 10.1007/s10115-024-02135-1
Ying Hou, Fang’ai Liu, Xuqiang Zhuang, Yuling Zhang

The goal of aspect-based sentiment analysis is to recognize the aspect information in a text and the corresponding sentiment polarity. A variety of robust methods, including attention mechanisms and convolutional neural networks, have been extensively utilized to tackle this complex task. Previous studies obtained better experimental results using graph convolutional networks (GCNs) based on semantic dependency trees, so many methods have begun to use sentence structure information to complete this task. However, because sentences may contain complex relations, some approaches capture only a loose connection between aspect words and their contexts. To solve this problem, the Twain-Syntax graph convolutional network model is proposed, which can utilize multiple kinds of syntactic structure information simultaneously. Guided by the constituent tree and the dependency tree, rich syntactic information is fully used in the model to build a sentiment-aware context for each aspect. In particular, a multilayer attention mechanism and GCNs are employed to learn the correlations between words. By integrating syntactic information, this approach significantly refines the model's performance. Extensive testing on four benchmark datasets shows that the model described in this paper is highly effective, comparable to several cutting-edge models.
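The basic building block referenced here, syntax-guided graph convolution, can be sketched in a few lines: word vectors are propagated over the adjacency matrix of a dependency tree. The single layer and toy dimensions below are assumptions; the paper's model combines two syntactic views (constituent and dependency trees) with multilayer attention on top.

```python
# One GCN propagation step over a dependency-tree adjacency (sketch).
import numpy as np

def gcn_layer(H, A, W):
    """H: (n, d) word features; A: (n, n) adjacency; W: (d, d') weights."""
    A_hat = A + np.eye(A.shape[0])                    # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0.0)  # ReLU

# Toy sentence of 4 words with dependency edges (0-1), (1-2), (1-3).
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (1, 3)]:
    A[i, j] = A[j, i] = 1.0                           # symmetrised tree
rng = np.random.default_rng(0)
H, W = rng.random((4, 8)), rng.random((8, 8))
print(gcn_layer(H, A, W).shape)                       # (4, 8)
```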

Citations: 0
PatchMix: patch-level mixup for data augmentation in convolutional neural networks
IF 2.7 | CAS Tier 4, Computer Science | Q1 Computer Science | Pub Date: 2024-05-30 | DOI: 10.1007/s10115-024-02141-3
Yichao Hong, Yuanyuan Chen

Convolutional neural networks (CNNs) have demonstrated impressive performance in fitting data distributions. However, owing to the complexity of learning intricate features from data, networks usually experience overfitting during training. To address this issue, many data augmentation techniques have been proposed to expand the representation of the training data, thereby improving the generalization ability of CNNs. Inspired by jigsaw puzzles, we propose PatchMix, a novel mixup-based augmentation method that applies mixup to patches within an image to extract abundant and varied information from it. At the input level of CNNs, PatchMix can generate a multitude of reliable training samples through an integrated and controllable approach that encompasses cropping, combining, blurring, and more. Additionally, we propose PatchMix-R to enhance the robustness of the model against perturbations by processing adjacent pixels. Easy to implement, our methods can be integrated with most CNN-based classification models and combined with various data augmentation techniques. The experiments show that PatchMix and PatchMix-R consistently outperform other state-of-the-art methods in terms of accuracy and robustness. Class activation mappings of the trained model are also investigated to visualize the effectiveness of our approach.
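A minimal version of the patch-level mixing idea is easy to write down: cut two images into a grid of patches, swap patches at random, and soften the label by the fraction of patches kept. The grid size, swap probability, and label rule below are assumptions, and the sketch omits the cropping and blurring components of the full method.

```python
# Patch-level mixup sketch (assumed 4x4 grid and 0.5 swap probability).
import numpy as np

def patch_mix(img_a, img_b, label_a, label_b, grid=4, p=0.5, rng=None):
    """img_*: (H, W, C) arrays; H and W must be divisible by `grid`."""
    rng = rng if rng is not None else np.random.default_rng()
    out = img_a.copy()
    H, W = img_a.shape[:2]
    ph, pw = H // grid, W // grid
    kept = 0
    for i in range(grid):
        for j in range(grid):
            ys = slice(i * ph, (i + 1) * ph)
            xs = slice(j * pw, (j + 1) * pw)
            if rng.random() < p:
                out[ys, xs] = img_b[ys, xs]   # take this patch from image B
            else:
                kept += 1
    lam = kept / (grid * grid)                # fraction coming from image A
    return out, lam * label_a + (1 - lam) * label_b

a, b = np.zeros((32, 32, 3)), np.ones((32, 32, 3))
mixed, label = patch_mix(a, b, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(label)    # soft label proportional to the surviving patches
```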

Citations: 0
Large-scale knowledge graph representation learning
IF 2.7 | CAS Tier 4, Computer Science | Q1 Computer Science | Pub Date: 2024-05-29 | DOI: 10.1007/s10115-024-02131-5
Marwa Badrouni, Chaker Katar, Wissem Inoubli

Knowledge graphs have emerged as powerful data structures that provide a deep representation and understanding of the knowledge present in networks. In representation learning over a knowledge graph, entities and relationships undergo an embedding process in which they are mapped onto a vector space of reduced dimension, and these embeddings are increasingly used to extract information for a multitude of machine learning tasks. Nevertheless, the growth of knowledge graph data has introduced a challenge: knowledge graph embeddings now encompass millions of nodes and billions of edges, surpassing the capacities of existing knowledge representation learning systems. In response to this challenge, this paper presents DistKGE, a distributed learning approach for knowledge graph embedding based on a new partitioning technique. In our experimental evaluation, we show that, in terms of runtime performance on the link prediction task (identifying new links between entities within the knowledge graph), the proposed approach improves the scalability of distributed knowledge graph learning with respect to graph size compared to existing methods.
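Two ingredients of any system in this space can be sketched compactly: a scoring function over embeddings and a way to split triples across workers. Both choices below, a TransE-style score and a naive hash partitioner, are placeholders; the abstract does not fix the scoring function, and the hash split merely stands in for the paper's new partitioning technique.

```python
# Knowledge graph embedding + triple partitioning, toy version.
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_relations, dim = 100, 10, 16
E = rng.normal(size=(n_entities, dim))   # entity embeddings
R = rng.normal(size=(n_relations, dim))  # relation embeddings

def score(h, r, t):
    """TransE-style score: higher (less negative) = more plausible."""
    return -np.linalg.norm(E[h] + R[r] - E[t])

def partition(triples, n_workers):
    """Naive head-hash split across workers; DistKGE replaces this step."""
    shards = [[] for _ in range(n_workers)]
    for h, r, t in triples:
        shards[h % n_workers].append((h, r, t))
    return shards

triples = [(rng.integers(n_entities), rng.integers(n_relations),
            rng.integers(n_entities)) for _ in range(1000)]
print([len(s) for s in partition(triples, 4)], score(*triples[0]))
```

The quality of the partition matters because cross-shard edges force embedding synchronization between workers, which is the kind of cost a better partitioning technique aims to reduce.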

Citations: 0
Markov enhanced graph attention network for spammer detection in online social network
IF 2.7 | CAS Tier 4, Computer Science | Q1 Computer Science | Pub Date: 2024-05-29 | DOI: 10.1007/s10115-024-02137-z
Ashutosh Tripathi, Mohona Ghosh, Kusum Kumari Bharti

Online social networks (OSNs) are an indispensable part of social communication, where people connect and share information. Spammers and other malicious actors exploit an OSN's reach to propagate spam content. In an OSN with mutual relations between nodes, two kinds of spammer detection methods can be employed: feature based and propagation based. However, each is incomplete on its own: feature-based methods cannot exploit the mutual connections between nodes, and propagation-based methods cannot utilize the rich, discriminating node features. We propose a hybrid model, the Markov enhanced graph attention network (MEGAT), which combines graph attention networks (GAT) and pairwise Markov random fields (pMRF) for the spammer detection task, efficiently utilizing node features as well as propagation information. We equip our GAT model with the smoother Swish activation function, which has non-monotonic derivatives, instead of the leakyReLU function. Experiments on a real-world Twitter Social Honeypot (TwitterSH) benchmark dataset and a subsequent comparative analysis reveal that the proposed MEGAT model outperforms state-of-the-art models in accuracy, precision–recall area under curve (PRAUC), and F1-score.
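The activation swap mentioned above is small but concrete: Swish, x·sigmoid(x), replaces leakyReLU when turning attention logits into scores. Below is a framework-free, single-head sketch; the shapes and the scoring layout follow the standard GAT formulation and are assumptions, not the paper's full MEGAT model.

```python
# GAT-style attention with Swish instead of leakyReLU (toy, single head).
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))        # smooth, non-monotonic activation

def gat_attention(H, A, W, a):
    """H: (n, d) features; A: (n, n) adjacency; W: (d, d'); a: (2*d',)."""
    Z = H @ W
    n = Z.shape[0]
    logits = np.full((n, n), -np.inf)    # -inf masks non-neighbours
    for i in range(n):
        for j in range(n):
            if A[i, j] or i == j:
                logits[i, j] = swish(a @ np.concatenate([Z[i], Z[j]]))
    alpha = np.exp(logits - logits.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)   # row-wise softmax
    return alpha @ Z                             # attention-weighted mix

rng = np.random.default_rng(0)
A = (rng.random((5, 5)) < 0.4).astype(float)
out = gat_attention(rng.random((5, 8)), A, rng.random((8, 4)), rng.random(8))
print(out.shape)    # (5, 4)
```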

Citations: 0
Constraining acyclicity of differentiable Bayesian structure learning with topological ordering
IF 2.7 | CAS Tier 4, Computer Science | Q1 Computer Science | Pub Date: 2024-05-29 | DOI: 10.1007/s10115-024-02140-4
Quang-Duy Tran, Phuoc Nguyen, Bao Duong, Thin Nguyen

When handling epistemic uncertainty in structure learning, Bayesian approaches that produce distributional estimates have advantages over those performing point estimates. Differentiable methods for Bayesian structure learning have been developed to improve the scalability of the inference process and are achieving promising results. However, in the differentiable continuous setting, constraining the acyclicity of learned graphs emerges as another challenge. Various works impose this constraint with post hoc penalization scores, which cannot guarantee acyclicity. The topological ordering of the variables is one type of prior knowledge that contains valuable information about the acyclicity of a directed graph. In this work, we propose a framework that guarantees the acyclicity of inferred graphs by integrating the information from the topological ordering into the inference process. Our integration framework does not interfere with the differentiable inference process while strictly ensuring the acyclicity of learned graphs and reducing the inference complexity. Extensive empirical experiments on both synthetic and real data demonstrate the effectiveness of our approach, with preferable results compared to related Bayesian approaches.
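The guarantee the topological ordering provides is easy to see in code: if edges are only permitted from earlier to later positions in a fixed ordering, no cycle can form. In the sketch below the ordering is simply given; the paper's contribution is integrating it into the differentiable inference process, which this fragment does not attempt.

```python
# Masking a weighted adjacency with a topological ordering => DAG.
import numpy as np

def mask_to_dag(W, order):
    """Zero every edge that goes against the topological order."""
    n = len(order)
    pos = np.empty(n, dtype=int)
    pos[order] = np.arange(n)              # pos[v] = rank of node v
    allowed = pos[:, None] < pos[None, :]  # edge i -> j ok iff i precedes j
    return W * allowed

W = np.random.default_rng(0).random((4, 4))   # dense candidate weights
print(mask_to_dag(W, order=np.array([2, 0, 3, 1])))
# Only "forward" edges survive, so the graph is acyclic by construction.
```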

Citations: 0
Ensemble multi-view feature set partitioning method for effective multi-view learning
IF 2.7 | CAS Tier 4, Computer Science | Q1 Computer Science | Pub Date: 2024-05-27 | DOI: 10.1007/s10115-024-02114-6
Ritika Singh, Vipin Kumar

Multi-view learning consistently outperforms traditional single-view learning by leveraging multiple perspectives of the data. However, the effectiveness of multi-view learning heavily relies on how the data are partitioned into feature sets. Different datasets may require different partitioning methods to capture their unique characteristics, making any single partitioning method insufficient. Finding an optimal feature set partitioning (FSP) for each dataset can be time-consuming, and even the optimal FSP may not suit all types of datasets. Therefore, this paper presents a novel approach called ensemble multi-view feature set partitioning (EMvFSP) to improve the performance of multi-view learning. The proposed EMvFSP method combines the different views produced by multiple partitioning methods to achieve better classification performance than any single partitioning method alone. Experiments were conducted on 15 structured datasets with varying ratios of samples, features, and labels, and the results show that the proposed EMvFSP method effectively improves classification performance. The paper also includes statistical analyses using Friedman ranking and Holm's procedure to demonstrate the effectiveness of the proposed method. The approach provides a robust solution for multi-view learning that can adapt to different types of datasets and partitioning methods, making it suitable for a wide range of applications.
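The ensemble idea can be illustrated with a toy pipeline: several partitionings each split the feature columns into views, one base classifier is trained per view, and predictions are combined by majority vote. Random partitions and logistic-regression learners are assumptions standing in for the multiple partitioning methods the paper actually ensembles.

```python
# Ensemble over feature-set partitionings (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 12))
y = (X[:, 0] + X[:, 5] > 1.0).astype(int)    # synthetic labels

def random_partition(n_features, n_views, rng):
    """Split the columns into disjoint views at random."""
    return np.array_split(rng.permutation(n_features), n_views)

ensemble = []
for _ in range(3):                            # three different partitionings
    for view in random_partition(X.shape[1], n_views=2, rng=rng):
        ensemble.append((view, LogisticRegression().fit(X[:, view], y)))

def predict(x):
    """Majority vote across all (view, classifier) pairs."""
    votes = [clf.predict(x[view].reshape(1, -1))[0] for view, clf in ensemble]
    return int(round(float(np.mean(votes))))

print(predict(X[0]), y[0])
```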

Citations: 0
How to personalize and whether to personalize? Candidate documents decide
IF 2.7 | CAS Tier 4, Computer Science | Q1 Computer Science | Pub Date: 2024-05-27 | DOI: 10.1007/s10115-024-02138-y
Wenhan Liu, Yujia Zhou, Yutao Zhu, Zhicheng Dou

Personalized search plays an important role in satisfying users' information needs owing to its ability to build user profiles from users' search histories. Most existing personalized methods build dynamic user profiles by emphasizing query-related historical behaviors rather than treating every historical behavior equally. However, the ambiguity and brevity of a query can make its underlying intent difficult to pin down, and query-centric user profiles built in such cases will be biased and inaccurate. In this work, we propose leveraging candidate documents, which contain richer information than the short query text, to understand the query intent more accurately and thereby improve the quality of user profiles. Specifically, we use candidate documents to better understand the query intent, so that more relevant user behaviors can be selected from the history to build more accurate user profiles. Moreover, by analyzing the differences between candidate documents, we can better control the degree of personalization applied to the ranking of results. This controlled personalization is also expected to further improve the stability of personalized search, as blind personalization may harm the ranking results. We conduct extensive experiments on two datasets, and the results show that our model significantly outperforms competitive baselines, confirming the benefit of utilizing candidate documents for personalized web search.
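One way to read the core mechanism is as candidate-conditioned history selection: score each history entry against the candidate documents and keep only the most similar ones when building the profile. The sketch below uses bag-of-embedding cosine similarity, which is an assumption; the paper's model is a neural ranker rather than this heuristic.

```python
# Candidate-aware user-profile construction (heuristic sketch).
import numpy as np

def build_profile(history_vecs, candidate_vecs, top_k=3):
    """Keep the history entries most similar to the candidate set."""
    centroid = candidate_vecs.mean(axis=0)
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    scores = np.array([cos(h, centroid) for h in history_vecs])
    chosen = np.argsort(scores)[::-1][:top_k]   # top-k relevant behaviours
    return history_vecs[chosen].mean(axis=0)    # intent-aware profile

rng = np.random.default_rng(0)
profile = build_profile(rng.random((10, 32)), rng.random((5, 32)))
print(profile.shape)    # (32,)
```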

Citations: 0