
Frontiers of Computer Science: Latest Publications

Communication-robust multi-agent learning by adaptable auxiliary multi-agent adversary generation
IF 4.2, CAS Tier 3 (Computer Science), Q1 Mathematics. Pub Date: 2023-12-16. DOI: 10.1007/s11704-023-2733-5
Lei Yuan, Feng Chen, Zongzhang Zhang, Yang Yu

Communication can promote coordination in cooperative Multi-Agent Reinforcement Learning (MARL). Existing works mainly focus on improving the communication efficiency of agents, neglecting that real-world communication is much more challenging, as noise or potential attackers may be present. The robustness of communication-based policies thus becomes an urgent and serious issue that needs more exploration. In this paper, we posit that an ego system1) trained with auxiliary adversaries can handle this limitation, and we propose an adaptable method, Multi-Agent Auxiliary Adversaries Generation for robust Communication (MA3C), to obtain a robust communication-based policy. Specifically, we introduce a novel message-attacking approach that models the learning of the auxiliary attackers as a cooperative problem under a shared goal: minimizing the coordination ability of the ego system, under which every information channel may suffer distinct message attacks. Furthermore, as naive adversarial training may impede the generalization ability of the ego system, we design an attacker population generation approach based on evolutionary learning. Finally, the ego system is paired with an attacker population and alternately trained against the continuously evolving attackers to improve its robustness, meaning that both the ego system and the attackers are adaptable. Extensive experiments on multiple benchmarks indicate that our proposed MA3C provides comparable or better robustness and generalization ability than other baselines.
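As a rough, hypothetical illustration of the population-based adversarial idea (not the paper's algorithm), the sketch below evolves a population of message attackers whose shared fitness is the reduction of a toy ego team's coordination reward. The one-dimensional message model and all function names are invented for illustration:

```python
import random

def ego_reward(message, target, noise):
    # The ego team coordinates successfully when the (possibly attacked)
    # message received by the second agent still identifies the target.
    received = message + noise
    return 1.0 if abs(received - target) < 0.5 else 0.0

def attacker_fitness(noise, episodes, rng):
    # Shared adversarial goal: minimize the ego system's coordination reward,
    # so fitness is the negated average reward under this attack.
    total = 0.0
    for _ in range(episodes):
        target = rng.uniform(-1.0, 1.0)
        total += ego_reward(target, target, noise)  # ego honestly sends the target
    return -total / episodes

def evolve_attackers(pop_size=8, generations=20, seed=0):
    # Evolutionary attacker population: keep the fittest half, mutate them.
    rng = random.Random(seed)
    population = [rng.uniform(-2.0, 2.0) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda n: attacker_fitness(n, 32, rng),
                        reverse=True)
        parents = ranked[: pop_size // 2]
        population = parents + [p + rng.gauss(0.0, 0.3) for p in parents]
    return population
```

In the full method the ego policy would then be retrained against this evolving population, and the two sides alternate; here only the attacker side is sketched.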

Citations: 0
A survey on dynamic graph processing on GPUs: concepts, terminologies and systems
IF 4.2, CAS Tier 3 (Computer Science), Q1 Mathematics. Pub Date: 2023-12-16. DOI: 10.1007/s11704-023-2656-1
Hongru Gao, Xiaofei Liao, Zhiyuan Shao, Kexin Li, Jiajie Chen, Hai Jin

Graphs, which model real-world entities as vertices and the relationships among entities as edges, have proven to be a powerful tool for describing real-world problems in applications. In most real-world scenarios, entities and their relationships are subject to constant change. Graphs that record such changes are called dynamic graphs. In recent years, the widespread application scenarios of dynamic graphs have stimulated extensive research on dynamic graph processing systems that continuously ingest graph updates and produce up-to-date graph analytics results. As the scale of dynamic graphs grows, higher performance is demanded of dynamic graph processing systems. With their massive parallel processing power and high memory bandwidth, GPUs have become mainstream vehicles for accelerating dynamic graph processing tasks. GPU-based dynamic graph processing systems mainly address two challenges: maintaining the graph data when updates occur (i.e., graph updating) and producing analytics results in time (i.e., graph computing). In this paper, we survey GPU-based dynamic graph processing systems and review their methods for addressing both graph updating and graph computing. To comprehensively discuss existing dynamic graph processing systems on GPUs, we first introduce the terminologies of dynamic graph processing and then develop a taxonomy to describe the methods employed for graph updating and graph computing. In addition, we discuss the challenges and future research directions of dynamic graph processing on GPUs.
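To make the updating/computing split concrete, here is a minimal CPU-side sketch (hypothetical, and far simpler than the GPU systems surveyed) of a dynamic graph that ingests batches of edge updates and answers an analytics query, the vertex degree, on the current state:

```python
from collections import defaultdict

class DynamicGraph:
    """Toy undirected dynamic graph: ingest batches of edge updates (graph
    updating) and answer analytics queries on the current state (graph
    computing)."""

    def __init__(self):
        self.adj = defaultdict(set)

    def apply_batch(self, updates):
        # Each update is ('+', u, v) to insert an edge or ('-', u, v) to delete one.
        for op, u, v in updates:
            if op == '+':
                self.adj[u].add(v)
                self.adj[v].add(u)
            else:
                self.adj[u].discard(v)
                self.adj[v].discard(u)

    def degree(self, v):
        return len(self.adj[v])
```

Real GPU systems replace the per-edge loop with massively parallel batch kernels and use GPU-friendly layouts (e.g., slab lists or compressed sparse formats) rather than Python sets.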

Citations: 0
A MLP-Mixer and mixture of expert model for remaining useful life prediction of lithium-ion batteries
IF 4.2, CAS Tier 3 (Computer Science), Q1 Mathematics. Pub Date: 2023-12-16. DOI: 10.1007/s11704-023-3277-4

Abstract

Accurately predicting the Remaining Useful Life (RUL) of lithium-ion batteries is crucial for battery management systems. Deep learning-based methods have been shown to be effective in predicting RUL by leveraging battery capacity time series data. However, the representation learning of features such as long-distance sequence dependencies and mutations in capacity time series still needs to be improved. To address this challenge, this paper proposes a novel deep learning model, the MLP-Mixer and Mixture of Experts (MMMe) model, for RUL prediction. The MMMe model leverages the Gated Recurrent Unit and Multi-Head Attention mechanisms to encode the sequential battery capacity data and capture its temporal features, and a re-zero MLP-Mixer model to capture the high-level features. Additionally, we devise an ensemble predictor based on a Mixture-of-Experts (MoE) architecture to generate reliable RUL predictions. The experimental results on public datasets demonstrate that our proposed model significantly outperforms other existing methods, providing more reliable and precise RUL predictions while also accurately tracking the capacity degradation process. Our code and dataset are available on GitHub.
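As background on the Mixture-of-Experts ensemble idea (a generic sketch, not the MMMe architecture itself), a gating network scores each expert and the final RUL prediction is the softmax-weighted combination of the experts' outputs:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of gate scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_predict(expert_outputs, gate_scores):
    # Mixture-of-Experts combination: the gate's softmax weights blend the
    # individual experts' RUL predictions into one ensemble prediction.
    weights = softmax(gate_scores)
    return sum(w * y for w, y in zip(weights, expert_outputs))
```

In a trained MoE, the gate scores are produced by a learned network conditioned on the input; here they are passed in directly to keep the sketch self-contained.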

Citations: 0
Representation learning: serial-autoencoder for personalized recommendation
IF 4.2, CAS Tier 3 (Computer Science), Q1 Mathematics. Pub Date: 2023-12-16. DOI: 10.1007/s11704-023-2441-1
Yi Zhu, Yishuai Geng, Yun Li, Jipeng Qiang, Xindong Wu

Personalized recommendation has become a research hotspot for addressing information overload. Despite this, generating effective recommendations from sparse data remains a challenge. Recently, auxiliary information has been widely used to address data sparsity, but most models using auxiliary information are linear and have limited expressiveness. Due to their advantages in feature extraction and the absence of label requirements, autoencoder-based methods have become quite popular. However, most existing autoencoder-based methods discard the reconstruction of auxiliary information, which poses huge challenges for better representation learning and model scalability. To address these problems, we propose the Serial-Autoencoder for Personalized Recommendation (SAPR), which aims to reduce the loss of critical information and enhance the learning of feature representations. Specifically, we first combine the original rating matrix and item attribute features and feed them into the first autoencoder to generate a higher-level representation of the input. Second, we use a second autoencoder to enhance the reconstruction of the data representation of the prediction rating matrix. The output rating information is used for recommendation prediction. Extensive experiments on the MovieTweetings and MovieLens datasets have verified the effectiveness of SAPR compared to state-of-the-art models.
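The serial composition described above can be sketched as two chained linear autoencoders. This is a hypothetical, untrained toy: real SAPR learns the weights, and `encode` and the weight layout here are invented purely to show the data flow:

```python
def encode(x, w):
    # Toy linear layer: output j is the dot product of x with column j of w.
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]

def serial_autoencoder(ratings, attributes, w1_enc, w1_dec, w2_enc, w2_dec):
    # Stage 1: encode the concatenated ratings + item-attribute features,
    # then decode to a first reconstruction of the prediction rating matrix.
    x = ratings + attributes
    h1 = encode(x, w1_enc)
    recon1 = encode(h1, w1_dec)
    # Stage 2: a second autoencoder refines that reconstructed representation.
    h2 = encode(recon1, w2_enc)
    return encode(h2, w2_dec)  # final rating prediction
```

With identity weights the pipeline passes the ratings through unchanged, which makes the serial structure easy to verify before plugging in learned weights.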

Citations: 0
Robust AUC maximization for classification with pairwise confidence comparisons
IF 4.2, CAS Tier 3 (Computer Science), Q1 Mathematics. Pub Date: 2023-12-16. DOI: 10.1007/s11704-023-2709-5
Haochen Shi, Mingkun Xie, Shengjun Huang

Supervised learning often requires a large number of labeled examples, which has become a critical bottleneck when manually annotating class labels is costly. To mitigate this issue, a new framework called pairwise comparison (Pcomp) classification is proposed, which allows training examples to be only weakly annotated with pairwise comparisons, i.e., which of two examples is more likely to be positive. Previous studies solve Pcomp problems by minimizing the classification error, which may lead to a less robust model due to its sensitivity to the class distribution. In this paper, we propose a robust learning framework for Pcomp data along with a pairwise surrogate loss called Pcomp-AUC. It provides an unbiased estimator that equivalently maximizes AUC without accessing the precise class labels. Theoretically, we prove consistency with respect to AUC and further provide the estimation error bound of the proposed method. Empirical studies on multiple datasets validate the effectiveness of the proposed method.
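For reference, the quantity being maximized: the empirical AUC is the probability that a randomly drawn positive example is scored above a randomly drawn negative one. The paper's Pcomp-AUC surrogate estimates this without the true labels; the direct computation below assumes the labels are known:

```python
def auc(scores_pos, scores_neg):
    # Empirical AUC: fraction of (positive, negative) pairs ranked correctly,
    # with ties counted as one half.
    total = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            total += 1.0 if sp > sn else (0.5 if sp == sn else 0.0)
    return total / (len(scores_pos) * len(scores_neg))
```

A classifier that maximizes this pairwise ranking criterion is insensitive to the class prior, which is the robustness property the paper exploits for Pcomp data.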

Citations: 0
A robust optimization method for label noisy datasets based on adaptive threshold: Adaptive-k
IF 4.2, CAS Tier 3 (Computer Science), Q1 Mathematics. Pub Date: 2023-12-16. DOI: 10.1007/s11704-023-2430-4
Enes Dedeoglu, Himmet Toprak Kesgin, Mehmet Fatih Amasyali

Using all samples in the optimization process does not produce robust results on datasets with label noise, because the gradients calculated from the losses of the noisy samples drive the optimization in the wrong direction. In this paper, we recommend using only the samples whose loss is below a threshold determined during optimization, instead of all samples in the mini-batch. Our proposed method, Adaptive-k, aims to exclude label-noise samples from the optimization process and make the process robust. On noisy datasets, we found that a threshold-based approach such as Adaptive-k produces better results than using all samples or a fixed number of low-loss samples in the mini-batch. On the basis of our theoretical analysis and experimental results, we show that Adaptive-k comes closest to the performance of the Oracle, in which noisy samples are entirely removed from the dataset. Adaptive-k is a simple but effective method. It does not require prior knowledge of the noise ratio of the dataset, does not require additional model training, and does not increase training time significantly. In the experiments, we also show that Adaptive-k is compatible with different optimizers such as SGD, SGDM, and Adam. The code for Adaptive-k is available on GitHub.
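The core idea, filtering the mini-batch by a loss threshold maintained during training, can be sketched as follows. The moving-average threshold rule here is an assumption for illustration, not the paper's exact formula:

```python
def select_low_loss(losses, threshold):
    # Keep only the mini-batch samples whose loss is below the threshold;
    # suspected label-noise samples (high loss) are excluded from the update.
    return [i for i, l in enumerate(losses) if l < threshold]

def adaptive_threshold(losses, history, momentum=0.9):
    # Hypothetical adaptive rule: an exponential moving average of the mean
    # batch loss, updated as training progresses (history=None on first batch).
    mean = sum(losses) / len(losses)
    return momentum * history + (1 - momentum) * mean if history is not None else mean
```

The gradient step is then computed only over the selected indices, so mislabeled samples with outlier losses stop steering the optimizer.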

Citations: 0
Gria: an efficient deterministic concurrency control protocol
IF 4.2, CAS Tier 3 (Computer Science), Q1 Mathematics. Pub Date: 2023-12-16. DOI: 10.1007/s11704-023-2605-z
Xinyuan Wang, Yun Peng, Hejiao Huang

Deterministic databases can reduce coordination costs in replication. This property has fostered significant interest in the design of efficient deterministic concurrency control protocols. However, the state-of-the-art deterministic concurrency control protocol, Aria, has three issues. First, it is impractical to configure a suitable batch size when the read-write set is unknown. Second, Aria running in low-concurrency scenarios, e.g., a single-thread scenario, suffers from the same conflicts as in high-concurrency scenarios. Third, the single-version schema introduces write-after-write conflicts.

To address these issues, we propose Gria, an efficient deterministic concurrency control protocol with the following properties. First, the batch size of Gria is auto-scaling. Second, Gria's conflict probability in low-concurrency scenarios is lower than in high-concurrency scenarios. Third, Gria has no write-after-write conflicts, thanks to its multi-version structure. To further reduce conflicts, we propose two optimizations: a reordering mechanism and a rechecking strategy. Evaluation results on two popular benchmarks show that Gria outperforms Aria by 13x.
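How a multi-version structure removes write-after-write conflicts can be illustrated with a toy key-value store (hypothetical; Gria's actual data structures are described in the paper): each write appends a new version tagged with its transaction id instead of overwriting in place, so two writers to the same key never clobber each other:

```python
class MultiVersionStore:
    """Toy multi-version KV store: writes append versions stamped with the
    transaction id (no in-place overwrite, hence no write-after-write
    conflict); reads see the latest version at or before their tid."""

    def __init__(self):
        self.versions = {}  # key -> list of (tid, value), appended in tid order

    def write(self, tid, key, value):
        self.versions.setdefault(key, []).append((tid, value))

    def read(self, tid, key):
        visible = [(t, v) for t, v in self.versions.get(key, []) if t <= tid]
        return max(visible)[1] if visible else None
```

A deterministic protocol can then assign version order from the predetermined transaction order, so all replicas converge on the same state without extra coordination.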

Citations: 0
Density estimation-based method to determine sample size for random sample partition of big data
IF 4.2, CAS Tier 3 (Computer Science), Q1 Mathematics. Pub Date: 2023-12-16. DOI: 10.1007/s11704-023-2356-x

Abstract

Random sample partition (RSP) is a newly developed big data representation and management model to deal with big data approximate computation problems. Academic research and practical applications have confirmed that RSP is an efficient solution for big data processing and analysis. However, a challenge for implementing RSP is determining an appropriate sample size for RSP data blocks. While a large sample size increases the burden of big data computation, a small size will lead to insufficient distribution information for RSP data blocks. To address this problem, this paper presents a novel density estimation-based method (DEM) to determine the optimal sample size for RSP data blocks. First, a theoretical sample size is calculated based on the multivariate Dvoretzky-Kiefer-Wolfowitz (DKW) inequality by using the fixed-point iteration (FPI) method. Second, a practical sample size is determined by minimizing the validation error of a kernel density estimator (KDE) constructed on RSP data blocks for an increasing sample size. Finally, a series of persuasive experiments are conducted to validate the feasibility, rationality, and effectiveness of DEM. Experimental results show that (1) the iteration function of the FPI method is convergent for calculating the theoretical sample size from the multivariate DKW inequality; (2) the KDE constructed on RSP data blocks with sample size determined by DEM can yield a good approximation of the probability density function (p.d.f.); and (3) DEM provides more accurate sample sizes than the existing sample size determination methods from the perspective of p.d.f. estimation. This demonstrates that DEM is a viable approach to deal with the sample size determination problem for big data RSP implementation.
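To make the first step concrete: in the univariate case the DKW inequality yields a closed-form sample size, and a generic fixed-point iteration of the kind the paper applies to the implicit multivariate bound looks like this (a simplified sketch; the multivariate equation itself is in the paper):

```python
import math

def dkw_sample_size(eps, delta):
    # Univariate DKW bound: P(sup_x |F_n(x) - F(x)| > eps) <= 2 exp(-2 n eps^2),
    # so n >= ln(2/delta) / (2 eps^2) bounds the deviation by eps with
    # probability at least 1 - delta.
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def fixed_point(g, x0, tol=1e-9, max_iter=1000):
    # Generic fixed-point iteration x_{k+1} = g(x_k), the scheme used to solve
    # an implicit sample-size equation when no closed form exists.
    x = x0
    for _ in range(max_iter):
        nx = g(x)
        if abs(nx - x) < tol:
            return nx
        x = nx
    return x
```

For example, eps = delta = 0.05 gives a theoretical sample size of 738 in the univariate case; the multivariate bound used by DEM produces a different (implicit) requirement, which is where the iteration comes in.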

Citations: 0
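The second step of DEM — growing the sample until the KDE's held-out validation error stops improving — can be sketched as follows. This is a rough analogue under assumed details, not the authors' exact procedure: it uses SciPy's `gaussian_kde`, scores each candidate size by negative mean log-density on a fixed validation set, and accepts the smallest size within a tolerance `tol` of the best score. The function name and parameters are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def practical_sample_size(data, sizes, n_val=1000, tol=0.05, seed=0):
    """Pick a sample size by minimizing KDE validation error.

    Illustrative analogue of DEM's second step: hold out n_val
    validation points, build a KDE on a candidate sample of each
    size, and accept the smallest size whose held-out negative
    log-density is within `tol` of the best one observed.
    """
    rng = np.random.default_rng(seed)
    val_idx = rng.choice(len(data), size=n_val, replace=False)
    val = data[val_idx]                      # held-out validation set
    rest = np.delete(data, val_idx, axis=0)  # pool to sample blocks from

    candidates = sorted(sizes)
    errors = []
    for n in candidates:
        sample = rest[rng.choice(len(rest), size=n, replace=False)]
        kde = gaussian_kde(sample.T)         # KDE on one candidate "RSP block"
        dens = kde(val.T)                    # estimated density at held-out points
        errors.append(-np.mean(np.log(dens + 1e-12)))  # validation error (NLL)
    errors = np.asarray(errors)

    # smallest size whose error is within tol of the best error seen
    chosen = candidates[int(np.argmax(errors <= errors.min() + tol))]
    return chosen, errors
```

Note that `gaussian_kde` expects data with shape `(d, n)`, hence the transposes; the `1e-12` floor only guards against zero densities far in the tails.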
Minimizing the cost of periodically replicated systems via model and quantitative analysis
IF 4.2 CAS Tier 3 Computer Science Q1 Mathematics Pub Date : 2023-12-16 DOI: 10.1007/s11704-023-2625-8
Chenhao Zhang, Liang Wang, Limin Xiao, Shixuan Jiang, Meng Han, Jinquan Wang, Bing Wei, Guangjun Qin

Geographically replicating objects across multiple data centers improves the performance and reliability of cloud storage systems. Maintaining consistent replicas comes with high synchronization costs, as it faces more expensive WAN transport prices and increased latency. Periodic replication is the widely used technique to reduce the synchronization costs. Periodic replication strategies in existing cloud storage systems are too static to handle traffic changes, which indicates that they are inflexible in the face of unforeseen loads, resulting in additional synchronization cost. We propose quantitative analysis models to quantify consistency and synchronization cost for periodically replicated systems, and derive the optimal synchronization period to achieve the best tradeoff between consistency and synchronization cost. Based on this, we propose a dynamic periodic synchronization method, Sync-Opt, which allows systems to set the optimal synchronization period according to the variable load in clouds to minimize the synchronization cost. Simulation results demonstrate the effectiveness of our models. Compared with the policies widely used in modern cloud storage systems, the Sync-Opt strategy significantly reduces the synchronization cost.

Geographically replicating objects across multiple data centers improves the performance and reliability of cloud storage systems. Maintaining consistent replicas comes with high synchronization costs, as it faces more expensive WAN transport prices and increased latency. Periodic replication is a widely used technique for reducing synchronization costs. However, the periodic replication strategies in existing cloud storage systems are too static to handle traffic changes, which makes them inflexible in the face of unforeseen loads and incurs additional synchronization cost. We propose quantitative analysis models to quantify consistency and synchronization cost for periodically replicated systems, and derive the optimal synchronization period that achieves the best tradeoff between the two. On this basis, we propose a dynamic periodic synchronization method, Sync-Opt, which lets systems set the optimal synchronization period according to the variable load in the cloud, minimizing synchronization cost. Simulation results demonstrate the effectiveness of our models. Compared with the policies widely used in modern cloud storage systems, the Sync-Opt strategy significantly reduces synchronization cost.
{"title":"Minimizing the cost of periodically replicated systems via model and quantitative analysis","authors":"Chenhao Zhang, Liang Wang, Limin Xiao, Shixuan Jiang, Meng Han, Jinquan Wang, Bing Wei, Guangjun Qin","doi":"10.1007/s11704-023-2625-8","DOIUrl":"https://doi.org/10.1007/s11704-023-2625-8","url":null,"abstract":"<p>Geographically replicating objects across multiple data centers improves the performance and reliability of cloud storage systems. Maintaining consistent replicas comes with high synchronization costs, as it faces more expensive WAN transport prices and increased latency. Periodic replication is the widely used technique to reduce the synchronization costs. Periodic replication strategies in existing cloud storage systems are too static to handle traffic changes, which indicates that they are inflexible in the face of unforeseen loads, resulting in additional synchronization cost. We propose quantitative analysis models to quantify consistency and synchronization cost for periodically replicated systems, and derive the optimal synchronization period to achieve the best tradeoff between consistency and synchronization cost. Based on this, we propose a dynamic periodic synchronization method, Sync-Opt, which allows systems to set the optimal synchronization period according to the variable load in clouds to minimize the synchronization cost. Simulation results demonstrate the effectiveness of our models. Compared with the policies widely used in modern cloud storage systems, the Sync-Opt strategy significantly reduces the synchronization cost.</p>","PeriodicalId":12640,"journal":{"name":"Frontiers of Computer Science","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2023-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138681627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
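The consistency/cost tradeoff this paper formalizes can be illustrated with a toy cost model (not the paper's actual quantitative model): each synchronization pays a fixed WAN cost, so the amortized sync cost falls as 1/T, while the inconsistency penalty grows with the period T. Minimizing the sum gives a closed-form optimal period; all names and the linear penalty are assumptions for illustration.

```python
import math

def cost_per_unit_time(period, sync_cost, staleness_rate):
    """Toy periodic-replication cost model: each sync pays a fixed
    WAN transfer cost `sync_cost`, amortized over the period, while
    inconsistency accrues at `staleness_rate` per unit of period
    length: cost(T) = c/T + r*T."""
    return sync_cost / period + staleness_rate * period

def optimal_sync_period(sync_cost, staleness_rate):
    """Minimize cost(T) = c/T + r*T. Setting the derivative
    -c/T**2 + r to zero gives T* = sqrt(c / r)."""
    return math.sqrt(sync_cost / staleness_rate)
```

For example, with a sync cost of 100 and staleness rate of 1, the optimal period is 10: syncing twice as often or half as often both raise the per-unit-time cost from 20 to 25. A dynamic strategy in the spirit of Sync-Opt would re-evaluate these parameters as the observed load changes.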
Index-free triangle-based graph local clustering
IF 4.2 CAS Tier 3 Computer Science Q1 Mathematics Pub Date : 2023-12-13 DOI: 10.1007/s11704-023-2768-7
Zhe Yuan, Zhewei Wei, Fangrui Lv, Ji-Rong Wen

Motif-based graph local clustering (MGLC) is a popular method for graph mining tasks due to its various applications. However, the traditional two-phase approach of precomputing motif weights before performing local clustering loses locality and is impractical for large graphs. While some attempts have been made to address the efficiency bottleneck, there is still no applicable algorithm for large scale graphs with billions of edges. In this paper, we propose a purely local and index-free method called Index-free Triangle-based Graph Local Clustering (TGLC*) to solve the MGLC problem w.r.t. a triangle. TGLC* directly estimates the Personalized PageRank (PPR) vector using random walks with the desired triangle-weighted distribution and proposes the clustering result using a standard sweep procedure. We demonstrate TGLC*’s scalability through theoretical analysis and its practical benefits through a novel visualization layout. TGLC* is the first algorithm to solve the MGLC problem without precomputing the motif weight. Extensive experiments on seven real-world large-scale datasets show that TGLC* is applicable and scalable for large graphs.

Motif-based graph local clustering (MGLC) is a popular method for graph mining tasks due to its wide range of applications. However, the traditional two-phase approach of precomputing motif weights before performing local clustering loses locality and is impractical for large graphs. Although some attempts have been made to address the efficiency bottleneck, there is still no algorithm applicable to large-scale graphs with billions of edges. In this paper, we propose a purely local, index-free method called Index-free Triangle-based Graph Local Clustering (TGLC*) to solve the MGLC problem with respect to a triangle. TGLC* directly estimates the Personalized PageRank (PPR) vector using random walks with the desired triangle-weighted distribution, and produces the clustering result with a standard sweep procedure. We demonstrate TGLC*'s scalability through theoretical analysis and its practical benefits through a novel visualization layout. TGLC* is the first algorithm to solve the MGLC problem without precomputing motif weights. Extensive experiments on seven real-world large-scale datasets show that TGLC* is applicable and scalable to large graphs.
{"title":"Index-free triangle-based graph local clustering","authors":"Zhe Yuan, Zhewei Wei, Fangrui Lv, Ji-Rong Wen","doi":"10.1007/s11704-023-2768-7","DOIUrl":"https://doi.org/10.1007/s11704-023-2768-7","url":null,"abstract":"<p>Motif-based graph local clustering (MGLC) is a popular method for graph mining tasks due to its various applications. However, the traditional two-phase approach of precomputing motif weights before performing local clustering loses locality and is impractical for large graphs. While some attempts have been made to address the efficiency bottleneck, there is still no applicable algorithm for large scale graphs with billions of edges. In this paper, we propose a purely local and index-free method called Index-free Triangle-based Graph Local Clustering (TGLC*) to solve the MGLC problem w.r.t. a triangle. TGLC* directly estimates the Personalized PageRank (PPR) vector using random walks with the desired triangle-weighted distribution and proposes the clustering result using a standard sweep procedure. We demonstrate TGLC*’s scalability through theoretical analysis and its practical benefits through a novel visualization layout. TGLC* is the first algorithm to solve the MGLC problem without precomputing the motif weight. Extensive experiments on seven real-world large-scale datasets show that TGLC* is applicable and scalable for large graphs.</p>","PeriodicalId":12640,"journal":{"name":"Frontiers of Computer Science","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138579486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
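The two primitives TGLC* builds on — Monte-Carlo PPR estimation by random walks and the standard sweep procedure — can be sketched as follows. The sketch uses the plain uniform-neighbor walk; TGLC*'s contribution, biasing the walk by the triangle-weighted distribution, is omitted here. Function names are illustrative, and a simple undirected graph (adjacency lists, no isolated nodes) is assumed.

```python
import random
from collections import Counter

def approx_ppr(adj, seed, alpha=0.15, walks=20000, rng=None):
    """Monte-Carlo PPR: run random walks from `seed` that terminate
    with probability `alpha` at each step, and count where they end.
    The endpoint frequencies converge to the PPR vector. (TGLC*
    instead biases each step by triangle weights.)"""
    rng = rng or random.Random(0)
    counts = Counter()
    for _ in range(walks):
        u = seed
        while rng.random() > alpha:       # continue walking w.p. 1 - alpha
            u = rng.choice(adj[u])        # uniform neighbor; assumes adj[u] non-empty
        counts[u] += 1
    return {v: c / walks for v, c in counts.items()}

def sweep_cut(adj, ppr):
    """Standard sweep: order nodes by ppr(v)/deg(v) descending and
    return the prefix with the smallest conductance."""
    order = sorted(ppr, key=lambda v: ppr[v] / len(adj[v]), reverse=True)
    total_vol = sum(len(adj[v]) for v in adj)
    in_set, vol, cut = set(), 0, 0
    best, best_phi = set(), float("inf")
    for v in order:
        in_set.add(v)
        vol += len(adj[v])
        for w in adj[v]:                  # update cut edges incrementally
            cut += -1 if w in in_set else 1
        denom = min(vol, total_vol - vol)
        if denom == 0:                    # prefix swallowed the whole graph
            break
        phi = cut / denom                 # conductance of the current prefix
        if phi < best_phi:
            best_phi, best = phi, set(in_set)
    return best, best_phi
```

On two triangles joined by a single edge, a walk seeded in one triangle concentrates its PPR mass there, and the sweep recovers that triangle as the cluster with conductance 1/7.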