
Latest Publications — Frontiers of Computer Science

A robust optimization method for label noisy datasets based on adaptive threshold: Adaptive-k
IF 4.2 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-12-16 · DOI: 10.1007/s11704-023-2430-4
Enes Dedeoglu, Himmet Toprak Kesgin, Mehmet Fatih Amasyali

Using all samples in the optimization process does not produce robust results on datasets with label noise, because gradients computed from the losses of noisy samples drive the optimization in the wrong direction. In this paper, we recommend using only those samples in the mini-batch whose loss falls below a threshold determined during optimization, instead of using all samples. Our proposed method, Adaptive-k, aims to exclude label-noise samples from the optimization process and make the process robust. On noisy datasets, we found that a threshold-based approach such as Adaptive-k produces better results than using all samples or a fixed number of low-loss samples in the mini-batch. On the basis of our theoretical analysis and experimental results, we show that Adaptive-k comes closest to the performance of the Oracle, in which noisy samples are entirely removed from the dataset. Adaptive-k is a simple but effective method: it requires no prior knowledge of the dataset's noise ratio, no additional model training, and does not significantly increase training time. In the experiments, we also show that Adaptive-k is compatible with different optimizers such as SGD, SGDM, and Adam. The code for Adaptive-k is available on GitHub.
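The selection step described above — keeping only mini-batch samples whose loss falls below an adaptively determined threshold, so that likely-mislabeled samples contribute no gradient — can be sketched as follows. This is a minimal illustration, not the authors' exact Adaptive-k algorithm: how the threshold is adapted during training is the core of the paper and is not reproduced here, so the fixed `threshold` argument is a hypothetical stand-in.

```python
import numpy as np

def select_low_loss(losses, threshold):
    """Return a boolean mask keeping only samples whose loss is below
    the threshold; excluded samples contribute no gradient.
    Illustrative sketch only -- Adaptive-k determines the threshold
    adaptively during optimization, which is omitted here."""
    keep = losses < threshold
    if not keep.any():            # fall back to the full batch
        keep = np.ones_like(losses, dtype=bool)
    return keep

batch_losses = np.array([0.21, 0.34, 5.12, 0.27])  # 5.12: likely noisy label
mask = select_low_loss(batch_losses, threshold=1.0)
print(mask.tolist())  # [True, True, False, True]
```

The masked losses would then be averaged and backpropagated as usual, which is why the method works unchanged with SGD, SGDM, or Adam.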

Citations: 0
Gria: an efficient deterministic concurrency control protocol
IF 4.2 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-12-16 · DOI: 10.1007/s11704-023-2605-z
Xinyuan Wang, Yun Peng, Hejiao Huang

Deterministic databases are able to reduce coordination costs in replication. This property has fostered significant interest in the design of efficient deterministic concurrency control protocols. However, the state-of-the-art deterministic concurrency control protocol Aria has three issues. First, it is impractical to configure a suitable batch size when the read-write set is unknown. Second, Aria running in low-concurrency scenarios, e.g., a single-thread scenario, suffers from the same conflicts as in high-concurrency scenarios. Third, the single-version schema brings write-after-write conflicts.

To address these issues, we propose Gria, an efficient deterministic concurrency control protocol. Gria has the following properties. First, the batch size of Gria is auto-scaling. Second, Gria’s conflict probability in low-concurrency scenarios is lower than that in high-concurrency scenarios. Third, Gria has no write-after-write conflicts by adopting a multi-version structure. To further reduce conflicts, we propose two optimizations: a reordering mechanism as well as a rechecking strategy. The evaluation result on two popular benchmarks shows that Gria outperforms Aria by 13x.

Citations: 0
Density estimation-based method to determine sample size for random sample partition of big data
IF 4.2 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-12-16 · DOI: 10.1007/s11704-023-2356-x

Abstract

Random sample partition (RSP) is a newly developed big data representation and management model to deal with big data approximate computation problems. Academic research and practical applications have confirmed that RSP is an efficient solution for big data processing and analysis. However, a challenge for implementing RSP is determining an appropriate sample size for RSP data blocks. While a large sample size increases the burden of big data computation, a small size will lead to insufficient distribution information for RSP data blocks. To address this problem, this paper presents a novel density estimation-based method (DEM) to determine the optimal sample size for RSP data blocks. First, a theoretical sample size is calculated based on the multivariate Dvoretzky-Kiefer-Wolfowitz (DKW) inequality by using the fixed-point iteration (FPI) method. Second, a practical sample size is determined by minimizing the validation error of a kernel density estimator (KDE) constructed on RSP data blocks for an increasing sample size. Finally, a series of persuasive experiments are conducted to validate the feasibility, rationality, and effectiveness of DEM. Experimental results show that (1) the iteration function of the FPI method is convergent for calculating the theoretical sample size from the multivariate DKW inequality; (2) the KDE constructed on RSP data blocks with sample size determined by DEM can yield a good approximation of the probability density function (p.d.f.); and (3) DEM provides more accurate sample sizes than the existing sample size determination methods from the perspective of p.d.f. estimation. This demonstrates that DEM is a viable approach to deal with the sample size determination problem for big data RSP implementation.

Citations: 0
Minimizing the cost of periodically replicated systems via model and quantitative analysis
IF 4.2 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-12-16 · DOI: 10.1007/s11704-023-2625-8
Chenhao Zhang, Liang Wang, Limin Xiao, Shixuan Jiang, Meng Han, Jinquan Wang, Bing Wei, Guangjun Qin

Geographically replicating objects across multiple data centers improves the performance and reliability of cloud storage systems. Maintaining consistent replicas, however, comes with high synchronization costs, since WAN transport is more expensive and adds latency. Periodic replication is a widely used technique for reducing synchronization costs. Periodic replication strategies in existing cloud storage systems are too static to handle traffic changes: they are inflexible in the face of unforeseen loads, resulting in additional synchronization cost. We propose quantitative analysis models that quantify consistency and synchronization cost for periodically replicated systems, and derive the optimal synchronization period that achieves the best tradeoff between the two. Based on this, we propose a dynamic periodic synchronization method, Sync-Opt, which lets systems set the optimal synchronization period according to the variable load in clouds, minimizing the synchronization cost. Simulation results demonstrate the effectiveness of our models. Compared with the policies widely used in modern cloud storage systems, the Sync-Opt strategy significantly reduces the synchronization cost.
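The period-versus-cost tradeoff behind an optimal synchronization period can be made concrete with a toy model. This is an assumed illustrative model, not the paper's quantitative analysis model: if each synchronization costs a fixed amount and inconsistency accrues linearly within a period, the total cost rate has a closed-form minimizer.

```python
import math

def optimal_sync_period(sync_cost, inconsistency_rate):
    """Toy tradeoff (assumed for illustration, not the paper's model):
        cost_rate(T) = sync_cost / T + inconsistency_rate * T / 2,
    i.e., amortized sync cost plus the average-staleness penalty over a
    period of length T.  Setting d(cost_rate)/dT = 0 gives
        T* = sqrt(2 * sync_cost / inconsistency_rate)."""
    return math.sqrt(2.0 * sync_cost / inconsistency_rate)

def cost_rate(T, sync_cost, inconsistency_rate):
    return sync_cost / T + inconsistency_rate * T / 2.0

T_star = optimal_sync_period(sync_cost=8.0, inconsistency_rate=4.0)
print(T_star)  # 2.0
```

A dynamic scheme in this spirit would re-evaluate `T_star` as the observed load (and hence the two cost terms) changes, rather than fixing the period statically.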

Citations: 0
Index-free triangle-based graph local clustering
IF 4.2 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-12-13 · DOI: 10.1007/s11704-023-2768-7
Zhe Yuan, Zhewei Wei, Fangrui Lv, Ji-Rong Wen

Motif-based graph local clustering (MGLC) is a popular method for graph mining tasks due to its wide range of applications. However, the traditional two-phase approach of precomputing motif weights before performing local clustering loses locality and is impractical for large graphs. While some attempts have been made to address this efficiency bottleneck, there is still no applicable algorithm for large-scale graphs with billions of edges. In this paper, we propose a purely local and index-free method, Index-free Triangle-based Graph Local Clustering (TGLC*), to solve the MGLC problem with respect to a triangle motif. TGLC* directly estimates the Personalized PageRank (PPR) vector using random walks with the desired triangle-weighted distribution and derives the clustering result using a standard sweep procedure. We demonstrate TGLC*'s scalability through theoretical analysis and its practical benefits through a novel visualization layout. TGLC* is the first algorithm to solve the MGLC problem without precomputing motif weights. Extensive experiments on seven real-world large-scale datasets show that TGLC* is applicable and scalable for large graphs.
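The core estimation step above — approximating a PPR vector with restart-terminated random walks — can be sketched as below. This shows plain (unweighted) PPR only; TGLC* additionally biases the walk by triangle weights, which is omitted here, and the graph and parameters are illustrative.

```python
import random

def monte_carlo_ppr(adj, seed, alpha=0.15, num_walks=20000, rng=None):
    """Estimate the Personalized PageRank vector of 'seed' by running
    random walks that terminate at each step with probability alpha;
    the empirical distribution of terminal nodes converges to the PPR
    vector as num_walks grows."""
    rng = rng or random.Random(42)
    counts = {}
    for _ in range(num_walks):
        node = seed
        while rng.random() > alpha:        # continue with prob. 1 - alpha
            neighbors = adj.get(node)
            if not neighbors:              # dangling node: stop the walk
                break
            node = rng.choice(neighbors)
        counts[node] = counts.get(node, 0) + 1
    return {v: c / num_walks for v, c in counts.items()}

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [2]}  # node 3 points in only
ppr = monte_carlo_ppr(triangle, seed=0)
# The seed carries the most mass; node 3 is unreachable from the seed.
```

A sweep procedure would then sort nodes by their (degree-normalized) PPR score and cut at the prefix with the best conductance, which is the "standard sweep" the abstract refers to.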

Citations: 0
Constrained clustering with weak label prior
IF 4.2 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-12-13 · DOI: 10.1007/s11704-023-3355-7
Jing Zhang, Ruidong Fan, Hong Tao, Jiacheng Jiang, Chenping Hou

Clustering is widely exploited in data mining, and embedding a weak label prior into clustering has proved effective in improving its performance. Previous research has mainly focused on a single type of prior. However, in many real scenarios, two kinds of weak label prior information, e.g., pairwise constraints and the cluster ratio, are easily obtained or already available. How to incorporate them to improve clustering performance is important but rarely studied. We propose a novel Constrained Clustering with Weak Label Prior method (CWLP), an integrated framework. Within a unified spectral clustering model, pairwise constraints are employed as a regularizer in spectral embedding, and the label proportion is added as a constraint in spectral rotation. To approximate a variant of the embedding matrix more precisely, we replace the cluster indicator matrix with its scaled version. Instead of fixing an initial similarity matrix, we propose a new similarity matrix that is more suitable for deriving clustering results. Beyond theoretical convergence and computational-complexity analyses, we validate the effectiveness of CWLP on several benchmark datasets, together with its ability to discriminate suspected breast cancer patients from healthy controls. The experimental evaluation illustrates the superiority of our proposed approach.

Citations: 0
Safeguarding text generation API’s intellectual property through meaning-preserving lexical watermarks
IF 4.2 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-12-13 · DOI: 10.1007/s11704-023-3252-0
Shiyu Zhu, Yun Li, Xiaoye Ouyang, Xiaocheng Hu, Jipeng Qiang

In this work, we aim to protect the intellectual property of text generation APIs. Previous lexical watermarking (LW) methods compromised text quality and made watermarks easy to detect through error analysis because they did not account for polysemy. To address this, we propose a meaning-preserving lexical substitution method that considers the target word's correct meaning in its context x. This enables high-confidence identification while making watermarks harder to detect.
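The general idea of a lexical watermark — deterministically choosing among interchangeable word variants with a secret key, so the API owner can later test generated text for the keyed pattern — can be sketched as below. This toy version ignores context entirely, which is exactly the polysemy weakness the proposed meaning-preserving method addresses; the synonym table and key are hypothetical.

```python
import hashlib

def embed_watermark(text, synonyms, key):
    """Replace each word that has a synonym set with the variant chosen
    by a keyed hash; the choice is deterministic, so the API owner can
    recompute it to check provenance.  Toy sketch: a practical LW method
    (and the paper's) must also preserve meaning in context."""
    out = []
    for word in text.split():
        options = synonyms.get(word.lower(), [word])
        h = int(hashlib.sha256((key + word.lower()).encode()).hexdigest(), 16)
        out.append(options[h % len(options)])
    return " ".join(out)

def matches_watermark(text, synonyms, key):
    """A text carries the mark if every substitutable word equals the
    keyed choice; unmarked text fails this with high probability when
    many substitutable positions are present."""
    return text == embed_watermark(text, synonyms, key)

syns = {"big": ["big", "large"], "quick": ["quick", "fast"]}
marked = embed_watermark("a big quick model", syns, key="secret")
```

Detection re-runs the same keyed substitution: text produced by the watermarked API is a fixed point of `embed_watermark`, while independently written text generally is not.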

Citations: 0
Semantic similarity-based program retrieval: a multi-relational graph perspective
IF 4.2 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-12-13 · DOI: 10.1007/s11704-023-2678-8
Qianwen Gou, Yunwei Dong, YuJiao Wu, Qiao Ke

In this paper, we formulate the program retrieval problem as a graph similarity problem. This is achieved by first explicitly representing queries and program snippets as Abstract Meaning Representation (AMR) graphs and Code Property Graphs (CPGs), respectively. Intra-level and inter-level attention mechanisms then infer fine-grained correspondence by propagating node correspondence along graph edges. Moreover, this design can learn the correspondence of nodes at different levels, which was mostly ignored by previous works. Experiments have demonstrated the superiority of USRAE.

Citations: 0
The governance technology for blockchain systems: a survey
IF 4.2 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-12-06 · DOI: 10.1007/s11704-023-3113-x
Guocheng Zhu, Debiao He, Haoyang An, Min Luo, Cong Peng

After the Ethereum DAO attack in 2016, which resulted in significant economic losses, blockchain governance has become a prominent research area. However, a comprehensive and systematic literature review on blockchain governance is lacking. To understand the process of blockchain governance in depth and to guide the future design of blockchain governance models, we provide an in-depth review of blockchain governance. First, we introduce the consensus algorithms currently used in blockchains and relate them to governance theory. Second, we present the main content of off-chain governance and investigate two well-known off-chain governance projects. Third, we investigate four common on-chain governance voting techniques, summarize seven attributes that the on-chain governance voting process should satisfy, and analyze four well-known on-chain governance blockchain projects based on the preceding research. We hope this survey provides insight into the potential development directions of blockchain governance and helps devise the future research agenda.

Citations: 0
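Among the on-chain voting techniques that surveys such as this one examine, coin- (token-) weighted voting with a quorum requirement is the most common baseline. The sketch below illustrates only that baseline; the function name, quorum threshold, and data shapes are invented for the example and are not drawn from the surveyed paper or from any specific blockchain project.

```python
# Minimal sketch of token-weighted on-chain governance voting with a
# quorum check. All names and thresholds here are illustrative.

from collections import defaultdict

def tally(votes, balances, quorum_fraction=0.4):
    """votes: {voter: 'yes'|'no'}; balances: {voter: token weight}.

    Returns (passed, turnout, weight_for, weight_against). A proposal
    passes only if turnout meets the quorum and the 'yes' weight is a
    strict majority of the weight cast.
    """
    weight = defaultdict(float)
    for voter, choice in votes.items():
        weight[choice] += balances.get(voter, 0.0)
    total_supply = sum(balances.values())
    cast = weight["yes"] + weight["no"]
    turnout = cast / total_supply if total_supply else 0.0
    passed = turnout >= quorum_fraction and weight["yes"] > weight["no"]
    return passed, turnout, weight["yes"], weight["no"]

balances = {"a": 50, "b": 30, "c": 20}
votes = {"a": "yes", "c": "no"}
passed, turnout, w_yes, w_no = tally(votes, balances)
# turnout = 0.7, yes weight 50 > no weight 20 -> proposal passes
```

Real systems layer more on top of this baseline, e.g., vote delegation, time-locked execution, and balance snapshots at a fixed block height so tokens cannot be moved and voted twice.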
MLDA: a multi-level k-degree anonymity scheme on directed social network graphs
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2023-12-04 DOI: 10.1007/s11704-023-2759-8
Yuanjing Hao, Long Li, Liang Chang, Tianlong Gu

With the emergence of network-centric data, publishing social network graphs helps data analysts mine the value of social networks, analyze the social behavior of individuals or groups, implement personalized recommendations, and so on. However, published social network graphs are often subject to re-identification attacks by adversaries, which leak users’ privacy. k-anonymity is widely used in graph publishing and is quite effective at resisting re-identification attacks. However, current research still has several unsolved issues: the protection of directed graphs receives less attention than that of undirected graphs; the graph structure is often ignored while the protection of nodes’ identities is achieved; and the same protection is applied to all users, which does not meet users’ differing privacy requirements. Therefore, to address these issues, this paper proposes a multi-level k-degree anonymity (MLDA) scheme for directed social network graphs. First, node sets of different importance are partitioned by the firefly algorithm and a constrained connectedness upper approximation, and each set receives a different level of k-degree anonymity protection to meet users’ differing privacy requirements. Second, a new graph anonymization method is proposed that adds and removes edges with the help of fake nodes. In addition, to improve the utility of the anonymized graph, a new edge cost criterion is proposed for selecting the most appropriate edge to remove. Third, to preserve the community structure of the original graph as much as possible, fake nodes within the same community are merged before fake nodes from different communities. Experimental results on real datasets show that the proposed MLDA scheme effectively balances the privacy and utility of the anonymized graph.

Citations: 0
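The k-degree anonymity idea underlying MLDA can be illustrated with a much simpler baseline: greedy degree anonymization of a single degree sequence in the style of Liu and Terzi. The sketch below is only that baseline — it ignores MLDA’s multi-level protection, firefly-based node partitioning, fake nodes, and community preservation — and the function name is invented for the example.

```python
# Minimal sketch of plain k-degree anonymity on a degree sequence
# (illustrative only; MLDA's multi-level scheme is far more involved).

def k_anonymize_degrees(degrees, k):
    """Raise degrees so every resulting value is shared by >= k nodes.

    Greedy variant of Liu-Terzi degree anonymization: sort degrees in
    descending order, walk groups of size k, and lift each member of a
    group to the group's maximum degree.
    """
    if k <= 1:
        return sorted(degrees, reverse=True)
    ds = sorted(degrees, reverse=True)
    out = []
    i = 0
    n = len(ds)
    while i < n:
        # If fewer than 2k nodes remain, keep them all in one group so
        # the final group is never smaller than k.
        j = n if n - i < 2 * k else i + k
        out.extend([ds[i]] * (j - i))   # lift group to its max degree
        i = j
    return out

anon = k_anonymize_degrees([5, 5, 4, 3, 2, 2, 1], k=2)
# -> [5, 5, 4, 4, 2, 2, 2]: every degree value appears at least twice
```

On a real graph the anonymized degree sequence must then be realized by adding or removing edges (or, as MLDA does, by introducing fake nodes), which is the hard part and where utility-preserving edge cost criteria come in.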