
ACM Transactions on Parallel Computing: Latest Publications

Performance Implication of Tensor Irregularity and Optimization for Distributed Tensor Decomposition
IF 1.6 Q2 Computer Science Pub Date : 2023-02-07 DOI: 10.1145/3580315
Zheng Miao, Jon C. Calhoun, Rong Ge, Jiajia Li
Tensors are used by a wide variety of applications to represent multi-dimensional data; tensor decompositions are a class of methods for latent data analytics, data compression, and so on. Many of these applications generate large tensors with irregular dimension sizes and nonzero distributions. CANDECOMP/PARAFAC decomposition (Cpd) is a popular low-rank tensor decomposition for discovering latent features. The growing memory and execution-time overhead of Cpd for large tensors makes distributed-memory implementations the only feasible solution. The sparsity and irregularity of tensors hinder the performance and scalability of such implementations. While previous works have proved successful for Cpd on tensors with relatively regular dimension sizes and nonzero distributions, they either deliver unsatisfactory performance and scalability for irregular tensors or require significant preprocessing time. In this work, we focus on medium-grained tensor distribution to address their limitations for irregular tensors. We first investigate the problem thoroughly through theoretical and experimental analysis. We show that the main cause of poor Cpd performance and scalability is the imbalance among multiple types of computations and communications and the tradeoffs between them; sparsity and irregularity make these balances and tradeoffs challenging to achieve. Irregularity of a sparse tensor is categorized along two aspects: very different dimension sizes and a non-uniform nonzero distribution. Typically, optimizing one type of load imbalance makes the others more severe for irregular tensors. To address these challenges, we propose an irregularity-aware distributed Cpd that leverages sparsity and irregularity information to identify the best tradeoff between the different imbalances with low time overhead. We materialize the idea with two optimization methods: a prediction-based grid configuration and a matrix-oriented distribution policy, where the former establishes a global balance among computations and communications, and the latter further adjusts the balance among computations. The experimental results show that our irregularity-aware distributed Cpd is more scalable and outperforms medium- and fine-grained distributed implementations by up to 4.4× and 11.4× on 1,536 processors, respectively. Our optimizations support different sparse tensor formats, such as compressed sparse fiber (CSF), coordinate (COO), and hierarchical coordinate (HiCOO), and achieve good scalability for all of them.
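For readers unfamiliar with Cpd, the sketch below illustrates the computational core the paper distributes: a plain, dense, single-process CP-ALS loop in Python/NumPy, where each factor update is an MTTKRP followed by a small rank-sized solve. It is not the authors' medium-grained distributed implementation; the function names, the unfolding convention, and the fixed iteration count are illustrative assumptions.

```python
import numpy as np

def khatri_rao(U, V):
    # Column-wise Kronecker product: row (i * V.shape[0] + j) holds U[i, :] * V[j, :].
    return (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])

def unfold(X, mode):
    # Mode-n matricization whose column ordering matches khatri_rao() above.
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def cp_als(X, rank, iters=50, seed=0):
    # Plain (dense, single-process) CP-ALS; each update is an MTTKRP followed by
    # a small rank x rank solve -- the kernels a distributed Cpd has to load-balance.
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((s, rank)) for s in X.shape)
    for _ in range(iters):
        A = unfold(X, 0) @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = unfold(X, 1) @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = unfold(X, 2) @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Tiny usage example: recover a rank-2 synthetic tensor and print the residual.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A0, B0, C0 = rng.random((6, 2)), rng.random((5, 2)), rng.random((4, 2))
    X = np.einsum("ir,jr,kr->ijk", A0, B0, C0)
    A, B, C = cp_als(X, rank=2)
    print(np.linalg.norm(X - np.einsum("ir,jr,kr->ijk", A, B, C)))
```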
Citations: 0
Tridigpu: A GPU Library for Block Tridiagonal and Banded Linear Equation Systems
IF 1.6 Q2 Computer Science Pub Date : 2023-01-31 DOI: 10.1145/3580373
Christopher J. Klein, R. Strzodka
In this article, we present a CUDA library with a C API for solving block cyclic tridiagonal and banded systems on one GPU. The library can process block tridiagonal systems with block sizes from 1 × 1 (scalar) to 4 × 4 and banded systems with up to four sub- and superdiagonals. For the compute-intensive block-size cases and cases with many right-hand sides, we write out an explicit factorization to memory; for the scalar case, however, the fastest approach is to output only the coarse system and recompute the factorization. Prominent features of the library are (scaled) partial pivoting for improved numerical stability; highest-performance kernels, which fully utilize GPU memory bandwidth; and support for multiple sparse or dense right-hand-side and solution vectors. The additional memory consumption is only 5% of the original tridiagonal system, which enables the solution of systems up to GPU memory size. The state-of-the-art scalar tridiagonal solver of cuSPARSE is outperformed by a factor of 5 for large problem sizes of 2²⁵ unknowns on a GeForce RTX 2080 Ti.
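As context for what the library accelerates, here is the textbook scalar Thomas algorithm for a single tridiagonal system in Python/NumPy: the sequential, no-pivoting baseline that Tridigpu's GPU kernels generalize to blocks, bands, and cyclic systems. The function and variable names are mine, not the library's C API.

```python
import numpy as np

def thomas_solve(lower, diag, upper, rhs):
    # Sequential Thomas algorithm for a scalar tridiagonal system (no pivoting).
    # lower[i] couples row i to i-1, upper[i] couples row i to i+1
    # (lower[0] and upper[-1] are unused).
    n = diag.size
    c = np.empty(n)
    d = np.empty(n)
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):                       # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = np.empty(n)
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):              # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# Usage: compare against a dense solve on a small diagonally dominant system.
n = 8
rng = np.random.default_rng(0)
lo, up = rng.random(n), rng.random(n)
di = 2.0 + lo + up                              # diagonal dominance => no pivoting needed
A = np.diag(di) + np.diag(lo[1:], -1) + np.diag(up[:-1], 1)
b = rng.random(n)
print(np.allclose(thomas_solve(lo, di, up, b), np.linalg.solve(A, b)))
```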
Citations: 0
Non-overlapping High-accuracy Parallel Closure for Compact Schemes: Application in Multiphysics and Complex Geometry
IF 1.6 Q2 Computer Science Pub Date : 2023-01-17 DOI: 10.1145/3580005
P. Sundaram, A. Sengupta, V. K. Suman, T. Sengupta
Compact schemes are often preferred in scientific computing for their superior spectral resolution. Error-free parallelization of a compact scheme is a challenging task due to the requirement of additional closures at the inter-processor boundaries. Here, the sources of error due to sub-domain boundary closures for compact schemes are analyzed with global spectral analysis. A high-accuracy parallel computing strategy devised in "A high-accuracy preserving parallel algorithm for compact schemes for DNS," ACM Trans. Parallel Comput. 7, 4, 1-32 (2020), systematically eliminates the error due to parallelization and does not require overlapping points at the sub-domain boundaries. This closure is applicable to any compact scheme and is termed here the non-overlapping high-accuracy parallel (NOHAP) sub-domain boundary closure. In the present work, the advantages of the NOHAP closure are demonstrated with the model convection equation and by solving the compressible Navier–Stokes equations for three-dimensional Rayleigh–Taylor instability simulations involving multiphysics dynamics and for high-Reynolds-number flow past a natural laminar flow airfoil using a body-conforming curvilinear coordinate system. Linear scalability of the NOHAP closure is shown for large-scale simulations using up to 19,200 processors.
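To make the role of the boundary closure concrete, the following sketch evaluates a classical fourth-order Padé (compact) first derivative on a periodic grid: the derivative values are implicitly coupled through a cyclic tridiagonal system, which is exactly why splitting the domain across processors requires closures such as NOHAP. This is a serial, dense-solve illustration with my own variable names, not the paper's scheme or its parallel algorithm.

```python
import numpy as np

# Fourth-order Pade (compact) first derivative on a periodic grid:
#     f'[i-1] + 4 f'[i] + f'[i+1] = 3 (f[i+1] - f[i-1]) / h
# Every derivative value is coupled to its neighbours, so the whole grid line
# must be solved together; here the tiny system is solved densely on one process.
def compact_derivative_periodic(f, h):
    n = f.size
    A = 4.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
    A[0, -1] = A[-1, 0] = 1.0                      # periodic wrap-around
    rhs = 3.0 * (np.roll(f, -1) - np.roll(f, 1)) / h
    return np.linalg.solve(A, rhs)

# Usage: differentiate sin(x) and compare with cos(x).
n = 64
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
h = x[1] - x[0]
err = np.max(np.abs(compact_derivative_periodic(np.sin(x), h) - np.cos(x)))
print(f"max error: {err:.2e}")   # fourth-order: the error drops ~16x when n doubles
```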
Citations: 4
Fast Parallel Algorithms for Enumeration of Simple, Temporal, and Hop-constrained Cycles
IF 1.6 Q2 Computer Science Pub Date : 2023-01-03 DOI: 10.1145/3611642
J. Blanuša, K. Atasu, P. Ienne
Cycles are one of the fundamental subgraph patterns, and being able to enumerate them in graphs enables important applications in a wide variety of fields, including finance, biology, chemistry, and network science. However, to enable cycle enumeration in real-world applications, efficient parallel algorithms are required. In this work, we propose scalable parallelisations of state-of-the-art sequential algorithms for enumerating simple, temporal, and hop-constrained cycles. First, we focus on the simple cycle enumeration problem and parallelise the algorithms by Johnson and by Read and Tarjan in a fine-grained manner. We theoretically show that our resulting fine-grained parallel algorithms are scalable, with the fine-grained parallel Read-Tarjan algorithm being strongly scalable. In contrast, we show that straightforward coarse-grained parallel versions of these simple cycle enumeration algorithms that exploit edge- or vertex-level parallelism are not scalable. Next, we adapt our fine-grained approach to enable the enumeration of cycles under time-window, temporal, and hop constraints. Our evaluation on a cluster with 256 CPU cores that can execute up to 1,024 simultaneous threads demonstrates near-linear scalability of our fine-grained parallel algorithms when enumerating cycles under the aforementioned constraints. On the same cluster, our fine-grained parallel algorithms achieve, on average, a one-order-of-magnitude speedup over the respective coarse-grained parallel versions of the state-of-the-art algorithms for cycle enumeration. The performance gap between the fine-grained and the coarse-grained parallel algorithms increases as we use more CPU cores.
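For orientation, the sketch below enumerates simple directed cycles with a plain backtracking search, reporting each cycle exactly once from its smallest vertex. It is only the recursive skeleton that Johnson's and Read-Tarjan's algorithms refine with blocking and pruning, and that the paper parallelises in a fine-grained way; the names and the duplicate-avoidance convention are illustrative choices.

```python
from collections import defaultdict

def simple_cycles(edges):
    # Enumerate the simple directed cycles of a graph, each reported once and
    # rooted at its smallest vertex.  Brute-force backtracking baseline only.
    graph = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)

    def extend(start, v, path, on_path):
        for w in graph[v]:
            if w == start:
                yield path + [start]              # closed a cycle back at the root
            elif w > start and w not in on_path:  # only visit vertices larger than the root
                on_path.add(w)
                yield from extend(start, w, path + [w], on_path)
                on_path.remove(w)

    for s in sorted(graph):
        yield from extend(s, s, [s], {s})

# Usage: a 4-vertex graph with two directed cycles.
print(list(simple_cycles([(0, 1), (1, 2), (2, 0), (1, 3), (3, 1)])))
# [[0, 1, 2, 0], [1, 3, 1]]
```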
Citations: 2
Parallel Minimum Cuts in O(m log² n) Work and Low Depth
IF 1.6 Q2 Computer Science Pub Date : 2022-12-16 DOI: 10.1145/3565557
Daniel Anderson, G. Blelloch
We present a randomized O(m log² n) work, O(polylog n) depth parallel algorithm for minimum cut. This algorithm matches the work bounds of a recent sequential algorithm by Gawrychowski, Mozes, and Weimann [ICALP’20], and improves on the previously best parallel algorithm by Geissmann and Gianinazzi [SPAA’18], which performs O(m log⁴ n) work in O(polylog n) depth. Our algorithm makes use of three components that might be of independent interest. Firstly, we design a parallel data structure that efficiently supports batched mixed queries and updates on trees. It generalizes and improves the work bounds of a previous data structure of Geissmann and Gianinazzi and is work efficient with respect to the best sequential algorithm. Secondly, we design a parallel algorithm for approximate minimum cut that improves on previous results by Karger and Motwani. We use this algorithm to give a work-efficient procedure to produce a tree packing, as in Karger’s sequential algorithm for minimum cuts. Lastly, we design an efficient parallel algorithm for solving the minimum 2-respecting cut problem.
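As a point of reference for the problem being solved, here is Karger's classic random-contraction heuristic for the global minimum cut in Python. The paper's parallel algorithm is built on tree packings and batched tree queries rather than contraction, so this is only a baseline sketch; it assumes an unweighted, connected input graph, and the trial count is an illustrative choice.

```python
import random

def karger_min_cut(edges, trials=200, seed=0):
    # Estimate the global minimum cut of an unweighted, connected graph by
    # repeated random edge contraction (Karger's sequential baseline).
    rng = random.Random(seed)
    vertices = {v for e in edges for v in e}
    best = float("inf")
    for _ in range(trials):
        parent = {v: v for v in vertices}

        def find(v):                       # union-find with path halving
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v

        remaining = len(vertices)
        while remaining > 2:
            u, v = edges[rng.randrange(len(edges))]
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv            # contract the edge
                remaining -= 1
        cut = sum(1 for u, v in edges if find(u) != find(v))
        best = min(best, cut)
    return best

# Usage: two triangles joined by a single bridge edge -> minimum cut is 1.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
print(karger_min_cut(edges))   # 1
```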
Citations: 3
Optimal Algorithms for Right-sizing Data Centers
IF 1.6 Q2 Computer Science Pub Date : 2022-10-11 DOI: 10.1145/3565513
S. Albers, Jens Quedenfeld
Electricity cost is a dominant and rapidly growing expense in data centers. Unfortunately, much of the consumed energy is wasted, because servers are idle for extended periods of time. We study a capacity management problem that dynamically right-sizes a data center, matching the number of active servers with the varying demand for computing capacity. We resort to a data-center optimization problem introduced by Lin, Wierman, Andrew, and Thereska [25, 27] that, over a time horizon, minimizes a combined objective function consisting of operating cost, modeled by a sequence of convex functions, and server switching cost. All prior work addresses a continuous setting in which the number of active servers, at any time, may take a fractional value. In this article, we investigate for the first time the discrete data-center optimization problem where the number of active servers, at any time, must be integer valued. Thereby, we seek truly feasible solutions. First, we show that the offline problem can be solved in polynomial time. Our algorithm relies on a new, yet intuitive graph theoretic model of the optimization problem and performs binary search in a layered graph. Second, we study the online problem and extend the algorithm Lazy Capacity Provisioning (LCP) by Lin et al. [25, 27] to the discrete setting. We prove that LCP is 3-competitive. Moreover, we show that no deterministic online algorithm can achieve a competitive ratio smaller than 3. Hence, while LCP does not attain an optimal competitiveness in the continuous setting, it does so in the discrete problem examined here. We prove that the lower bound of 3 also holds in a problem variant with more restricted operating cost functions, introduced by Lin et al. [25]. In addition, we develop a randomized online algorithm that is 2-competitive against an oblivious adversary. It is based on the algorithm of Bansal et al. [7] (a deterministic, 2-competitive algorithm for the continuous setting) and uses randomized rounding to obtain an integral solution. Moreover, we prove that 2 is a lower bound for the competitive ratio of randomized online algorithms, so our algorithm is optimal. We prove that the lower bound still holds for the more restricted model. Finally, we address the continuous setting and give a lower bound of 2 on the best competitiveness of online algorithms. This matches an upper bound by Bansal et al. [7]. A lower bound of 2 was also shown by Antoniadis and Schewior [4]. We develop an independent proof that extends to the scenario with more restricted operating cost.
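To make the objective function concrete, the sketch below solves a tiny offline instance of the discrete right-sizing problem by dynamic programming: convex per-step operating costs plus a switching cost charged on increases in the number of active servers. The plain O(T·M²) DP is only illustrative, not the paper's polynomial-time graph-based offline algorithm or the online LCP, and the demand and cost numbers are made up.

```python
def offline_right_sizing(cost_fns, max_servers, switch_cost):
    # Minimize  sum_t f_t(x_t) + switch_cost * max(0, x_t - x_{t-1})
    # over integer x_t in {0, ..., max_servers}, starting from x_0 = 0.
    states = range(max_servers + 1)
    dp = {x: switch_cost * x + cost_fns[0](x) for x in states}
    choice = [{x: 0 for x in states}]
    for f in cost_fns[1:]:
        new_dp, step = {}, {}
        for x in states:
            best_prev = min(states, key=lambda p: dp[p] + switch_cost * max(0, x - p))
            new_dp[x] = dp[best_prev] + switch_cost * max(0, x - best_prev) + f(x)
            step[x] = best_prev
        dp, choice = new_dp, choice + [step]
    # Reconstruct the optimal schedule by walking the recorded choices backwards.
    x = min(states, key=lambda s: dp[s])
    schedule = [x]
    for step in reversed(choice[1:]):
        x = step[x]
        schedule.append(x)
    return list(reversed(schedule)), min(dp.values())

# Usage: demand rises then falls; an idle server costs 1, unmet demand costs 5 per unit.
demand = [2, 5, 5, 1]
costs = [lambda x, d=d: x * 1 + max(0, d - x) * 5 for d in demand]
schedule, total = offline_right_sizing(costs, max_servers=6, switch_cost=3)
print(schedule, total)   # -> [2, 5, 5, 1] 28
```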
Citations: 0
A Family of Relaxed Concurrent Queues for Low-Latency Operations and Item Transfers
IF 1.6 Q2 Computer Science Pub Date : 2022-10-04 DOI: 10.1145/3565514
Giorgos Kappes, S. Anastasiadis
The producer-consumer communication over shared memory is a critical function of current scalable systems. Queues that provide low latency and high throughput on highly utilized systems can improve the overall performance perceived by the end users. To address this demand, we prioritize achieving both high operation performance and high item-transfer speed. The Relaxed Concurrent Queues (RCQs) are a family of queues that we have designed and implemented for that purpose. Our key idea is a relaxed ordering model that splits the enqueue and dequeue operations into a stage of sequential assignment to a queue slot and a stage of concurrent execution across the slots. At each slot, we apply no order restrictions among operations of the same type. We define several variants of the RCQ algorithms with respect to offered concurrency, required hardware instructions, supported operations, occupied memory space, and precondition handling. For specific RCQ algorithms, we provide pseudo-code definitions and reason about their correctness and progress properties. Additionally, we theoretically estimate and experimentally validate the worst-case distance between an RCQ algorithm and a strict first-in-first-out (FIFO) queue. We develop prototype implementations of the RCQ algorithms and experimentally compare them with several representative strict FIFO and relaxed data structures over a range of workload and system settings. The RCQS algorithm is a provably linearizable, lock-free member of the RCQ family. We experimentally show that RCQS achieves advantages ranging from constant factors to orders of magnitude over state-of-the-art strict or relaxed queue algorithms across several latency and throughput statistics of the queue operations and item transfers.
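The following toy Python class mimics the two-stage idea described above: an operation first receives a slot index (the sequential assignment stage) and then completes its work on that slot concurrently with other threads. It uses a lock and blocking waits purely for readability; the real RCQ algorithms are lock-free, rely on hardware fetch-and-add, handle slot reuse, and come with the correctness proofs mentioned in the abstract, none of which this sketch attempts.

```python
import threading

class RelaxedSlotQueue:
    # Didactic model only: stage 1 assigns a slot sequentially, stage 2 runs
    # concurrently per slot; ordering is enforced per slot, not across slots.
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.full = [threading.Event() for _ in range(capacity)]
        self.head = 0
        self.tail = 0
        self.lock = threading.Lock()

    def enqueue(self, item):
        with self.lock:                      # stage 1: sequential slot assignment
            slot = self.tail
            self.tail += 1
        self.slots[slot] = item              # stage 2: concurrent per-slot work
        self.full[slot].set()

    def dequeue(self):
        with self.lock:                      # stage 1: sequential slot assignment
            slot = self.head
            self.head += 1
        self.full[slot].wait()               # stage 2: wait only for that slot's item
        return self.slots[slot]

# Usage: four producers fill slots concurrently while the main thread drains them.
q = RelaxedSlotQueue(capacity=100)
producers = [threading.Thread(target=lambda i=i: [q.enqueue((i, n)) for n in range(5)])
             for i in range(4)]
for t in producers:
    t.start()
print([q.dequeue() for _ in range(20)])
for t in producers:
    t.join()
```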
Citations: 0
Orthogonal Layers of Parallelism in Large-Scale Eigenvalue Computations
IF 1.6 Q2 Computer Science Pub Date : 2022-09-05 DOI: 10.1145/3614444
A. Alvermann, G. Hager, H. Fehske
We address the communication overhead of distributed sparse matrix-(multiple)-vector multiplication in the context of large-scale eigensolvers, using filter diagonalization as an example. The basis of our study is a performance model, which includes a communication metric that is computed directly from the matrix sparsity pattern without running any code. The performance model quantifies to what extent scalability and parallel efficiency are lost due to communication overhead. To restore scalability, we identify two orthogonal layers of parallelism in the filter diagonalization technique. In the horizontal layer, the rows of the sparse matrix are distributed across individual processes. In the vertical layer, bundles of multiple vectors are distributed across separate process groups. An analysis in terms of the communication metric predicts that scalability can be restored if, and only if, one implements the two orthogonal layers of parallelism via different distributed vector layouts. Our theoretical analysis is corroborated by benchmarks for application matrices from quantum and solid-state physics, road networks, and nonlinear programming. We finally demonstrate the benefits of using orthogonal layers of parallelism with two exemplary application cases—an exciton and a strongly correlated electron system—which incur either small or large communication overhead.
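In the spirit of the communication metric described above, the snippet below counts, for a block row distribution of a sparse matrix, how many remote vector entries each process needs for one sparse matrix-vector product, computed from the sparsity pattern alone. The paper's model additionally covers multiple vectors and the vertical process-group layer, so treat this as a stripped-down sketch with invented function and variable names.

```python
import numpy as np

def spmv_comm_volume(rows, cols, n, n_procs):
    # For y = A @ x under a block row distribution, a process must receive every
    # x[j] it touches that lives on another process; count those distinct j's.
    rows = np.asarray(rows)
    cols = np.asarray(cols)
    block = -(-n // n_procs)                       # ceil(n / n_procs) rows per process
    owner_of_row = rows // block                   # process that computes this entry
    owner_of_col = cols // block                   # process that owns x[col]
    volume = []
    for p in range(n_procs):
        mask = (owner_of_row == p) & (owner_of_col != p)
        volume.append(len(np.unique(cols[mask])))  # distinct remote x entries needed
    return volume

# Usage: a 1D Laplacian (tridiagonal) pattern only needs 1-2 halo values per process.
n = 16
rows = np.concatenate([np.arange(n), np.arange(1, n), np.arange(n - 1)])
cols = np.concatenate([np.arange(n), np.arange(n - 1), np.arange(1, n)])
print(spmv_comm_volume(rows, cols, n, n_procs=4))   # [1, 2, 2, 1]
```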
Citations: 1
Checkpointing Workflows à la Young/Daly Is Not Good Enough
IF 1.6 Q2 Computer Science Pub Date : 2022-09-02 DOI: 10.1145/3548607
A. Benoit, Lucas Perotin, Y. Robert, Hongyang Sun
This article revisits checkpointing strategies when workflows composed of multiple tasks execute on a parallel platform. The objective is to minimize the expectation of the total execution time. For a single task, the Young/Daly formula provides the optimal checkpointing period. However, when many tasks execute simultaneously, the risk that one of them is severely delayed increases with the number of tasks. To mitigate this risk, a possibility is to checkpoint each task more often than with the Young/Daly strategy. But is it worth slowing each task down with extra checkpoints? Does the extra checkpointing make a difference globally? This article answers these questions. On the theoretical side, we prove several negative results for keeping the Young/Daly period when many tasks execute concurrently, and we design novel checkpointing strategies that guarantee an efficient execution with high probability. On the practical side, we report comprehensive experiments that demonstrate the need to go beyond the Young/Daly period and to checkpoint more often for a wide range of application/platform settings.
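For reference, the Young/Daly period mentioned above is W* = √(2µC) for checkpoint cost C and mean time between failures µ; the snippet evaluates it together with the standard first-order waste model that it minimizes for a single task. The numbers are illustrative, and the closing comment merely restates the article's argument rather than reproducing its new strategies.

```python
import math

def young_daly_period(checkpoint_cost, mtbf):
    # Young/Daly checkpointing period W* = sqrt(2 * C * mu).
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def waste(period, checkpoint_cost, mtbf):
    # First-order fraction of time lost: C/W for writing checkpoints plus
    # W/(2*mu) for work re-executed after a failure; minimized at W*.
    return checkpoint_cost / period + period / (2.0 * mtbf)

# Usage: C = 60 s checkpoint cost, mu = 24 h mean time between failures.
C, mu = 60.0, 24 * 3600.0
w_star = young_daly_period(C, mu)               # about 3220 s, i.e. roughly 54 min
for w in (0.5 * w_star, w_star, 2.0 * w_star):
    print(f"period {w:7.0f} s -> waste {waste(w, C, mu):.4f}")
# The article's point: with many concurrent tasks, checkpointing each task more
# often than W* (accepting slightly higher per-task waste) can reduce the
# expected overall makespan, because it tames badly delayed stragglers.
```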
Citations: 2
Improving the Speed and Quality of Parallel Graph Coloring
IF 1.6 Q2 Computer Science Pub Date : 2022-07-11 DOI: 10.1145/3543545
Ghadeer Alabandi, Martin Burtscher
Graph coloring assigns a color to each vertex of a graph such that no two adjacent vertices get the same color. It is a key building block in many applications. In practice, solutions that require fewer distinct colors and that can be computed faster are typically preferred. Various coloring heuristics exist that provide different quality-versus-speed tradeoffs. The highest-quality heuristics tend to be slow. To improve performance, several parallel implementations have been proposed. This paper describes two improvements of the widely used LDF heuristic. First, we present a “shortcutting” approach that increases the parallelism by non-speculatively breaking data dependencies. Second, we present “color reduction” techniques to improve the solution quality of LDF. On 18 graphs from various domains, the shortcutting approach yields 2.5 times more parallelism on average, and the color-reduction techniques improve the result quality by up to 20%. Our deterministic CUDA implementation running on a Titan V is, on average, 2.9 times faster than the best GPU codes from the literature and uses as few or fewer colors.
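Assuming LDF refers, as is common, to the largest-degree-first greedy heuristic, the sketch below shows its sequential form: vertices are colored in order of decreasing degree, each taking the smallest color not used by an already-colored neighbor. The paper's contributions (GPU parallelization, shortcutting, color reduction) sit on top of this baseline; the code and the example graph are mine.

```python
from collections import defaultdict

def ldf_coloring(edges):
    # Sequential largest-degree-first (LDF) greedy coloring of an undirected graph.
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color = {}
    for v in sorted(adj, key=lambda v: len(adj[v]), reverse=True):
        taken = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in taken:                 # smallest color unused by colored neighbors
            c += 1
        color[v] = c
    return color

# Usage: a 5-cycle plus one chord; the triangle 0-1-2 forces at least 3 colors.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
coloring = ldf_coloring(edges)
print(coloring, 1 + max(coloring.values()))   # uses 3 colors
```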
Citations: 0