
Latest publications in Parallel Computing

FastPTM: Fast weights loading of pre-trained models for parallel inference service provisioning
IF 2.0 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-10-10 | DOI: 10.1016/j.parco.2024.103114
Fenglong Cai, Dong Yuan, Zhe Yang, Yonghui Xu, Wei He, Wei Guo, Lizhen Cui
Pre-trained models (PTMs) have demonstrated great success in a variety of NLP and CV tasks and have become a significant development in the field of deep learning. However, the large memory and high computational requirements associated with PTMs can increase the cost and time of inference, limiting their service provisioning in practical applications. To improve the Quality of Service (QoS) of PTM applications by reducing waiting and response times, we propose the FastPTM framework. This general framework aims to accelerate PTM inference services in a multi-tenant environment by reducing model loading time and switching overhead on GPUs. The framework utilizes a fast weights loading method based on weights and model separation of PTMs to efficiently accelerate parallel inference services in resource-constrained environments. Furthermore, an online scheduling algorithm is designed to reduce the inference service time. The results of the experiments indicate that FastPTM can improve the throughput of inference services by an average of 4x and up to 8.2x, while reducing the number of switches by 4.7x and the number of overtimes by 15.3x.
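To make the weights/model-separation idea concrete, here is a minimal Python sketch of an LRU-managed weight cache: model skeletons stay instantiated while only weight sets are loaded and evicted on a tenant switch. This illustrates the general idea only, not FastPTM's implementation; the class, model names, and capacity are hypothetical.

```python
import collections

class WeightCache:
    """Toy illustration: keep a fixed number of weight sets resident
    (standing in for GPU memory) and evict the least-recently-used set
    on a model switch; the model skeleton itself is never reloaded."""
    def __init__(self, capacity, loader):
        self.capacity = capacity      # how many weight sets fit in memory
        self.loader = loader          # callable: model_id -> weights
        self.resident = collections.OrderedDict()

    def get(self, model_id):
        if model_id in self.resident:          # hit: no loading cost
            self.resident.move_to_end(model_id)
            return self.resident[model_id]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict the LRU weight set
        weights = self.loader(model_id)        # only the weights move
        self.resident[model_id] = weights
        return weights

# Toy usage: loading weights is the only per-switch cost.
cache = WeightCache(capacity=2, loader=lambda mid: {"w": f"weights-of-{mid}"})
for request in ["bert", "gpt2", "bert", "vit", "gpt2"]:
    _ = cache.get(request)
```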
Citations: 0
Distributed consensus-based estimation of the leading eigenvalue of a non-negative irreducible matrix
IF 2.0 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-10-05 | DOI: 10.1016/j.parco.2024.103113
Rahim Alizadeh, Shahriar Bijani, Fatemeh Shakeri
This paper presents an algorithm to solve the problem of estimating the largest eigenvalue and its corresponding eigenvector for irreducible matrices in a distributed manner. The proposed algorithm utilizes a network of computational nodes that interact with each other, forming a strongly connected digraph where each node handles one row of the matrix, without the need for centralized storage or knowledge of the entire matrix. Each node possesses a solution space, and the intersection of all these solution spaces contains the leading eigenvector of the matrix. Initially, each node selects a random vector from its solution space, and then, while interacting with its neighbors, updates the vector at each step by solving a quadratically constrained linear program (QCLP). The updates are done so that the nodes reach a consensus on the leading eigenvector of the matrix. The numerical outcomes demonstrate the effectiveness of our proposed method.
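For intuition, the row-distributed estimate can be pictured with the classical power-iteration analogue below: each node owns one row of A and reads only the values of its neighbors. The paper's actual per-node QCLP update and consensus protocol are not reproduced here; this numpy sketch is a simplified stand-in.

```python
import numpy as np

# Toy network: "node" i owns row i of a non-negative irreducible matrix A.
A = np.array([[0.0, 2.0, 1.0],
              [1.0, 0.0, 3.0],
              [2.0, 1.0, 0.0]])
n = A.shape[0]

x = np.random.rand(n)              # entry x[i] is held by node i
for _ in range(200):
    # Node i computes its row product using only neighbors' values x[j]
    # with A[i, j] != 0 (its in-neighbors in the digraph).
    y = np.array([A[i] @ x for i in range(n)])
    x = y / np.linalg.norm(y)      # in a truly distributed setting this
                                   # normalization needs a consensus sum

lam = x @ (A @ x)                  # Rayleigh quotient: approximates the
print(lam, x)                      # leading eigenvalue and eigenvector
```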
Citations: 0
Parallel Pattern Compiler for Automatic Global Optimizations
IF 2.0 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-21 | DOI: 10.1016/j.parco.2024.103112
Adrian Schmitz, Semih Burak, Julian Miller, Matthias S. Müller
High-performance computing (HPC) systems enable scientific advances through simulation and data processing. The heterogeneity in HPC hardware and software increases the application complexity and reduces its maintainability and productivity. This work proposes a prototype implementation for a parallel pattern-based source-to-source compiler to address these challenges. The prototype limits the complexity of parallelism and heterogeneous architectures to parallel patterns that are optimized towards a given target architecture. By applying high-level optimizations and a mapping between parallel patterns and execution units during compile time, portability between systems is achieved. The compiler can address architectures with shared memory, distributed memory, and accelerator offloading.
The approach shows speedups on seven of the nine supported Rodinia benchmarks, reaching up to twelve times. Porting LULESH to the Parallel Pattern Language (PPL) shows a compression of code size by 65% (3.4 thousand lines of code) through a more concise expression and a higher level of abstraction. The tool's limitations include dynamic algorithms that are challenging to analyze statically and overheads during compile-time optimization. This paper is an extended version of a previous PMAM publication (Schmitz et al., 2024).
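The core idea, one pattern expression specialized to several targets at compile time, can be sketched in a few lines of Python. The function names and the two targets below are hypothetical stand-ins, not the prototype's actual interface, which targets shared memory, distributed memory, and accelerators from source code.

```python
from multiprocessing import Pool

def compile_map(fn, target="serial", workers=4):
    """Return an executable for the 'map' pattern, specialized per target."""
    if target == "serial":
        return lambda xs: [fn(x) for x in xs]
    if target == "shared-memory":
        def run(xs):
            with Pool(workers) as pool:   # same pattern, parallel backend
                return pool.map(fn, xs)
        return run
    raise ValueError(f"unknown target: {target}")

def square(x):          # module-level so it can be pickled for Pool
    return x * x

if __name__ == "__main__":
    for target in ("serial", "shared-memory"):
        mapped = compile_map(square, target)
        print(target, mapped(range(8)))
```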
Citations: 0
Task scheduling in cloud computing based on grey wolf optimization with a new encoding mechanism
IF 2.0 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-17 | DOI: 10.1016/j.parco.2024.103111
Xingwang Huang, Min Xie, Dong An, Shubin Su, Zongliang Zhang

Task scheduling in cloud computing remains challenging in terms of performance. Several evolution-derived algorithms have been proposed to solve or alleviate this problem; they offer good exploration ability, but their performance drops significantly in high dimensions. To address this issue, considering the characteristic of task scheduling in cloud computing (i.e., all task-VM mappings are 1-dimensional and have the same search range), we propose a task scheduling algorithm based on grey wolf optimization using a new encoding mechanism (GWOEM). Through this new encoding mechanism, greedy and evolutionary algorithms are rationally integrated in GWOEM. Besides, based on the new mechanism, the dimension of the search space is reduced to 1 and the key parameter (i.e., the population size) is eliminated. We apply the proposed GWOEM to the Google Cloud Jobs dataset (GoCJ) and demonstrate better performance than the prior state of the art in terms of makespan.
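For orientation, the sketch below runs textbook grey wolf optimization over a conventional per-task continuous encoding of the task-to-VM mapping with makespan as fitness. GWOEM's new encoding, which collapses the search space to one dimension and removes the population-size parameter, is deliberately not reproduced; task lengths and VM speeds are invented numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
tasks = rng.uniform(5, 50, size=30)      # hypothetical task lengths
vm_speed = np.array([1.0, 2.0, 4.0])     # hypothetical VM speeds
n_vms = len(vm_speed)

def makespan(pos):
    # Decode a continuous position vector into a task -> VM mapping.
    assign = np.clip(pos.astype(int), 0, n_vms - 1)
    loads = np.zeros(n_vms)
    for t, v in zip(tasks, assign):
        loads[v] += t / vm_speed[v]
    return loads.max()

# Standard GWO loop with alpha/beta/delta leaders guiding the pack.
wolves = rng.uniform(0, n_vms, size=(20, len(tasks)))
for it in range(100):
    a = 2 - 2 * it / 100                 # exploration factor decays to 0
    order = np.argsort([makespan(w) for w in wolves])
    alpha, beta, delta = wolves[order[:3]]
    for i in range(len(wolves)):
        X = np.zeros_like(wolves[i])
        for leader in (alpha, beta, delta):
            r1, r2 = rng.random(len(tasks)), rng.random(len(tasks))
            A, C = 2 * a * r1 - a, 2 * r2
            X += leader - A * np.abs(C * leader - wolves[i])
        wolves[i] = np.clip(X / 3, 0, n_vms - 1e-9)

best = min(wolves, key=makespan)
print("best makespan:", makespan(best))
```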

Citations: 0
An automated OpenMP mutation testing framework for performance optimization
IF 2.0 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-21 | DOI: 10.1016/j.parco.2024.103097
Dolores Miao, Ignacio Laguna, Giorgis Georgakoudis, Konstantinos Parasyris, Cindy Rubio-González

Performance optimization continues to be a challenge in modern HPC software. Existing performance optimization techniques, including profiling-based and auto-tuning techniques, fail to indicate program modifications at the source level, thus preventing their portability across compilers. This paper describes Muppet, a new approach that identifies program modifications called mutations aimed at improving program performance. Muppet's mutations help developers reason about performance defects and missed opportunities to improve performance at the source code level. In contrast to compiler techniques that optimize code at intermediate representations (IR), Muppet uses the idea of source-level mutation testing to relax correctness constraints and automatically discover optimization opportunities that otherwise are not feasible using the IR. We demonstrate Muppet's concept in the OpenMP programming model. Muppet generates a list of OpenMP mutations that alter the program's parallelism in various ways, and can run a variety of optimization algorithms, such as delta debugging, Bayesian optimization, and decision tree optimization, to find a subset of mutations which, when applied to the original program, yield the greatest speedup while maintaining program correctness. When Muppet is evaluated against a diverse set of benchmark programs and proxy applications, it finds sets of mutations that induce speedup in 75.9% of the evaluated programs.
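To give a flavor of source-level OpenMP mutation, the toy sketch below rewrites a pragma's schedule clause to generate program variants. It is an invented illustration rather than Muppet's mutation set; a real harness would compile each mutant, verify output correctness, and only then time it.

```python
import re

SCHEDULES = ["static", "dynamic", "guided"]   # hypothetical mutation set

def mutate_schedule(source):
    """Yield variants whose '#pragma omp parallel for' schedule differs."""
    pattern = re.compile(r"#pragma omp parallel for( schedule\(\w+\))?")
    for sched in SCHEDULES:
        mutant = pattern.sub(f"#pragma omp parallel for schedule({sched})",
                             source, count=1)
        yield sched, mutant

code = """
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; i++) a[i] = b[i] * c[i];
"""

for sched, mutant in mutate_schedule(code):
    print(f"--- mutant: schedule({sched}) ---")
    print(mutant.strip())
    # A real pipeline would build and run each mutant here, keep only
    # correctness-preserving ones, and search for the fastest subset.
```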

Citations: 0
Abstractions for C++ code optimizations in parallel high-performance applications
IF 2.0 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-14 | DOI: 10.1016/j.parco.2024.103096
Jiří Klepl, Adam Šmelko, Lukáš Rozsypal, Martin Kruliš

Many computational problems consider memory throughput a performance bottleneck, especially in the domain of parallel computing. Software needs to be attuned to hardware features like cache architectures or concurrent memory banks to reach a decent level of performance efficiency. This can be achieved by selecting the right memory layouts for data structures or changing the order of data structure traversal. In this work, we present an abstraction for traversing a set of regular data structures (e.g., multidimensional arrays) that allows the design of traversal-agnostic algorithms. Such algorithms can easily optimize for memory performance and employ semi-automated parallelization or autotuning without altering their internal code. We also add an abstraction for autotuning that allows defining tuning parameters in one place and removes boilerplate code. The proposed solution was implemented as an extension of the Noarr library that simplifies a layout-agnostic design of regular data structures. It is implemented entirely using C++ template meta-programming without any nonstandard dependencies, so it is fully compatible with existing compilers, including CUDA NVCC or Intel DPC++. We evaluate the performance and expressiveness of our approach on the Polybench-C benchmarks.
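The layout-agnostic idea maps naturally onto index functions: an algorithm walks logical (i, j) coordinates and a pluggable layout decides the memory offset. Below is a minimal Python analogue of the C++ template approach, with hypothetical names; Noarr itself expresses this through compile-time structures.

```python
# Two layouts for the same logical 2-D array, expressed as index functions.
def row_major(rows, cols):
    return lambda i, j: i * cols + j

def col_major(rows, cols):
    return lambda i, j: j * rows + i

def total(data, rows, cols, idx):
    """Layout-agnostic traversal: the algorithm fixes only the logical
    (i, j) space and delegates memory placement to the index function."""
    return sum(data[idx(i, j)] for i in range(rows) for j in range(cols))

rows, cols = 3, 4
flat = list(range(rows * cols))
for name, layout in (("row-major", row_major), ("column-major", col_major)):
    idx = layout(rows, cols)
    # Same algorithm; only the physical placement of element (1, 2) moves.
    print(name, "offset of (1,2):", idx(1, 2),
          "sum:", total(flat, rows, cols, idx))
```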

Citations: 0
Mobilizing underutilized storage nodes via job path: A job-aware file striping approach
IF 2.0 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-10 | DOI: 10.1016/j.parco.2024.103095
Gang Xian, Wenxiang Yang, Yusong Tan, Jinghua Feng, Yuqi Li, Jian Zhang, Jie Yu

Users’ limited understanding of the storage system architecture prevents them from fully utilizing the parallel I/O capability of the storage system, leading to a negative impact on the overall performance of supercomputers. Therefore, exploring effective strategies for utilizing parallel I/O capabilities is imperative. In this regard, we conduct an analysis of the workload on two production supercomputers’ Object Storage Targets (OSTs) and study the potential inefficient I/O patterns for high performance computing jobs. Our research findings indicate that under the traditional stripe settings that most supercomputers use to ensure stability, the real-time load on OSTs is severely unbalanced. This imbalance results in I/O requests that fail to fully utilize the available OSTs. To tackle this issue, we propose a job-aware optimization approach, which includes static and dynamic file striping. Static file striping optimizes all user jobs, whereas dynamic file striping employs clustering of job names and job paths to extract similarities among jobs and predict partially stripe-optimizable jobs for users. Additionally, a stripe recovery mechanism is employed to mitigate the negative impact of stripe misconfigurations. This approach appropriately adjusts the file stripe layout based on the job’s I/O pattern, allowing for better mobilization of underutilized OSTs to enhance parallel I/O capabilities. Through experimental verification, the number of OSTs that jobs can use has been increased, effectively improving the parallel I/O performance of the job without significantly affecting operational stability.
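As a rough picture of job-aware striping, the heuristic below picks a stripe count from a job's predicted output size and parallelism; the function and its numbers are invented for illustration, not the paper's policy. On Lustre, the chosen count would typically be applied with something like `lfs setstripe -c <count>` on the job's output directory.

```python
def choose_stripe_count(expected_file_gb, n_ranks, max_osts, per_ost_gb=16):
    """Toy heuristic: stripe wide enough that no single OST holds too much
    of the file, without exceeding rank-level parallelism or OST count."""
    by_size = -(-int(expected_file_gb) // per_ost_gb)   # ceil division
    return max(1, min(max_osts, n_ranks, by_size))

# A job predicted (e.g., from clustering of similar job names and paths)
# to write a 200 GB checkpoint from 64 ranks, on a system with 96 OSTs:
print(choose_stripe_count(expected_file_gb=200, n_ranks=64, max_osts=96))
```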

Citations: 0
NxtSPR: A deadlock-free shortest path routing dedicated to relaying for Triplet-Based many-core Architecture
IF 2.0 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-24 | DOI: 10.1016/j.parco.2024.103094
Chunfeng Li, Karim Soliman, Fei Yin, Jin Wei, Feng Shi

Deadlock-free routing is a significant challenge in Network-on-Chip (NoC) design, as it affects the network's latency, power consumption, and load balance, impacting the performance of multi-processor systems-on-chip. However, achieving deadlock freedom routinely incurs expensive overhead: previous solutions either sacrifice performance or power efficiency to proactively avoid deadlocks, or impose high hardware complexity to resolve deadlocks reactively when they occur. Exploiting the characteristics of the NoC itself to implement deadlock-free routing can be significantly more cost-effective, with less impact on performance. This paper proposes a relay routing algorithm (NxtSPR) with a shortest-path property and a deadlock prevention mechanism based on a synchronized Hamiltonian ring. The proposal is based on an in-depth study of the characteristics of a Triplet-Based many-core Architecture (TriBA) NoC. We establish several important topology-related theorems and formally verify them with proofs. By utilizing the critical subgraph and apex of TriBA, NxtSPR pre-calculates downstream nodes' forwarding ports for packets using a concise judgment strategy. This significantly reduces the computational overhead required for data transmission while optimizing the router pipeline design to decrease packet transmission latency and power consumption compared to other TriBA routing algorithms. We group data transmissions according to the maximum level of Hamiltonian edges a packet traverses during its transmission life cycle; transmissions in different groups are independent, avoiding mutual interference and resource competition and thus eliminating potential deadlocks. Gem5 simulation results show that, under synthetic traffic patterns, NxtSPR reduces average packet latency by 20.19% and 14.76%, and per-packet power consumption by 5.54% and 4.66%, relative to the representative (Table) and up-to-date (SPR4T) routing algorithms, respectively, while improving throughput over them by an average of 18.50% and 4.34%. PARSEC benchmark results show that NxtSPR reduces application runtime by up to 22.30% and 12.82% compared to Table and SPR4T, and running the same applications on TriBA yields a maximum runtime reduction of 10.77% compared to 2D-Mesh.
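The deadlock-prevention ingredient can be shown generically: if every packet traverses node labels along a Hamiltonian ordering monotonically, the channel dependency graph is acyclic and no deadlock cycle can form. The sketch below is classic Hamiltonian-path routing on an invented 4-node graph, not NxtSPR's TriBA-specific relay algorithm.

```python
from collections import deque

# Toy topology as adjacency lists; node labels 0..3 follow a Hamiltonian
# path (0-1-2-3), so a monotone route always exists.
adj = {0: [1, 3], 1: [0, 2, 3], 2: [1, 3], 3: [0, 1, 2]}

def route(src, dst):
    """BFS restricted to one monotone label direction; with all traffic
    monotone, channel dependencies cannot form a cycle (no deadlock)."""
    up = dst > src
    ok = (lambda u, v: v > u) if up else (lambda u, v: v < u)
    queue, prev = deque([src]), {src: None}
    while queue:
        u = queue.popleft()
        if u == dst:                      # rebuild the path found
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if ok(u, v) and v not in prev:
                prev[v] = u
                queue.append(v)
    return None

print(route(0, 3), route(3, 1))
```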

Citations: 0
Multi-GPU 3D k-nearest neighbors computation with application to ICP, point cloud smoothing and normals computation
IF 2.0 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-02 | DOI: 10.1016/j.parco.2024.103093
Alexander Agathos, Philip Azariadis

The k-Nearest Neighbors algorithm is a fundamental algorithm that finds applications in many fields like Machine Learning, Computer Graphics, Computer Vision, and others. The algorithm determines the closest points (d-dimensional) of a reference set R according to a query set of points Q under a specific metric (Euclidean, Mahalanobis, Manhattan, etc.). This work focuses on the utilization of multiple Graphical Processing Units for the acceleration of the k-Nearest Neighbors algorithm with large or very large sets of 3D points. With the proposed approach the space of the reference set is divided into a 3D grid which is used to facilitate the search for the nearest neighbors. The search in the grid is performed in a multiresolution manner starting from a high-resolution grid and ending up in a coarse one, thus accounting for point clouds that may have non-uniform sampling and/or outliers. Three important algorithms in reverse engineering are revisited and new multi-GPU versions are proposed based on the introduced KNN algorithm. More specifically, the new multi-GPU approach is applied to the Iterative Closest Point algorithm, to the point cloud smoothing, and to the point cloud normal vectors computation and orientation problem. A series of tests and experiments have been conducted and discussed in the paper showing the merits of the proposed multi-GPU approach.
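The grid-accelerated search is easy to picture single-threaded: bin the reference cloud into a uniform 3-D grid, then scan outward from the query's cell in growing shells. This toy numpy sketch omits the multiresolution grids and the multi-GPU partitioning the paper describes, and its stopping rule is deliberately crude; an exact version must also check one further shell before returning.

```python
import numpy as np

rng = np.random.default_rng(1)
ref = rng.random((2000, 3))              # reference point cloud R in [0,1]^3
res = 8                                  # grid resolution per axis

def cell_of(p):
    return tuple(np.minimum((p * res).astype(int), res - 1))

grid = {}                                # cell -> indices of points in it
for idx, p in enumerate(ref):
    grid.setdefault(cell_of(p), []).append(idx)

def knn(q, k=5):
    """Collect candidates shell by shell around the query's cell."""
    cx, cy, cz = cell_of(q)
    found = []
    for r in range(res):
        for c in np.ndindex(2 * r + 1, 2 * r + 1, 2 * r + 1):
            if max(abs(c[0] - r), abs(c[1] - r), abs(c[2] - r)) != r:
                continue                 # keep only the new shell at radius r
            cell = (cx + c[0] - r, cy + c[1] - r, cz + c[2] - r)
            if all(0 <= v < res for v in cell):
                found += grid.get(cell, [])
        if len(found) >= k and r >= 1:   # crude cutoff, see note above
            break
    d = np.linalg.norm(ref[found] - q, axis=1)
    return [found[i] for i in np.argsort(d)[:k]]

print(knn(np.array([0.5, 0.5, 0.5])))
```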

Citations: 0
WBSP: Addressing stragglers in distributed machine learning with worker-busy synchronous parallel
IF 2.0 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-06-29 | DOI: 10.1016/j.parco.2024.103092
Duo Yang, Bing Hu, An Liu, A-Long Jin, Kwan L. Yeung, Yang You

Parameter server is widely used in distributed machine learning to accelerate training. However, the increasing heterogeneity of workers’ computing capabilities leads to the issue of stragglers, making parameter synchronization challenging. To address this issue, we propose a solution called Worker-Busy Synchronous Parallel (WBSP). This approach eliminates the waiting time of fast workers during the synchronization process and decouples the gradient upload and model download of fast workers into asymmetric parts. By doing so, it allows fast workers to complete multiple steps of local training and upload more gradients to the server, improving computational resource utilization. Additionally, the global model is only updated when the slowest worker uploads the gradients, ensuring the consistency of global models that are pulled down by all workers and the convergence of the global model. Building upon WBSP, we propose an optimized version to further reduce the communication overhead. It enables parallel execution of communication and computation tasks on workers to shorten the global synchronization interval, thereby improving training speed. We conduct theoretical analyses for the proposed mechanisms. Extensive experiments verify that our mechanism can reduce the required time to achieve the target accuracy by up to 60% compared with the fastest method and increase the proportion of computation time from 55%–72% in existing methods to 91%.
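The timing idea behind WBSP can be simulated in a few lines: a fast worker converts what would be barrier idle time under classic BSP into extra local steps and gradient uploads, while the global update fires only when the straggler reports. Worker speeds, the gradient model, and the learning rate below are all invented.

```python
import numpy as np

rng = np.random.default_rng(2)
speeds = np.array([1.0, 1.0, 0.25])      # worker 2 is the straggler
model = np.zeros(4)                      # toy global model

for rnd in range(3):
    slowest = speeds.min()
    grads = []
    for s in speeds:
        # A fast worker fits int(s / slowest) local steps into the time
        # the straggler needs for one, instead of idling at a barrier.
        steps = int(s / slowest)
        g = sum(rng.normal(size=4) for _ in range(steps)) / steps
        grads.append(g)
    # The global model advances only once the slowest worker uploads,
    # so every worker pulls a consistent model at each round boundary.
    model -= 0.1 * np.mean(grads, axis=0)
    print(f"round {rnd}: model = {model.round(3)}")
```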

Citations: 0