
Parallel Computing: Latest Articles

Towards resilient and energy efficient scalable Krylov solvers
IF 2; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-11-13; DOI: 10.1016/j.parco.2024.103122
Zheng Miao , Jon C. Calhoun , Rong Ge
Exascale computing must simultaneously address both energy efficiency and resilience as power limits impact scalability and faults become more common. Unfortunately, energy efficiency and resilience have traditionally been studied in isolation, and optimizing one typically harms the other. To deliver the promised performance within the given power budget, exascale computing mandates a deep understanding of the interplay among energy efficiency, resilience, and scalability. In this work, we propose novel methods to analyze and optimize the costs of common resilience techniques including checkpoint-restart and forward recovery. We focus on sparse linear solvers as they are the fundamental kernels in many scientific applications. In particular, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes on computer clusters, and develop and prototype performance optimization and power management strategies to improve energy efficiency. Moreover, we take a deep dive into forward recovery, which has recently started to draw attention from researchers, and propose a practical matrix-aware optimization technique to reduce its recovery time. This work shows that while the time and energy costs of various resilience techniques are different, they share common components and can be quantitatively evaluated with a generalized framework. This analysis framework can be used to guide the design of performance and energy optimization technologies. While each resilience technique has its advantages depending on the fault rate, system size, and power budget, forward recovery can further benefit from matrix-aware optimizations for large-scale computing.
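As a rough illustration of the checkpoint-restart scheme named in the abstract, the following Python sketch runs a conjugate gradient (a Krylov solver) with periodic in-memory checkpoints and one simulated fault. It is not the authors' framework: the fault model, the function names, and the in-memory checkpoint are all assumptions made for illustration. A production solver would write checkpoints to stable storage, and forward recovery is not shown.

```python
def matvec(A, x):
    """Dense matrix-vector product (a real solver would use a sparse format)."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def laplacian_1d(n):
    """Symmetric positive definite test matrix: 2 on the diagonal, -1 next to it."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]


def cg_with_checkpoints(A, b, ckpt_every=2, fault_at=None, tol=1e-10, max_iter=200):
    """Conjugate gradient with checkpoint-restart.

    fault_at is a hypothetical fault model: when the iteration counter first
    reaches this value, the solver state is treated as lost and restored from
    the last checkpoint. Returns (x, iterations, redone_iterations).
    """
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A*x for x = 0
    p = list(r)
    rs = dot(r, r)
    ckpt = (0, list(x), list(r), list(p), rs)
    k = 0                            # logical iteration counter
    executed = 0                     # iterations actually run, including redone work
    fault_pending = fault_at is not None
    while rs > tol * tol and k < max_iter:
        if fault_pending and k == fault_at:
            fault_pending = False    # the fault strikes once; roll back
            k, x, r, p, rs = ckpt[0], list(ckpt[1]), list(ckpt[2]), list(ckpt[3]), ckpt[4]
            continue
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
        k += 1
        executed += 1
        if k % ckpt_every == 0:      # periodic checkpoint of the full solver state
            ckpt = (k, list(x), list(r), list(p), rs)
    return x, k, executed - k


A = laplacian_1d(8)
b = [float(i) for i in range(1, 9)]
x, iters, redone = cg_with_checkpoints(A, b, ckpt_every=3, fault_at=5)
print(iters, redone)
```

The redone iterations are the recovery overhead: they cost both time and energy without advancing the solution. A shorter checkpoint interval reduces that rework but pays the checkpointing cost more often, which is the kind of trade-off the abstract describes.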
Parallel Computing, Volume 123, Article 103122
Citations: 0
Seesaw: A 4096-bit vector processor for accelerating Kyber based on RISC-V ISA extensions
IF 2; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-11-08; DOI: 10.1016/j.parco.2024.103121
Xiaofeng Zou , Yuanxi Peng , Tuo Li , Lingjun Kong , Lu Zhang
The ML-KEM standard, based on the Kyber algorithm, is one of the post-quantum cryptography (PQC) standards released by the National Institute of Standards and Technology (NIST) to withstand quantum attacks. To increase throughput and reduce the execution time, which is limited by the high computational complexity of the Kyber algorithm, a RISC-V-based processor, Seesaw, is designed to accelerate the Kyber algorithm. Its 32 specialized extension instructions, derived from a thorough analysis of the algorithm's characteristics, are mainly designed to enhance the parallel computing ability of the processor and accelerate all the processes of the Kyber algorithm. Subsequently, by carefully designing hardware such as poly vector registers and algorithm execution units on the RISC-V processor, microarchitectural support for the extension instructions is achieved. Seesaw supports 4096-bit vector calculations through its poly vector registers and execution unit to meet high-throughput requirements and is implemented on a field-programmable gate array (FPGA). In addition, we modify the compiler to adapt to the instruction extension and execution of Seesaw. Experimental results indicate that the processor achieves a speed-up of 432× and 18864× for hash and NTT, respectively, compared with the processor without extension instructions, and a speed-up of 5.6× for the execution of the Kyber algorithm compared with the advanced hardware design.
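For readers unfamiliar with the arithmetic being accelerated, here is a minimal Python sketch of the polynomial multiplication at the core of Kyber: a product in Z_q[x]/(x^n + 1) with q = 3329 and n = 256. It uses the quadratic schoolbook method, which is the baseline that the number-theoretic transform (NTT) replaces with an O(n log n) computation. This is a reference model only, not the Seesaw instruction set or datapath, and the function name is an assumption. As a side observation (not a claim from the abstract), 256 coefficients stored as 16-bit words occupy exactly 4096 bits.

```python
Q = 3329   # Kyber modulus
N = 256    # coefficients per polynomial


def polymul_negacyclic(a, b, q=Q, n=N):
    """Schoolbook product of a and b in Z_q[x] / (x^n + 1).

    a and b are coefficient lists of length at most n, lowest degree first.
    """
    c = [0] * n
    for i, ai in enumerate(a):
        if ai == 0:
            continue
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                c[k] = (c[k] + ai * bj) % q
            else:
                # x^n wraps around to -1, so the term is subtracted.
                c[k - n] = (c[k - n] - ai * bj) % q
    return c


# x^255 * x = x^256 = -1 (mod x^256 + 1), i.e. constant coefficient q - 1.
a = [0] * N
a[255] = 1
b = [0] * N
b[1] = 1
print(polymul_negacyclic(a, b)[0])  # prints 3328
```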
Parallel Computing, Volume 123, Article 103121
Citations: 0
FastPTM: Fast weights loading of pre-trained models for parallel inference service provisioning
IF 2; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-10-10; DOI: 10.1016/j.parco.2024.103114
Fenglong Cai , Dong Yuan , Zhe Yang , Yonghui Xu , Wei He , Wei Guo , Lizhen Cui
Pre-trained models (PTMs) have demonstrated great success in a variety of NLP and CV tasks and have become a significant development in the field of deep learning. However, the large memory and high computational requirements associated with PTMs can increase the cost and time of inference, limiting their service provisioning in practical applications. To improve the Quality of Service (QoS) of PTM applications by reducing waiting and response times, we propose the FastPTM framework. This general framework aims to accelerate PTM inference services in a multi-tenant environment by reducing model loading time and switching overhead on GPUs. The framework utilizes a fast weights loading method based on weights and model separation of PTMs to efficiently accelerate parallel inference services in resource-constrained environments. Furthermore, an online scheduling algorithm is designed to reduce the inference service time. The results of the experiments indicate that FastPTM can improve the throughput of inference services by an average of 4x and up to 8.2x, while reducing the number of switches by 4.7x and the number of overtimes by 15.3x.
Parallel Computing, Volume 122, Article 103114
Citations: 0
Distributed consensus-based estimation of the leading eigenvalue of a non-negative irreducible matrix
IF 2; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-10-05; DOI: 10.1016/j.parco.2024.103113
Rahim Alizadeh , Shahriar Bijani , Fatemeh Shakeri
This paper presents an algorithm to solve the problem of estimating the largest eigenvalue and its corresponding eigenvector for irreducible matrices in a distributed manner. The proposed algorithm utilizes a network of computational nodes that interact with each other, forming a strongly connected digraph where each node handles one row of the matrix, without the need for centralized storage or knowledge of the entire matrix. Each node possesses a solution space, and the intersection of all these solution spaces contains the leading eigenvector of the matrix. Initially, each node selects a random vector from its solution space, and then, while interacting with its neighbors, updates the vector at each step by solving a quadratically constrained linear program (QCLP). The updates are done so that the nodes reach a consensus on the leading eigenvector of the matrix. The numerical outcomes demonstrate the effectiveness of our proposed method.
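For context, the quantity being estimated can be computed centrally with classical power iteration, sketched below in Python. This is a baseline for comparison, not the paper's distributed QCLP-based consensus algorithm, and the function name is an assumption. The sketch also assumes the matrix is primitive (irreducible and aperiodic), since plain power iteration can oscillate on periodic matrices. Each entry of the product is computed from a single row of the matrix, which is the part a node holding one row could evaluate once it knows the current vector.

```python
def power_iteration(A, iters=1000, tol=1e-12):
    """Estimate the leading eigenvalue and eigenvector of a non-negative matrix.

    The vector is kept normalized so that its entries sum to 1; the sum of
    A*v is then the current eigenvalue estimate.
    """
    n = len(A)
    v = [1.0 / n] * n                 # strictly positive starting vector
    lam = 0.0
    for _ in range(iters):
        # Entry i uses only row i of A.
        w = [sum(a * x for a, x in zip(row, v)) for row in A]
        new_lam = sum(w)
        v = [x / new_lam for x in w]
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam, v


A = [[2.0, 1.0],
     [1.0, 3.0]]
lam, v = power_iteration(A)
print(lam, v)
```

For this 2×2 example the exact leading eigenvalue is (5 + √5)/2, so the printed estimate can be checked by hand.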
Parallel Computing, Volume 122, Article 103113
Citations: 0
Parallel Pattern Compiler for Automatic Global Optimizations
IF 2; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-09-21; DOI: 10.1016/j.parco.2024.103112
Adrian Schmitz, Semih Burak, Julian Miller, Matthias S. Müller
High-performance computing (HPC) systems enable scientific advances through simulation and data processing. The heterogeneity in HPC hardware and software increases the application complexity and reduces its maintainability and productivity. This work proposes a prototype implementation for a parallel pattern-based source-to-source compiler to address these challenges. The prototype limits the complexity of parallelism and heterogeneous architectures to parallel patterns that are optimized towards a given target architecture. By applying high-level optimizations and a mapping between parallel patterns and execution units during compile time, portability between systems is achieved. The compiler can address architectures with shared memory, distributed memory, and accelerator offloading.
The approach shows speedups for seven of the nine supported Rodinia benchmarks, reaching speedups of up to twelve times. Porting LULESH to the Parallel Pattern Language (PPL) shows a compression of code size by 65% (3.4 thousand lines of code) through a more concise expression and a higher level of abstraction. The tool’s limitations include dynamic algorithms that are challenging to analyze statically and overheads during the compile time optimization. This paper is an extended version of a previous PMAM publication (Schmitz et al., 2024).
Parallel Computing, Volume 122, Article 103112
Citations: 0
Task scheduling in cloud computing based on grey wolf optimization with a new encoding mechanism
IF 2; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-09-17; DOI: 10.1016/j.parco.2024.103111
Xingwang Huang , Min Xie , Dong An , Shubin Su , Zongliang Zhang

Task scheduling in cloud computing remains challenging in terms of performance. Several evolutionary-derived algorithms have been proposed to solve or alleviate this problem. Evolutionary algorithms have good exploration ability, but their performance drops significantly in high dimensions. To address this issue, and considering the characteristics of task scheduling in cloud computing (i.e., all task-VM mappings are 1-dimensional and have the same search range), we propose a task scheduling algorithm based on grey wolf optimization with a new encoding mechanism (GWOEM). Through this new encoding mechanism, greedy and evolutionary algorithms are rationally integrated in GWOEM. Moreover, based on the new mechanism, the dimension of the search space is reduced to 1 and the key parameter (i.e., the population size) is eliminated. We apply the proposed GWOEM to the Google Cloud Jobs dataset (GoCJ) and demonstrate better performance than the prior state of the art in terms of makespan.
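To make the objective concrete, the Python sketch below computes a task-VM mapping with a simple greedy rule and reports its makespan (the finish time of the busiest VM). It is a greedy baseline only: the abstract does not describe GWOEM's encoding mechanism, so nothing here should be read as that algorithm. The function name and the task and VM values are assumptions for illustration.

```python
def greedy_schedule(task_lengths, vm_speeds):
    """Assign each task, longest first, to the VM that would finish it earliest.

    task_lengths: work per task (e.g. millions of instructions).
    vm_speeds: work per unit time for each VM.
    Returns (mapping task -> VM index, makespan).
    """
    finish = [0.0] * len(vm_speeds)          # current finish time of each VM
    mapping = {}
    order = sorted(range(len(task_lengths)), key=lambda t: -task_lengths[t])
    for t in order:
        best = min(range(len(vm_speeds)),
                   key=lambda v: finish[v] + task_lengths[t] / vm_speeds[v])
        finish[best] += task_lengths[t] / vm_speeds[best]
        mapping[t] = best
    return mapping, max(finish)


mapping, makespan = greedy_schedule([8, 4, 4, 2], [2, 1])
print(mapping, makespan)  # prints {0: 0, 1: 1, 2: 0, 3: 1} 6.0
```

Each task-VM assignment is a single integer in the same range (a VM index), which matches the abstract's remark that all mappings are 1-dimensional and share one search range.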

Parallel Computing, Volume 122, Article 103111
Citations: 0
An automated OpenMP mutation testing framework for performance optimization
IF 2; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-08-21; DOI: 10.1016/j.parco.2024.103097
Dolores Miao , Ignacio Laguna , Giorgis Georgakoudis , Konstantinos Parasyris , Cindy Rubio-González

Performance optimization continues to be a challenge in modern HPC software. Existing performance optimization techniques, including profiling-based and auto-tuning techniques, fail to indicate program modifications at the source level, which prevents their portability across compilers. This paper describes Muppet, a new approach that identifies program modifications, called mutations, aimed at improving program performance. Muppet’s mutations help developers reason about performance defects and missed opportunities to improve performance at the source code level. In contrast to compiler techniques that optimize code at intermediate representations (IR), Muppet uses the idea of source-level mutation testing to relax correctness constraints and automatically discover optimization opportunities that otherwise are not feasible using the IR. We demonstrate Muppet’s concept in the OpenMP programming model. Muppet generates a list of OpenMP mutations that alter the program parallelism in various ways, and is capable of running a variety of optimization algorithms such as delta debugging, Bayesian optimization and decision tree optimization to find a subset of mutations which, when applied to the original program, cause the most speedup while maintaining program correctness. When Muppet is evaluated against a diverse set of benchmark programs and proxy applications, it is capable of finding sets of mutations that induce speedup in 75.9% of the evaluated programs.

Parallel Computing, Volume 121, Article 103097
PDF: https://www.sciencedirect.com/science/article/pii/S0167819124000358/pdfft?md5=139743a6196b36bc64bd1733300112aa&pid=1-s2.0-S0167819124000358-main.pdf
Citations: 0
Abstractions for C++ code optimizations in parallel high-performance applications
IF 2; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-08-14; DOI: 10.1016/j.parco.2024.103096
Jiří Klepl, Adam Šmelko, Lukáš Rozsypal, Martin Kruliš

Many computational problems consider memory throughput a performance bottleneck, especially in the domain of parallel computing. Software needs to be attuned to hardware features like cache architectures or concurrent memory banks to reach a decent level of performance efficiency. This can be achieved by selecting the right memory layouts for data structures or changing the order of data structure traversal. In this work, we present an abstraction for traversing a set of regular data structures (e.g., multidimensional arrays) that allows the design of traversal-agnostic algorithms. Such algorithms can easily optimize for memory performance and employ semi-automated parallelization or autotuning without altering their internal code. We also add an abstraction for autotuning that allows defining tuning parameters in one place and removes boilerplate code. The proposed solution was implemented as an extension of the Noarr library that simplifies a layout-agnostic design of regular data structures. It is implemented entirely using C++ template meta-programming without any nonstandard dependencies, so it is fully compatible with existing compilers, including CUDA NVCC or Intel DPC++. We evaluate the performance and expressiveness of our approach on the Polybench-C benchmarks.
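The idea of a traversal-agnostic algorithm can be illustrated with a small Python analogy. The class and function names below are hypothetical and bear no relation to the actual Noarr API, which is implemented with C++ templates and resolves layouts at compile time. The point is only that the algorithm body names neither the memory layout nor the loop order; both are supplied by the layout object.

```python
class Layout2D:
    """Maps a logical (i, j) index to an offset in a flat buffer."""

    def __init__(self, rows, cols, row_major=True):
        self.rows, self.cols = rows, cols
        # Strides for (i, j): row-major keeps j contiguous, column-major keeps i.
        self.strides = (cols, 1) if row_major else (1, rows)

    def offset(self, i, j):
        return i * self.strides[0] + j * self.strides[1]

    def traverse(self):
        """Yield (i, j) in the order that walks the buffer contiguously."""
        if self.strides[1] == 1:              # j is the fast index
            for i in range(self.rows):
                for j in range(self.cols):
                    yield i, j
        else:                                 # i is the fast index
            for j in range(self.cols):
                for i in range(self.rows):
                    yield i, j


def pack(matrix, layout):
    """Store a nested-list matrix into a flat buffer using the given layout."""
    buf = [0] * (layout.rows * layout.cols)
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            buf[layout.offset(i, j)] = value
    return buf


def row_sums(buf, layout):
    """Algorithm body: no mention of strides or loop nesting order."""
    sums = [0] * layout.rows
    for i, j in layout.traverse():
        sums[i] += buf[layout.offset(i, j)]
    return sums


matrix = [[1, 2, 3], [4, 5, 6]]
for row_major in (True, False):
    layout = Layout2D(2, 3, row_major=row_major)
    print(row_sums(pack(matrix, layout), layout))  # prints [6, 15] both times
```

Switching the layout changes the order in which memory is touched but leaves `row_sums` unmodified, which is the property that makes tuning the layout or traversal order possible without rewriting the algorithm.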

Parallel Computing, vol. 121, Article 103096.
Citations: 0
Mobilizing underutilized storage nodes via job path: A job-aware file striping approach
IF 2 Zone 4 (Computer Science) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-08-10 DOI: 10.1016/j.parco.2024.103095
Gang Xian, Wenxiang Yang, Yusong Tan, Jinghua Feng, Yuqi Li, Jian Zhang, Jie Yu

Users’ limited understanding of the storage system architecture prevents them from fully utilizing the parallel I/O capability of the storage system, which degrades the overall performance of supercomputers. Therefore, exploring effective strategies for utilizing parallel I/O capabilities is imperative. In this regard, we analyze the workload on the Object Storage Targets (OSTs) of two production supercomputers and study potentially inefficient I/O patterns of high-performance computing jobs. Our findings indicate that under the traditional stripe settings that most supercomputers use to ensure stability, the real-time load on OSTs is severely unbalanced. This imbalance results in I/O requests that fail to fully utilize the available OSTs. To tackle this issue, we propose a job-aware optimization approach, which includes static and dynamic file striping. Static file striping optimizes all user jobs, whereas dynamic file striping employs clustering of job names and job paths to extract similarities among jobs and predict partially stripe-optimizable jobs for users. Additionally, a stripe recovery mechanism is employed to mitigate the negative impact of stripe misconfigurations. This approach adjusts the file stripe layout based on the job’s I/O pattern, allowing for better mobilization of underutilized OSTs to enhance parallel I/O capabilities. Experiments show that the approach increases the number of OSTs that jobs can use, improving the parallel I/O performance of jobs without significantly affecting operational stability.
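
To see why stripe settings determine how many OSTs a request can reach, consider a simplified round-robin (RAID-0 style) striping model. This is an illustrative assumption, not the method of the paper, and the function names `stripe_index` and `osts_touched` are hypothetical. In this model a file is cut into chunks of `stripe_size` bytes, and chunk `k` is placed on the file's OST number `k mod stripe_count`.

```python
def stripe_index(offset, stripe_size, stripe_count):
    """Index (within the file's stripe set) of the OST holding byte `offset`."""
    return (offset // stripe_size) % stripe_count


def osts_touched(offset, length, stripe_size, stripe_count):
    """Set of OST indices that a read/write of `length` bytes at `offset` hits."""
    if length <= 0:
        return set()
    first_chunk = offset // stripe_size
    last_chunk = (offset + length - 1) // stripe_size
    # After stripe_count consecutive chunks every OST has been hit once,
    # so there is no need to walk further than that.
    last_chunk = min(last_chunk, first_chunk + stripe_count - 1)
    return {chunk % stripe_count for chunk in range(first_chunk, last_chunk + 1)}
```

Under this model, an 8 MiB request with a 1 MiB stripe size reaches a single OST when the stripe count is 1, and four OSTs when the stripe count is 4. This is the kind of effect a stripe-layout adjustment aims to exploit.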

Parallel Computing, vol. 121, Article 103095.
Citations: 0
NxtSPR: A deadlock-free shortest path routing dedicated to relaying for Triplet-Based many-core Architecture
IF 2 Zone 4 (Computer Science) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-07-24 DOI: 10.1016/j.parco.2024.103094
Chunfeng Li, Karim Soliman, Fei Yin, Jin Wei, Feng Shi

Deadlock-free routing is a significant challenge in Network-on-Chip (NoC) design as it affects the network’s latency, power consumption, and load balance, impacting the performance of multi-processor systems-on-chip. However, achieving deadlock-free routing routinely incurs expensive overhead, as previous solutions either sacrifice performance or power efficiency to proactively avoid deadlocks, or impose high hardware complexity to reactively resolve deadlocks when they occur. Utilizing the various characteristics of NoC to implement deadlock-free routing can be significantly more cost-effective with less impact on performance. This paper proposes a relay routing algorithm (NxtSPR) with a shortest path property and a deadlock prevention mechanism based on a synchronized Hamiltonian ring. The proposal is based on an in-depth study of the characteristics of a Triplet-Based many-core Architecture (TriBA) NoC. We establish various important topology-related theories and perform a formal verification (proof-based) for them. By utilizing the critical subgraph and apex of TriBA, NxtSPR can pre-calculate downstream nodes’ forwarding ports for packets by using a concise judgment strategy. This significantly reduces the computational overhead required for data transmission while optimizing the pipeline design of routers to decrease packet transmission latency and power consumption compared to other TriBA routing algorithms. We group the data transmissions according to the levels of maximum Hamiltonian edges a packet will traverse during its transmission life cycle. Independent data transmissions between groups can avoid mutual interference and resource competition, eliminating potential deadlocks. Gem5 simulation results show that, under the synthetic traffic patterns, compared to the representative (Table) and up-to-date (SPR4T) routing algorithms, NxtSPR achieves a 20.19%, 14.76%, and 5.54%, 4.66% reduction in average packet latency and per-packet power consumption, respectively. Moreover, it improves throughput by an average of 18.50% and 4.34% compared to them. PARSEC benchmark results show that NxtSPR reduces application runtime by up to 22.30% and 12.82% compared to Table and SPR4T, and running the same applications with TriBA results in a maximum runtime reduction of 10.77% compared to 2D-Mesh.

Parallel Computing, vol. 121, Article 103094.
Citations: 0