
Latest articles from IEEE Transactions on Computers

Raptor-T: A Fused and Memory-Efficient Sparse Transformer for Long and Variable-Length Sequences
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-04-16 · DOI: 10.1109/TC.2024.3389507
Hulin Wang;Donglin Yang;Yaqi Xia;Zheng Zhang;Qigang Wang;Jianping Fan;Xiaobo Zhou;Dazhao Cheng
Transformer-based models have made significant advancements across various domains, largely due to the self-attention mechanism's ability to capture contextual relationships in input sequences. However, processing long sequences remains computationally expensive for Transformer models, primarily due to the $O(n^{2})$ complexity associated with self-attention. To address this, sparse attention has been proposed to reduce the quadratic dependency to linear. Nevertheless, deploying the sparse transformer efficiently encounters two major obstacles: 1) Existing system optimizations are less effective for the sparse transformer due to the algorithm's approximation properties leading to fragmented attention, and 2) the variability of input sequences results in computation and memory access inefficiencies. We present Raptor-T, a cutting-edge transformer framework designed for handling long and variable-length sequences. Raptor-T harnesses the power of the sparse transformer to reduce resource requirements for processing long sequences while also implementing system-level optimizations to accelerate inference performance. To address the fragmented attention issue, Raptor-T employs fused and memory-efficient Multi-Head Attention. Additionally, we introduce an asynchronous data processing method to mitigate GPU-blocking operations caused by sparse attention. Furthermore, Raptor-T minimizes padding for variable-length inputs, effectively reducing the overhead associated with padding and achieving balanced computation on GPUs. In evaluation, we compare Raptor-T's performance against state-of-the-art frameworks on an NVIDIA A100 GPU. The experimental results demonstrate that Raptor-T outperforms FlashAttention-2 and FasterTransformer, achieving an impressive average end-to-end performance improvement of 3.41X and 3.71X, respectively.
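Raptor-T's fused CUDA kernels are not reproduced here, but the algorithmic idea behind sparse attention can be illustrated with one common sparsity pattern. The sketch below (a minimal numpy illustration, not the authors' implementation) contrasts full attention with sliding-window attention, where each query attends only to keys within ±w positions, reducing the quadratic dependency to linear in the sequence length.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(Q, K, V):
    # Full attention: every query attends to every key, O(n^2) scores.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def sliding_window_attention(Q, K, V, w):
    # Sparse attention: each query attends only to keys within +/- w
    # positions, so work and memory drop from O(n^2) to O(n * w).
    n, d = Q.shape
    out = np.empty((n, V.shape[-1]))
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        s = Q[i] @ K[lo:hi].T / np.sqrt(d)
        out[i] = softmax(s) @ V[lo:hi]
    return out
```

When w ≥ n−1 the window covers the whole sequence and the two functions agree; for long sequences with a fixed w, the sparse variant touches only O(n·w) score entries.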
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1852-1865.
Citations: 0
Ara2: Exploring Single- and Multi-Core Vector Processing With an Efficient RVV 1.0 Compliant Open-Source Processor
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-04-15 · DOI: 10.1109/TC.2024.3388896
Matteo Perotti;Matheus Cavalcante;Renzo Andri;Lukas Cavigelli;Luca Benini
Vector processing is highly effective in boosting processor performance and efficiency for data-parallel workloads. In this paper, we present Ara2, the first fully open-source vector processor to support the RISC-V V 1.0 frozen ISA. We evaluate Ara2's performance on a diverse set of data-parallel kernels for various problem sizes and vector-unit configurations, achieving an average functional-unit utilization of 95% on the most computationally intensive kernels. We pinpoint performance boosters and bottlenecks, including the scalar core, memories, and vector architecture, providing insights into the main vector architecture's performance drivers. Leveraging the openness of the design, we implement Ara2 in a 22nm technology, characterize its PPA metrics on various configurations (2-16 lanes), and analyze its microarchitecture and implementation bottlenecks. Ara2 achieves a state-of-the-art energy efficiency of 37.8 DP-GFLOPS/W (0.8V) and 1.35GHz of clock frequency (critical path: $\sim$40 FO4 gates). Finally, we explore the performance and energy-efficiency trade-offs of multi-core vector processors: we find that multiple vector cores help overcome the scalar core issue-rate bound that limits short-vector performance. For example, a cluster of eight 2-lane Ara2 (16 FPUs) achieves more than 3x better performance than a 16-lane single-core Ara2 (16 FPUs) when executing a 32x32x32 matrix multiplication, with 1.5x improved energy efficiency.
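A key RVV 1.0 feature that processors like Ara2 implement is strip-mining via `vsetvli`: the loop asks the hardware how many elements the next vector instruction may process, so any problem size is handled without a scalar remainder loop. The following Python sketch (an illustrative model, not Ara2's microarchitecture) mimics that control flow for an AXPY kernel; `vlmax` stands in for the hardware's VLMAX.

```python
def vsetvl(avl, vlmax):
    # Models RVV 1.0 vsetvl: given the application vector length (avl),
    # return how many elements the next vector instruction processes.
    return min(avl, vlmax)

def axpy_strip_mined(a, x, y, vlmax=4):
    # y[i] += a * x[i], processed in hardware-sized strips, as an RVV
    # loop would with a vsetvli at the top of each iteration.
    n, i = len(x), 0
    while i < n:
        vl = vsetvl(n - i, vlmax)
        for j in range(i, i + vl):   # one vector instruction in hardware
            y[j] += a * x[j]
        i += vl
    return y
```

For example, `axpy_strip_mined(2.0, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])` returns `[2.0, 4.0, 6.0]`; with `vlmax=4`, a length-7 input is processed as strips of 4 and 3.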
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1822-1836.
Citations: 0
Construction of Reed-Solomon Erasure Codes With Four Parities Based on Systematic Vandermonde Matrices
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-04-10 · DOI: 10.1109/TC.2024.3387069
Leilei Yu;Yunghsiang S. Han
In 2021, Tang et al. proposed an improved construction of Reed-Solomon (RS) erasure codes with four parity symbols to accelerate the computation of the Reed-Muller (RM) transform-based RS algorithm. The idea is to change the original Vandermonde parity-check matrix into a systematic Vandermonde parity-check matrix. However, the construction relies on a computer search and requires that the size of the information vector of RS codes does not exceed 52. This paper improves on that idea and proposes a purely algebraic construction. The proposed method has a more explicit construction, a wider range of codeword lengths, and competitive encoding/erasure decoding computational complexity.
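The notion of a systematic Vandermonde parity-check matrix can be made concrete with a small sketch (an illustration of the general idea, not the paper's specific construction): build a 4-row Vandermonde matrix over GF(2^8) at distinct evaluation points, then row-reduce it so the last four columns form the identity. Row operations preserve the code, so the result $[A \mid I]$ is an equivalent parity-check matrix from which the four parities are computed directly as $p = A m$. The field polynomial 0x11d and the points $x_j = 2^j$ are illustrative choices.

```python
GF_POLY = 0x11d  # a common primitive polynomial for GF(2^8) (illustrative choice)

def gf_mul(a, b):
    # Carry-less multiplication reduced modulo the field polynomial.
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= GF_POLY
        b >>= 1
    return r

def gf_inv(a):
    # a^(2^8 - 2) = a^(-1) for nonzero a in GF(2^8).
    r, e = 1, 254
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def vandermonde_parity_check(n, t=4):
    # t x n Vandermonde matrix H[i][j] = x_j^i at distinct points x_j = 2^j.
    xs, x = [], 1
    for _ in range(n):
        xs.append(x)
        x = gf_mul(x, 2)
    H = [[1] * n]
    for _ in range(t - 1):
        H.append([gf_mul(H[-1][j], xs[j]) for j in range(n)])
    return H

def systematic(H):
    # Row-reduce so the last t columns become the identity. Row operations
    # keep the same nullspace, i.e., the same code: H becomes [A | I].
    t, n = len(H), len(H[0])
    H = [row[:] for row in H]
    for i in range(t):
        c = n - t + i
        piv = next(r for r in range(i, t) if H[r][c])
        H[i], H[piv] = H[piv], H[i]
        inv = gf_inv(H[i][c])
        H[i] = [gf_mul(inv, v) for v in H[i]]
        for r in range(t):
            if r != i and H[r][c]:
                f = H[r][c]
                H[r] = [H[r][j] ^ gf_mul(f, H[i][j]) for j in range(n)]
    return H

def encode(Hs, message):
    # With Hs = [A | I], the parities are simply p = A * m (addition is XOR),
    # and the systematic codeword is m followed by p.
    t, n = len(Hs), len(Hs[0])
    k = n - t
    assert len(message) == k
    parities = []
    for i in range(t):
        acc = 0
        for j in range(k):
            acc ^= gf_mul(Hs[i][j], message[j])
        parities.append(acc)
    return message + parities
```

Any four columns of a Vandermonde matrix at distinct points are invertible, which is why the row reduction always finds a pivot and why any four erasures remain recoverable.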
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1875-1882.
Citations: 0
Multi-Objective Hardware-Mapping Co-Optimisation for Multi-DNN Workloads on Chiplet-Based Accelerators
IF 3.6 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-04-10 · DOI: 10.1109/TC.2024.3386067
Abhijit Das;Enrico Russo;Maurizio Palesi
The need to efficiently execute different Deep Neural Networks (DNNs) on the same computing platform, coupled with the requirement for easy scalability, makes Multi-Chip Module (MCM)-based accelerators a preferred design choice. Such an accelerator brings together heterogeneous sub-accelerators in the form of chiplets, interconnected by a Network-on-Package (NoP). This paper addresses the challenge of selecting the most suitable sub-accelerators, configuring them, determining their optimal placement in the NoP, and mapping the layers of a predetermined set of DNNs spatially and temporally. The objective is to minimise execution time and energy consumption during parallel execution while also minimising the overall cost, specifically the silicon area, of the accelerator. This paper presents MOHaM, a framework for multi-objective hardware-mapping co-optimisation for multi-DNN workloads on chiplet-based accelerators. MOHaM exploits a multi-objective evolutionary algorithm that has been specialised for the given problem by incorporating several customised genetic operators. MOHaM is evaluated against state-of-the-art Design Space Exploration (DSE) frameworks on different multi-DNN workload scenarios. The solutions discovered by MOHaM are Pareto optimal compared to those by the state-of-the-art. Specifically, MOHaM-generated accelerator designs can reduce latency by up to 96% and energy by up to 96.12%.
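The core operation behind any multi-objective search like MOHaM's is Pareto filtering: a design point survives only if no other point is at least as good in every objective and strictly better in one. A minimal sketch (generic, not MOHaM's evolutionary algorithm) over hypothetical (latency, energy, area) tuples:

```python
def dominates(a, b):
    # a dominates b if it is no worse in every objective and strictly
    # better in at least one (all objectives are minimised here).
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    # Keep only non-dominated points; these are the Pareto-optimal designs.
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For instance, with designs `[(10, 5, 3), (8, 6, 3), (12, 4, 2), (9, 9, 9)]`, the point `(9, 9, 9)` is dominated by `(8, 6, 3)` and is dropped, while the other three are mutually non-dominated trade-offs.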
IEEE Transactions on Computers, vol. 73, no. 8, pp. 1883-1898.
Citations: 0
COALA: A Compiler-Assisted Adaptive Library Routines Allocation Framework for Heterogeneous Systems
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-04-09 · DOI: 10.1109/TC.2024.3385269
Qinyun Cai;Guanghua Tan;Wangdong Yang;Xianhao He;Yuwei Yan;Keqin Li;Kenli Li
Experienced developers often leverage well-tuned libraries and allocate their routines for computing tasks to enhance performance when building modern scientific and engineering applications. However, such well-tuned libraries are meticulously customized for specific target architectures or environments. Additionally, the performance of their routines is significantly impacted by the actual input data of computing tasks, which often remains uncertain until runtime. Accordingly, statically allocating these library routines may hinder the adaptability of applications and compromise performance, particularly in the context of heterogeneous systems. To address this issue, we propose the Compiler-Assisted Adaptive Library Routines Allocation (COALA) framework for heterogeneous systems. COALA is a fully automated mechanism that employs compiler assistance for dynamic allocation of the most suitable routine to each computing task on heterogeneous systems. It allows the deployment of varying allocation policies tailored to specific optimization targets. During the application compilation process, COALA reconstructs computing tasks and inserts a probe for each of these tasks. Probes serve the purpose of conveying vital information about the requirements of each task, including its computing objective, data size, and computing flops, to a user-level allocation component at runtime. Subsequently, the allocation component utilizes the probe information along with the allocation policy to assign the most optimal library routine for executing the computing tasks. In our prototype, we further introduce and deploy a performance-oriented allocation policy founded on a machine learning-based performance evaluation method for library routines. 
Experimental verification and evaluation on two heterogeneous systems reveal that COALA can significantly improve application performance, with gains of up to 4.3x for numerical simulation software and 4.2x for machine learning applications, and enhance system utilization by up to 27.8%.
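The probe-then-allocate flow described above can be sketched in a few lines. In this hypothetical model (the probe fields follow the abstract's description, but the routine names and cost functions are invented for illustration), each compiled task carries a probe with its objective, data size, and flop count, and a performance-oriented policy picks the routine with the lowest predicted execution time:

```python
from dataclasses import dataclass

@dataclass
class Probe:
    # Illustrative probe contents, per the paper's description: the task's
    # computing objective, its data size, and its flop count.
    objective: str
    size: int
    flops: float

# Hypothetical per-routine cost models (predicted seconds), e.g. fitted
# offline from measurements; the constants here are made up.
ROUTINES = {
    "gemm": {
        "cpu_blas": lambda p: 1e-9 * p.flops,
        "gpu_blas": lambda p: 5e-3 + 1e-11 * p.flops,  # launch overhead + fast compute
    },
}

def allocate(probe):
    # Performance-oriented policy: choose the candidate routine with the
    # lowest predicted execution time for this probe.
    candidates = ROUTINES[probe.objective]
    return min(candidates, key=lambda name: candidates[name](probe))
```

Under these assumed cost models, a small GEMM stays on the CPU (the GPU launch overhead dominates) while a large GEMM is routed to the GPU, which is exactly the kind of input-dependent decision a static allocation cannot make.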
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1724-1737.
Citations: 0
Incendio: Priority-Based Scheduling for Alleviating Cold Start in Serverless Computing
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-04-08 · DOI: 10.1109/TC.2024.3386063
Xinquan Cai;Qianlong Sang;Chuang Hu;Yili Gong;Kun Suo;Xiaobo Zhou;Dazhao Cheng
In serverless computing, cold start results in long response latency. Existing approaches strive to alleviate the issue by reducing the number of cold starts. However, our measurement based on real-world production traces shows that the minimum number of cold starts does not equate to the minimum response latency, and solely focusing on optimizing the number of cold starts will lead to sub-optimal performance. The root cause is that functions have different priorities in terms of latency benefits by transferring a cold start to a warm start. In this paper, we propose Incendio, a serverless computing framework exploiting priority-based scheduling to minimize the overall response latency from the perspective of cloud providers. We reveal the priority of a function is correlated to multiple factors and design a priority model based on Spearman's rank correlation coefficient. We integrate a hybrid Prophet-LightGBM prediction model to dynamically manage runtime pools, which enables the system to prewarm containers in advance and terminate containers at the appropriate time. Furthermore, to satisfy the low-cost and high-accuracy requirements in serverless computing, we propose a Clustered Reinforcement Learning-based function scheduling strategy. The evaluations show that Incendio speeds up the native system by 1.4$\times$, and achieves 23% and 14.8% latency reductions compared to two state-of-the-art approaches.
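The priority model is built on Spearman's rank correlation coefficient, which measures how monotonically a candidate factor (e.g., invocation frequency) tracks the latency benefit of warming a function. A minimal self-contained implementation (with average ranks for ties; how Incendio combines the coefficients into a priority is not shown here):

```python
def ranks(xs):
    # 1-based ranks, with tied values assigned their average rank.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Spearman's rho = Pearson correlation of the two rank vectors.
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

Because it works on ranks, any monotone relationship scores ±1: `spearman([1, 2, 3, 4], [1, 4, 9, 16])` is 1.0 even though the relationship is nonlinear, which suits factors whose effect on latency benefit is monotone but not linear.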
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1780-1794.
Citations: 0
Design, Implementation and Evaluation of a New Variable Latency Integer Division Scheme
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2024-04-08 · DOI: 10.1109/TC.2024.3386060
Marco Angioli;Marcello Barbirotta;Abdallah Cheikh;Antonio Mastrandrea;Francesco Menichelli;Saeid Jamili;Mauro Olivieri
Integer division is key for various applications and often represents the performance bottleneck due to its inherent mathematical properties that limit its parallelization. This paper presents a new data-dependent variable latency division algorithm derived from the classic non-performing restoring method. The proposed technique exploits the relationship between the number of leading zeros in the divisor and in the partial remainder to dynamically detect and skip those iterations that result in a simple left shift. While a similar principle has been exploited in previous works, the proposed approach outperforms existing variable latency divider schemes in average latency and power consumption. We detail the algorithm and its implementation in four variants, offering versatility for the specific application requirements. For each variant, we report the average latency evaluated with different benchmarks, and we analyze the synthesis results for both FPGA and ASIC deployment, reporting clock speed, average execution time, hardware resources, and energy consumption, compared with existing fixed and variable latency dividers.
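The leading-zeros shortcut can be demonstrated in software. The sketch below (an illustration of the general skip idea, not the authors' exact hardware datapath) runs classic restoring division MSB-first, but whenever the partial remainder's bit-length is below the divisor's, it shifts in several dividend bits at once: all those quotient bits are provably zero, since the intermediate remainders cannot reach the divisor.

```python
def restoring_div_with_skip(n, d, width=32):
    # Unsigned restoring division that skips runs of guaranteed-zero
    # quotient bits (resolved in one cycle by hardware skip logic).
    assert d > 0 and 0 <= n < (1 << width)
    q = r = 0
    i = width - 1          # index of the next dividend bit (MSB first)
    steps = 0              # iterations actually executed
    while i >= 0:
        # While bitlen(r) + k <= bitlen(d) - 1, shifting in k bits keeps
        # every intermediate remainder below d, so those quotient bits are 0.
        k = min(max(d.bit_length() - 1 - r.bit_length(), 0), i + 1)
        if k:
            r = (r << k) | ((n >> (i - k + 1)) & ((1 << k) - 1))
            q <<= k
            i -= k
            steps += 1
            if i < 0:
                break
        # One classic restoring iteration for the next bit.
        r = (r << 1) | ((n >> i) & 1)
        q <<= 1
        if r >= d:
            r -= d
            q |= 1
        i -= 1
        steps += 1
    return q, r, steps
```

The skip preserves correctness because every skipped iteration would have produced a zero quotient bit and performed no subtraction; only the iteration count changes, which is exactly why the latency becomes data-dependent.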
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1767-1779 (open access).
Citations: 0
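The shift-skipping idea behind the division scheme can be illustrated with a short behavioral model. This is a sketch only, not the paper's four hardware variants: it runs classic non-performing restoring division, but collapses each run of quotient bits that are guaranteed to be zero (detected from the bit-length, i.e. leading-zero, gap between the divisor and the partial remainder) into a single step. Counting a skip as one cycle is a modeling assumption.

```python
def clz_gap(divisor: int, remainder: int) -> int:
    """Upper bound on the number of guaranteed-zero quotient bits, derived
    from the leading-zero (bit-length) gap between divisor and remainder."""
    return max(0, divisor.bit_length() - (remainder + 1).bit_length())

def variable_latency_udiv(dividend: int, divisor: int, width: int = 32):
    """Unsigned restoring division that skips shift-only iterations.

    Returns (quotient, remainder, cycles), where cycles counts executed
    steps: each skip of k bits is modeled as a single cycle."""
    assert divisor != 0
    remainder, quotient, cycles = 0, 0, 0
    i = width - 1                       # index of the next dividend bit
    while i >= 0:
        # After shifting in k bits the remainder is < 2**k * (remainder+1),
        # so any k with 2**k * (remainder+1) <= divisor can only shift.
        k = clz_gap(divisor, remainder)
        while k > 0 and ((remainder + 1) << k) > divisor:
            k -= 1
        k = min(k, i + 1)
        if k > 0:                       # skip k guaranteed-zero quotient bits
            bits = (dividend >> (i - k + 1)) & ((1 << k) - 1)
            remainder = (remainder << k) | bits
            quotient <<= k
            i -= k
        else:                           # classic non-performing restoring step
            remainder = (remainder << 1) | ((dividend >> i) & 1)
            quotient <<= 1
            if remainder >= divisor:
                remainder -= divisor
                quotient |= 1
            i -= 1
        cycles += 1
    return quotient, remainder, cycles
```

For example, `variable_latency_udiv(7, 100)` finishes in far fewer than 32 cycles because almost every iteration is a guaranteed shift, which is exactly the data-dependent latency behavior the paper exploits.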
Error-Detection Schemes for Analog Content-Addressable Memories
IF 3.7 CAS Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-04-08 DOI: 10.1109/TC.2024.3386065
Ron M. Roth
Analog content-addressable memories (in short, a-CAMs) have been recently introduced as accelerators for machine-learning tasks, such as tree-based inference or implementation of nonlinear activation functions. The cells in these memories contain nanoscale memristive devices, which may be susceptible to various types of errors, such as manufacturing defects, inaccurate programming of the cells, or drifts in their contents over time. The objective of this work is to develop techniques for overcoming the reliability issues that are caused by such error events. To this end, several coding schemes are presented for the detection of errors in a-CAMs. These schemes consist of an encoding stage, a detection cycle (which is performed periodically), and some minor additions to the hardware. During encoding, redundancy symbols are programmed into a portion of the a-CAM (or, alternatively, are written into an external memory). During each detection cycle, a certain set of input vectors is applied to the a-CAM. The schemes differ in several ways, e.g., in the range of alphabet sizes that they are most suitable for, in the tradeoff that each provides between redundancy and hardware additions, or in the type of errors that they handle (Hamming metric versus $L_{1}$ metric).
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1795-1808.
Citations: 0
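The encode-then-periodic-detect structure described in the abstract can be illustrated with a toy checksum. This is emphatically not one of the paper's coding schemes (and a real a-CAM exposes cells only through match lines, not direct reads); it only shows the shape of the pipeline: an encoding stage programs a redundancy symbol, and a periodic detection cycle re-checks it. The level count `Q` is an assumption.

```python
Q = 8  # assumed number of distinguishable analog levels per cell

def encode_row(levels: list[int]) -> list[int]:
    """Encoding stage: append one redundancy symbol so the row sums to 0 mod Q."""
    return levels + [(-sum(levels)) % Q]

def detection_cycle(row: list[int]) -> bool:
    """Periodic check: the row passes iff its symbols still sum to 0 mod Q."""
    return sum(row) % Q == 0

row = encode_row([3, 1, 4, 1, 5])
drifted = list(row)
drifted[2] = (drifted[2] + 1) % Q  # a magnitude-1 content drift (L1 distance 1)
```

This toy check catches any single-cell error whose magnitude is not a multiple of Q, which loosely mirrors the abstract's distinction between Hamming-metric and L1-metric error models.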
Multi-Grained Trace Collection, Analysis, and Management of Diverse Container Images
IF 3.7 CAS Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-04-08 DOI: 10.1109/TC.2024.3383966
Zhuo Huang;Qi Zhang;Hao Fan;Song Wu;Chen Yu;Hai Jin;Jun Deng;Jing Gu;Zhimin Tang
Container technology is getting popular in cloud environments due to its lightweight feature and convenient deployment. The container registry plays a critical role in container-based clouds, as many container startups involve downloading layer-structured container images from a container registry. However, the container registry is struggling to efficiently manage images (i.e., transfer and store) with the emergence of diverse services and new image formats. The reason is that the container registry manages images uniformly at layer granularity. On the one hand, such uniform layer-level management probably cannot fit the various requirements of different kinds of containerized services well. On the other hand, new image formats organizing data in blocks or files cannot benefit from such uniform layer-level image management. In this paper, we perform the first analysis of image traces at multiple granularities (i.e., image-, layer-, and file-level) for various services and provide an in-depth comparison of different image formats. The traces are collected from a production-level container registry, amounting to 24 million requests and involving more than 184 TB of transferred data. We provide a number of valuable insights, including request patterns of services, file-level access patterns, and bottlenecks associated with different image formats. Based on these insights, we also propose two optimizations to improve image transfer and application deployment.
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1698-1710.
Citations: 0
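The multi-granularity analysis (image-, layer-, and file-level) can be sketched as a grouping over trace records. The record layout, field names, and values below are invented for illustration; they are not taken from the paper's production trace.

```python
from collections import Counter

# Hypothetical registry-trace records: (service, image, layer, file, bytes).
trace = [
    ("web", "nginx:1.25",  "layer-a", "/etc/nginx/nginx.conf",  4_096),
    ("web", "nginx:1.25",  "layer-a", "/usr/sbin/nginx",        1_048_576),
    ("web", "nginx:1.25",  "layer-b", "/lib/libssl.so.3",       2_097_152),
    ("ml",  "pytorch:2.1", "layer-c", "/opt/torch/libtorch.so", 512_000_000),
]

GRANULARITY_FIELD = {"image": 1, "layer": 2, "file": 3}

def requests_at(granularity: str) -> Counter:
    """Count registry requests grouped at the chosen granularity."""
    idx = GRANULARITY_FIELD[granularity]
    return Counter(rec[idx] for rec in trace)

def bytes_at(granularity: str) -> Counter:
    """Sum transferred bytes grouped at the chosen granularity."""
    idx = GRANULARITY_FIELD[granularity]
    totals: Counter = Counter()
    for rec in trace:
        totals[rec[idx]] += rec[4]
    return totals
```

Viewing the same trace at different granularities is what lets the paper contrast layer-structured images against the newer block- and file-organized formats.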
Analysis and Mitigation of Shared Resource Contention on Heterogeneous Multicore: An Industrial Case Study
IF 3.7 CAS Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-04-08 DOI: 10.1109/TC.2024.3386059
Michael Bechtel;Heechul Yun
In this paper, we present a solution to the industrial challenge put forth by ARM in 2022. We systematically analyze the effect of shared resource contention to an augmented reality head-up display (AR-HUD) case-study application of the industrial challenge on a heterogeneous multicore platform, NVIDIA Jetson Nano. We configure the AR-HUD application such that it can process incoming image frames in real-time at 20Hz on the platform. We use Microarchitectural Denial-of-Service (DoS) attacks as aggressor workloads of the challenge and show that they can dramatically impact the latency and accuracy of the AR-HUD application. This results in significant deviations of the estimated trajectories from known ground truths, despite our best effort to mitigate their influence by using cache partitioning and real-time scheduling of the AR-HUD application. To address the challenge, we propose RT-Gang++, a partitioned real-time gang scheduling framework with last-level cache (LLC) and integrated GPU bandwidth throttling capabilities. By applying RT-Gang++, we are able to achieve desired level of performance of the AR-HUD application even in the presence of fully loaded aggressor tasks.
IEEE Transactions on Computers, vol. 73, no. 7, pp. 1753-1766.
Citations: 0
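The access pattern of a memory-bandwidth aggressor, the kind of microarchitectural DoS workload the abstract uses against the AR-HUD application, can be modeled in a few lines. Real aggressors are written in C or assembly; Python's interpreter overhead makes this illustrative only, and the sizes below are assumptions rather than the paper's configuration: touching one byte per cache line of a buffer much larger than the LLC makes nearly every access miss the cache and occupy shared DRAM bandwidth.

```python
BUF_SIZE = 64 * 1024 * 1024   # bytes; far larger than a Jetson-Nano-class LLC
CACHE_LINE = 64               # bytes per cache line (typical for ARM/x86)

def aggressor_round(buf: bytearray, value: int) -> int:
    """One sweep of a bandwidth aggressor: write one byte per cache line
    across the whole buffer; returns the number of lines touched.
    A real attack loops such sweeps forever on a victim's sibling core."""
    for i in range(0, len(buf), CACHE_LINE):
        buf[i] = value
    return len(buf) // CACHE_LINE

buf = bytearray(BUF_SIZE)
lines = aggressor_round(buf, 0xAA)
```

Mitigations like the paper's RT-Gang++ work precisely by bounding how much of the LLC and memory bandwidth such a sweep can consume, rather than by changing the aggressor itself.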