
ACM Transactions on Reconfigurable Technology and Systems: Latest Publications

R-Blocks: an Energy-Efficient, Flexible, and Programmable CGRA
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-04-08, DOI: 10.1145/3656642
Barry de Bruin, Kanishkan Vadivel, Mark Wijtvliet, Pekka Jääskeläinen, Henk Corporaal

Emerging data-driven applications in the embedded, e-Health, and internet of things (IoT) domains require complex on-device signal analysis and data reduction to maximize energy efficiency on these energy-constrained devices. Coarse-grained reconfigurable architectures (CGRAs) have been proposed as a good compromise between flexibility and energy efficiency for ultra-low power (ULP) signal processing. Existing CGRAs are often specialized and domain-specific or can only accelerate simple kernels, which makes accelerating complete applications on a CGRA while maintaining high energy efficiency an open issue. Moreover, the lack of instruction set architecture (ISA) standardization across CGRAs makes code generation using current compiler technology a major challenge. This work introduces R-Blocks, a ULP CGRA with a HW/SW co-design tool flow based on the OpenASIP toolset. This CGRA is extremely flexible due to its well-established VLIW-SIMD execution model and support for flexible SIMD processing, while maintaining an extremely high energy efficiency using software bypassing, optimized instruction delivery, and local scratchpad memories. R-Blocks is synthesized in a commercial 22-nm FD-SOI technology and achieves a full-system energy efficiency of 115 MOPS/mW on a common FFT benchmark, 1.45× higher than a highly tuned embedded RISC-V processor. Comparable energy efficiency is obtained on multiple complex workloads, making R-Blocks a promising acceleration target for general-purpose computing.
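
For intuition, a full-system figure of 115 MOPS/mW corresponds to roughly 8.7 pJ per operation. A minimal sketch of that arithmetic, with the RISC-V baseline back-calculated from the 1.45× ratio rather than taken from the paper:

```python
# Back-of-the-envelope arithmetic for the energy-efficiency figures quoted above.
# The RISC-V baseline is inferred from the 1.45x ratio; it is not stated in the abstract.

rblocks_mops_per_mw = 115.0     # reported full-system efficiency on the FFT benchmark
ratio_vs_riscv = 1.45           # reported advantage over the tuned embedded RISC-V core

# MOPS/mW is numerically the same as operations per nanojoule,
# so energy per operation is simply its reciprocal.
energy_per_op_pj = 1e3 / rblocks_mops_per_mw
riscv_mops_per_mw = rblocks_mops_per_mw / ratio_vs_riscv

print(f"R-Blocks: {energy_per_op_pj:.1f} pJ per operation")
print(f"RISC-V baseline (inferred): {riscv_mops_per_mw:.0f} MOPS/mW")
```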

Citations: 0
Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-04-04, DOI: 10.1145/3656177
Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang

Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. While hardware accelerators for Transformer-based models have been extensively studied, the majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead.

This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs. Our approach involves the specialization of distinct hardware units for specific operators or layers, facilitating direct communication between them through a dataflow architecture while minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. This model can be extended to multi-FPGA settings for distributed inference. Through our analysis, we can identify the most effective parallelization and buffering schemes for the accelerator and, crucially, determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart.

To enable more productive implementations of an LLM on FPGAs, we further provide a library of high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as open-source. To validate the effectiveness of both our analytical model and HLS library, we have implemented BERT and GPT2 on an AMD Xilinx Alveo U280 FPGA device. Experimental results demonstrate our approach can achieve up to 13.4× speedup when compared to previous FPGA-based accelerators for the BERT model. For GPT generative inference, we attain a 2.2× speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9× speedup and a 5.7× improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.
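
The abstract does not spell out the analytical model itself; as a rough illustration of the style of estimate such a model produces, the sketch below bounds one layer's latency by compute throughput and off-chip bandwidth. All parameter values are illustrative placeholders, not figures from the paper.

```python
# Hedged sketch: a roofline-style latency bound for one Transformer layer on an FPGA,
# illustrating the kind of analytical estimate described above. Numbers are illustrative only.

def layer_latency_s(flops, bytes_moved, peak_flops_per_s, dram_bw_bytes_per_s):
    """Latency is bounded below by whichever resource saturates first."""
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / dram_bw_bytes_per_s
    return max(compute_time, memory_time)

# Illustrative GEMM in a decode step: (1 x 4096) x (4096 x 4096), fp16 weights.
flops = 2 * 1 * 4096 * 4096              # multiply-accumulate counted as 2 FLOPs
weight_bytes = 4096 * 4096 * 2           # weights dominate traffic at batch size 1
latency = layer_latency_s(flops, weight_bytes,
                          peak_flops_per_s=8e12,       # assumed DSP throughput
                          dram_bw_bytes_per_s=460e9)   # assumed HBM bandwidth
print(f"estimated layer latency: {latency * 1e6:.1f} us")   # memory-bound in this example
```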

Citations: 0
DANSEN: Database Acceleration on Native Computational Storage by Exploiting NDP
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-04-04, DOI: 10.1145/3655625
Sajjad Tamimi, Arthur Bernhardt, Florian Stock, Ilia Petrov, Andreas Koch

This paper introduces DANSEN, the hardware accelerator component for neoDBMS, a full-stack computational storage system designed to manage on-device execution of database queries/transactions as a Near-Data Processing (NDP) operation. The proposed system enables Database Management Systems (DBMS) to offload NDP-operations to the storage while maintaining control over data through a native storage interface. DANSEN provides an NDP-engine that enables the DBMS to perform both low-level database tasks, such as database administration, and high-level tasks, such as executing SQL, on the smart storage device while observing the DBMS concurrency control. Furthermore, DANSEN enables the incorporation of custom accelerators as an NDP-operation, e.g., to perform hardware-accelerated ML inference directly on the stored data. We built the DANSEN storage prototype and interface on an UltraScale+ HBM FPGA and fully integrated it with PostgreSQL 12. Experimental results demonstrate that the proposed NDP approach outperforms software-only PostgreSQL using a fast off-the-shelf NVMe drive, and significantly improves the end-to-end execution time of an aggregation operation (similar to Q6 from CH-benCHmark, 150 million records) by ≈10.6×. The versatility of the proposed approach is also validated by integrating a compute-intensive data analytics application with multi-row results, outperforming PostgreSQL by ≈1.5×.
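
The core claim is that pushing the aggregation into the storage device avoids shipping raw rows to the host. A purely conceptual sketch of that data-movement difference follows; the function names and row footprint are hypothetical and do not reflect DANSEN's actual interface.

```python
# Conceptual sketch of near-data processing for an aggregation query.
# The functions below are hypothetical stand-ins, not DANSEN's real API.

rows = [{"qty": i % 50, "price": 9.99} for i in range(100_000)]
ROW_BYTES = 16  # assumed on-device row footprint

def host_side_aggregate(rows):
    """Baseline: every row crosses the storage interface, the host sums revenue."""
    transferred = len(rows) * ROW_BYTES
    total = sum(r["qty"] * r["price"] for r in rows)
    return total, transferred

def ndp_aggregate(rows):
    """NDP: the device computes the sum and returns a single scalar."""
    total = sum(r["qty"] * r["price"] for r in rows)  # runs "on the device"
    return total, 8                                   # only the 8-byte result moves

for name, fn in [("host-side", host_side_aggregate), ("NDP", ndp_aggregate)]:
    total, moved = fn(rows)
    print(f"{name:10s} result={total:,.0f}  bytes moved={moved:,}")
```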

Citations: 0
HLPerf: Demystifying the Performance of HLS-based Graph Neural Networks with Dataflow Architectures
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-04-02, DOI: 10.1145/3655627
Chenfeng Zhao, Clayton J. Faber, Roger D. Chamberlain, Xuan Zhang

The development of FPGA-based applications using HLS is fraught with performance pitfalls and large design space exploration times. These issues are exacerbated when the application is complicated and its performance is dependent on the input data set, as is often the case with graph neural network approaches to machine learning. Here, we introduce HLPerf, an open-source, simulation-based performance evaluation framework for dataflow architectures that both supports early exploration of the design space and shortens the performance evaluation cycle. We apply the methodology to GNNHLS, an HLS-based graph neural network benchmark containing 6 commonly used graph neural network models and 4 datasets with distinct topologies and scales. The results show that HLPerf achieves over 10,000× average simulation speedup relative to RTL simulation and over 400× speedup relative to state-of-the-art cycle-accurate tools, at the cost of a 7% mean error rate relative to actual FPGA implementation performance. This speedup positions HLPerf as a viable component in the design cycle.
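
As a rough illustration of why estimating at the dataflow-graph level can be so much faster than RTL simulation, the sketch below models a pipeline of kernels analytically from their initiation intervals instead of simulating cycles. This is a generic textbook-style model, not the cost model HLPerf actually implements.

```python
# Hedged sketch: estimating the latency of a dataflow pipeline from per-kernel
# initiation intervals (II) and pipeline depths, instead of cycle-accurate simulation.

def pipeline_cycles(num_items, stages):
    """stages: list of (initiation_interval, pipeline_depth) per kernel."""
    # Steady-state throughput is set by the slowest stage; depths add fill latency.
    bottleneck_ii = max(ii for ii, _ in stages)
    fill = sum(depth for _, depth in stages)
    return fill + bottleneck_ii * (num_items - 1)

stages = [(1, 12), (2, 35), (1, 8)]        # illustrative kernels
print(pipeline_cycles(100_000, stages))    # ~200,053 cycles
```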

Citations: 0
PTME: A Regular Expression Matching Engine Based on Speculation and Enumerative Computation on FPGA
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-04-01, DOI: 10.1145/3655626
Mingqian Sun, Guangwei Xie, Fan Zhang, Wei Guo, Xitian Fan, Tianyang Li, Li Chen, Jiayu Du

Fast regular expression matching is an essential task for deep packet inspection. In previous work, regular expression matching engines on FPGAs have struggled to achieve an ideal balance between resource consumption and throughput. Speculation and enumerative computation exploit the statistical properties of deterministic finite automata, allowing for more efficient pattern matching. Existing related designs mostly revolve around vector instructions and multiple processors/cores or SIMD instruction sets, and FPGA implementations are lacking. We design a parallelized two-character matching engine on FPGA that efficiently filters out fields with no pattern features. We transform the sequentially dependent state transitions into a problem over elements of one set, enabling the proposed design to achieve high throughput with low resource consumption and to support dynamic updates. Results show that, compared with traditional DFA matching, with a maximum resource consumption of 25% of on-chip FFs (74,323/1,045,440) and LUTs (123,902/522,720), throughput improves by 8.08-229.96× (an 87.61-99.56% speed-up) for normal traffic and by 11.73-39.59× (a 91.47-97.47% speed-up) for traffic with high-frequency match hits. Compared with state-of-the-art similar implementations, our circuit on a single FPGA chip outperforms existing multi-core designs.
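
For readers unfamiliar with speculative/enumerative DFA matching, the sketch below shows the general idea on a toy automaton: the input is split into chunks, each chunk is processed from every possible start state (so chunks are independent and could run in parallel), and the per-chunk transition maps are composed afterwards. This illustrates the technique in general, not PTME's hardware design.

```python
# Hedged sketch of enumerative (speculative) DFA matching on a toy automaton
# over {'a','b'} that accepts any string containing the substring "ab".

STATES = [0, 1, 2]                 # state 2 is accepting and absorbing

def step(state, ch):
    if state == 2:
        return 2
    if ch == 'a':
        return 1
    return 2 if state == 1 else 0

def run(state, text):
    for ch in text:
        state = step(state, ch)
    return state

def chunk_map(chunk):
    """For one chunk, record the end state reached from every possible start state."""
    return {s: run(s, chunk) for s in STATES}

text = "bbaababbbab" * 3
chunks = [text[i:i + 4] for i in range(0, len(text), 4)]
maps = [chunk_map(c) for c in chunks]      # independent work, parallelizable

state = 0
for m in maps:                             # cheap sequential composition at the end
    state = m[state]
print("match" if state == 2 else "no match")
```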

Citations: 0
Design and implementation of hardware-software architecture based on hashes for SPHINCS+
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-03-27, DOI: 10.1145/3653459
Jonathan López-Valdivieso, René Cumplido

Advances in quantum computing have posed a future threat to today’s cryptography. With the advent of these quantum computers, security could be compromised. Therefore, the National Institute of Standards and Technology (NIST) has issued a request for proposals to standardize algorithms for post-quantum cryptography (PQC), based on problems considered difficult to solve for both classical and quantum computers. Among the proposed technologies, the most popular choices are lattice-based (shortest vector problem) and hash-based approaches. Other important categories are public-key encryption (PKE) and digital signatures.

Within the realm of digital signatures lies SPHINCS+. However, there are few implementations of this scheme in hardware architectures. In this article, we present a hardware-software architecture for the SPHINCS+ scheme. We utilized a free RISC-V (Reduced Instruction Set Computer) processor synthesized on a Field Programmable Gate Array (FPGA), primarily integrating two accelerator modules for Keccak-1600 and the Haraka hash function. Additionally, modifications were made to the processor to accommodate the execution of these added modules. Our implementation yielded a 15-fold increase in performance with the SHAKE-256 function and a nearly 90-fold improvement when using Haraka, compared to the reference software. Moreover, it is more compact than related works. This implementation was realized on a Xilinx Arty S7 (Spartan-7) FPGA.
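
SPHINCS+ itself combines WOTS+ chains, Merkle trees, and FORS, which is why hash-function throughput dominates and motivates the Keccak/Haraka accelerators above. As a minimal, purely didactic illustration of the hash-chain principle hash-based signatures build on, the sketch below signs a single 4-bit value with a Winternitz-style chain; it is not the SPHINCS+ construction or the paper's implementation.

```python
# Minimal Winternitz-style hash-chain example for one 4-bit message chunk.
# Teaching illustration only: a real scheme also signs a checksum so an attacker
# cannot forge by advancing the chain, and SPHINCS+ layers many such chains.

import hashlib, os

W = 16  # chain length: one chunk encodes a value in [0, 15]

def H(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def chain(x: bytes, steps: int) -> bytes:
    for _ in range(steps):
        x = H(x)
    return x

sk = os.urandom(32)           # secret key: start of the chain
pk = chain(sk, W)             # public key: end of the chain

msg = 11                      # the 4-bit value being signed
sig = chain(sk, msg)          # signature: reveal the chain element at position msg

# Verifier walks the remaining W - msg steps and compares against the public key.
assert chain(sig, W - msg) == pk
print("signature verifies")
```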

Citations: 0
FADO: Floorplan-Aware Directive Optimization Based on Synthesis and Analytical Models for High-Level Synthesis Designs on Multi-Die FPGAs
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-03-20, DOI: 10.1145/3653458
Linfeng Du, Tingyuan Liang, Xiaofeng Zhou, Jinming Ge, Shangkun Li, Sharad Sinha, Jieru Zhao, Zhiyao Xie, Wei Zhang

Multi-die FPGAs are widely adopted for large-scale accelerators, but optimizing high-level synthesis designs on these FPGAs faces two challenges. First, the delay caused by die-crossing nets creates an NP-hard floorplanning problem. Second, traditional directive optimization cannot consider resource constraints on each die or the timing issue incurred by the die-crossings. Furthermore, the high algorithmic complexity and the large scale lead to extended runtime for legalizing the floorplan of HLS designs under different directive configurations.

To co-optimize the directives and floorplan of HLS designs on multi-die FPGAs, we formulate the co-search based on bin-packing variants and present two iterative optimization flows. The first (FADO 1.0) relies on a pre-built QoR library. It involves a greedy, latency-bottleneck-guided directive search and an incremental floorplan legalization. Compared with a global floorplanning solution, it shortens search time by 693× to 4,925× and achieves 1.16× to 8.78× better design performance, measured in workload execution time.

To remove the time-consuming QoR library generation, the second flow (FADO 2.0) integrates an analytical QoR model and redesigns the directive search to accelerate convergence. Through experiments on mixed dataflow and non-dataflow designs, FADO 2.0 further yields a 1.40× better design performance on average than FADO 1.0 after implementation on the Alveo U250 FPGA.
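
To make the bin-packing view of multi-die floorplanning concrete, the sketch below greedily assigns kernels to dies under per-die resource budgets. All kernel sizes and budgets are invented for illustration, and the real FADO flow additionally co-optimizes directives and die-crossing timing, which this sketch ignores.

```python
# Hedged sketch: first-fit-decreasing assignment of HLS kernels to FPGA dies
# under per-die resource budgets. Illustrative numbers, not FADO's algorithm.

DIE_BUDGET = {"LUT": 400_000, "DSP": 3_000}            # assumed per-die capacity
kernels = [                                            # (name, LUTs, DSPs) - illustrative
    ("gemm",      250_000, 2_000),
    ("softmax",    90_000,   300),
    ("layernorm",  60_000,   200),
    ("attention", 220_000, 1_500),
]

dies = [dict(DIE_BUDGET, members=[]) for _ in range(3)]

# Place the largest kernels first so the tight ones still find room.
for name, lut, dsp in sorted(kernels, key=lambda k: -k[1]):
    for die in dies:
        if die["LUT"] >= lut and die["DSP"] >= dsp:
            die["LUT"] -= lut
            die["DSP"] -= dsp
            die["members"].append(name)
            break
    else:
        raise RuntimeError(f"{name} does not fit on any die")

for i, die in enumerate(dies):
    print(f"die {i}: {die['members']}")
```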

Citations: 0
Designing an IEEE-compliant FPU that supports configurable precision for soft processors
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-03-15, DOI: 10.1145/3650036
Chris Keilbart, Yuhui Gao, Martin Chua, Eric Matthews, Steven J.E. Wilton, Lesley Shannon

Field Programmable Gate Arrays (FPGAs) are commonly used to accelerate floating-point (FP) applications. Although researchers have extensively studied FPGA FP implementations, existing work has largely focused on standalone operators and frequency-optimized designs. These works are not suitable for FPGA soft processors, which are more sensitive to latency, impose a lower frequency ceiling, and require IEEE FP standard compliance. We present an open-source floating-point unit (FPU) for FPGA RISC-V soft processors that is fully IEEE compliant with configurable levels of FP precision. Our design emphasizes runtime performance, with 25% lower latency on the most common instructions than previous works, while maintaining efficient resource utilization.

Our FPU also allows users to explore various mantissa widths without having to rewrite or recompile their algorithms. We use this to investigate the scalability of our reduced-precision FPU across numerous microbenchmark functions as well as more complex case studies. Our experiments show that applications like the discrete cosine transformation and the Black-Scholes model can realize a speedup of more than 1.35× together with a 43% and 35% reduction in lookup-table and flip-flop resources, respectively, while experiencing less than a 0.025% average loss in numerical accuracy with a 16-bit mantissa width.
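
One way to build intuition for the precision knob described above is to emulate a reduced mantissa width in software and observe the error before committing to an FPU configuration. The sketch below simply truncates a float64's mantissa; it mimics the idea but not the paper's hardware or its rounding behaviour.

```python
# Hedged sketch: emulating a reduced mantissa width by truncating a float64's
# 52-bit mantissa, to gauge the accuracy impact of narrower FP hardware.

import struct

def truncate_mantissa(x: float, bits: int) -> float:
    """Keep only the top `bits` of the 52-bit mantissa (truncation, i.e. round toward zero)."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    mask = ~((1 << (52 - bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    return struct.unpack("<d", struct.pack("<Q", raw & mask))[0]

xs = [0.1 * i for i in range(1, 1000)]
for bits in (10, 16, 23):
    approx = [truncate_mantissa(x, bits) for x in xs]
    rel_err = max(abs(a - x) / x for a, x in zip(approx, xs))
    print(f"{bits:2d}-bit mantissa: worst relative error {rel_err:.2e}")
```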

Citations: 0
L-FNNG: Accelerating Large-Scale KNN Graph Construction on CPU-FPGA Heterogeneous Platform
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-03-14, DOI: 10.1145/3652609
Chaoqiang Liu, Xiaofei Liao, Long Zheng, Yu Huang, Haifeng Liu, Yi Zhang, Haiheng He, Haoyan Huang, Jingyi Zhou, Hai Jin

Due to the high complexity of constructing exact k-nearest neighbor graphs, approximate construction has become a popular research topic. The NN-Descent algorithm is one of the representative in-memory algorithms. To effectively handle large datasets, existing state-of-the-art solutions combine the divide-and-conquer approach and the NN-Descent algorithm, where large datasets are divided into multiple partitions, and a subgraph is constructed for each partition before all the subgraphs are merged, reducing the memory pressure significantly. However, such solutions fail to address inefficiencies in large-scale k-nearest neighbor graph construction. In this paper, we propose L-FNNG, a novel solution for accelerating large-scale k-nearest neighbor graph construction on a CPU-FPGA heterogeneous platform. The CPU is responsible for dividing data and determining the order of partition processing, while the FPGA executes all construction tasks to utilize the acceleration capability fully. To accelerate the execution of construction tasks, we design an efficient FPGA accelerator, which includes the Block-based Scheduling (BS) and Useless Computation Aborting (UCA) techniques to address the problems of memory access and computation in the NN-Descent algorithm. We also propose an efficient scheduling strategy that includes a KD-tree-based data partitioning method and a hierarchical processing method to address scheduling inefficiency. We evaluate L-FNNG on a Xilinx Alveo U280 board hosted by a 64-core Xeon server. On multiple large-scale datasets, L-FNNG achieves, on average, 2.3× construction speedup over the state-of-the-art GPU-based solution.
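
The divide-and-conquer idea referenced above can be illustrated in a few lines: split the points with a KD-tree-style median cut, build a small exact KNN subgraph per partition, then merge. This sketch only shows that partitioning step; L-FNNG's NN-Descent refinement, merging across partitions, and FPGA scheduling are all omitted.

```python
# Hedged sketch of partition-then-build KNN graph construction on toy 2-D data.

import random

random.seed(0)
points = [(random.random(), random.random()) for _ in range(2000)]
K = 5

def median_split(idx, axis):
    idx = sorted(idx, key=lambda i: points[i][axis])
    mid = len(idx) // 2
    return idx[:mid], idx[mid:]

def knn_subgraph(idx):
    """Exact KNN inside one partition (brute force is fine at this size)."""
    graph = {}
    for i in idx:
        dists = sorted(
            ((points[i][0] - points[j][0]) ** 2 + (points[i][1] - points[j][1]) ** 2, j)
            for j in idx if j != i
        )
        graph[i] = [j for _, j in dists[:K]]
    return graph

left, right = median_split(range(len(points)), axis=0)
graph = {}
for part in (left, right):
    graph.update(knn_subgraph(part))   # real systems then merge/refine across partitions
print(len(graph), "vertices with", K, "approximate neighbours each")
```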

Citations: 0
Introduction to the Special Issue on FPL 2022
IF 2.3, CAS Tier 4 (Computer Science), Q1 Computer Science, Pub Date: 2024-03-13, DOI: 10.1145/3643474
Andreas Koch, Kentaro Sano
{"title":"Introduction to the Special Issue on FPL 2022","authors":"Andreas Koch, Kentaro Sano","doi":"10.1145/3643474","DOIUrl":"https://doi.org/10.1145/3643474","url":null,"abstract":"","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":null,"pages":null},"PeriodicalIF":2.3,"publicationDate":"2024-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140245587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0