
Latest Publications in IEEE Computer Architecture Letters

FPGA-Accelerated Data Preprocessing for Personalized Recommendation Systems
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-11-28 | DOI: 10.1109/LCA.2023.3336841
Hyeseong Kim;Yunjae Lee;Minsoo Rhu
Deep neural network (DNN)-based recommendation systems (RecSys) are among the most successfully deployed machine learning applications in commercial services, predicting ad click-through rates or rankings. While numerous prior works have explored hardware and software solutions to reduce the training time of RecSys, the end-to-end training pipeline, including the data preprocessing stage, has received little attention. In this work, we provide a comprehensive analysis of RecSys data preprocessing, identifying the feature generation and normalization steps as the root cause of a major performance bottleneck. Based on our characterization, we explore the efficacy of an FPGA-accelerated RecSys preprocessing system that achieves a significant 3.4–12.1× end-to-end speedup over the baseline CPU-based RecSys preprocessing system.
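The bottleneck the letter identifies lives in code like the following. This is a minimal NumPy sketch of DLRM-style preprocessing, with invented feature names and table size; the letter's actual pipeline and FPGA kernels are not shown here:

```python
import numpy as np

def preprocess_batch(dense, sparse, num_embedding_rows=1_000_000):
    """Toy RecSys preprocessing: log-normalize dense features and
    hash string categoricals into embedding-table indices."""
    # Normalization: clip negatives, then log(1 + x) to tame heavy tails.
    dense = np.log1p(np.maximum(dense, 0.0))
    # Feature generation: hash each categorical value to a table index.
    indices = np.array([[hash(v) % num_embedding_rows for v in row]
                        for row in sparse])
    return dense, indices

dense, idx = preprocess_batch(
    np.array([[0.0, 7.0], [100.0, -1.0]]),
    [["ad_42", "user_7"], ["ad_9", "user_3"]])
```

On CPUs, both steps are memory-bound per-element transforms over very large batches, which is what makes them attractive FPGA streaming targets.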
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 7-10.
Cited by: 0
Redundant Array of Independent Memory Devices
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-11-20 | DOI: 10.1109/LCA.2023.3334989
Peiyun Wu;Trung Le;Zhichun Zhu;Zhao Zhang
As recent studies have found, DRAM memory reliability is an increasing concern. In this letter, we propose RAIMD (Redundant Array of Independent Memory Devices), an energy-efficient memory organization with RAID-like error protection. In this organization, each memory device works as an independent memory module that serves a whole memory request and supports error detection and error recovery. RAIMD relies on the high data rates of modern memory devices to minimize the performance impact of the increased data transfer time. It provides chip-level error protection similar to Chipkill but with significant energy savings. Our simulation results indicate that RAIMD can save memory energy by 26.3% on average with a small performance overhead of 5.3% on DDR5-4800 memory systems for SPEC2017 multi-core workloads.
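The RAID-like protection can be sketched with XOR parity across devices. This toy model (illustrative data widths, not the letter's DDR5 implementation) shows how one failed device is rebuilt from the survivors:

```python
import numpy as np

def make_parity(device_data):
    """RAID-like protection across independent memory devices:
    the parity device holds the XOR of all data devices."""
    parity = np.zeros_like(device_data[0])
    for d in device_data:
        parity ^= d
    return parity

def recover(device_data, parity, failed):
    """Rebuild one failed device by XOR-ing parity with the survivors."""
    out = parity.copy()
    for i, d in enumerate(device_data):
        if i != failed:
            out ^= d
    return out

devices = [np.array([0xA5, 0x3C], dtype=np.uint8),
           np.array([0x5A, 0xC3], dtype=np.uint8),
           np.array([0xFF, 0x00], dtype=np.uint8)]
p = make_parity(devices)
assert (recover(devices, p, failed=1) == devices[1]).all()
```

Because each device serves a whole request independently, the cost of this scheme is the extra transfer time per device, which high per-device data rates amortize.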
IEEE Computer Architecture Letters, vol. 22, no. 2, pp. 181-184.
Cited by: 0
Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-11-17 | DOI: 10.1109/LCA.2023.3333759
Haocong Luo;Yahya Can Tuğrul;F. Nisa Bostancı;Ataberk Olgun;A. Giray Yağlıkçı;Onur Mutlu
We present Ramulator 2.0, a highly modular and extensible DRAM simulator that enables rapid and agile implementation and evaluation of design changes in the memory controller and DRAM, supporting the growing research effort to improve the performance, security, and reliability of memory systems. Ramulator 2.0 abstracts and models the key components of a DRAM-based memory system, and their interactions, as shared interfaces and independent implementations. Doing so enables easy modification and extension of the modeled functions of the memory controller and DRAM. The DRAM specification syntax of Ramulator 2.0 is concise and human-readable, facilitating easy modifications and extensions. Ramulator 2.0 implements a library of reusable templated lambda functions that model the functionality of DRAM commands, simplifying the implementation of new DRAM standards, including DDR5, LPDDR5, HBM3, and GDDR6. We showcase Ramulator 2.0's modularity and extensibility by implementing and evaluating a wide variety of RowHammer mitigation techniques that require different memory controller design changes. These techniques are added modularly as separate implementations, without changing any code in the baseline memory controller implementation. Ramulator 2.0 is rigorously validated and maintains a fast simulation speed compared to existing cycle-accurate DRAM simulators.
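The shared-interface/independent-implementation split can be illustrated in miniature. Ramulator 2.0 itself is C++, and all class names below are invented for illustration; the sketch only shows how a RowHammer mitigation can plug into a controller without touching the controller's code:

```python
from abc import ABC, abstractmethod

class RowHammerMitigation(ABC):
    """Shared interface: the controller sees only this contract."""
    @abstractmethod
    def on_activate(self, row: int) -> bool:
        """Return True if neighbor rows should be refreshed."""

class NoMitigation(RowHammerMitigation):
    def on_activate(self, row):
        return False

class CounterBased(RowHammerMitigation):
    """Independent implementation: per-row activation counters."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counts = {}
    def on_activate(self, row):
        self.counts[row] = self.counts.get(row, 0) + 1
        if self.counts[row] >= self.threshold:
            self.counts[row] = 0
            return True
        return False

class MemoryController:
    """Baseline controller: closed for modification, mitigations plug in."""
    def __init__(self, mitigation: RowHammerMitigation):
        self.mitigation = mitigation
        self.refreshes = 0
    def activate(self, row):
        if self.mitigation.on_activate(row):
            self.refreshes += 1

mc = MemoryController(CounterBased(threshold=3))
for _ in range(7):
    mc.activate(row=5)
assert mc.refreshes == 2  # triggered at the 3rd and 6th activation
```

Swapping `CounterBased` for another implementation changes the experiment without editing `MemoryController`, which is the extensibility claim in the abstract.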
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 112-116.
Cited by: 0
Towards an Accelerator for Differential and Algebraic Equations Useful to Scientists
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-11-13 | DOI: 10.1109/LCA.2023.3332318
Jonathan Garcia-Mallen;Shuohao Ping;Alex Miralles-Cordal;Ian Martin;Mukund Ramakrishnan;Yipeng Huang
We discuss our preliminary results in building a configurable accelerator for differential equation time stepping and iterative methods for algebraic equations. Relative to prior efforts in building hardware accelerators for numerical methods, our focus is on the following: 1) Demonstrating a higher order of numerical convergence that is needed to actually support existing numerical algorithms. 2) Providing the capacity for wide vectors of variables by keeping the hardware design components as simple as possible. 3) Demonstrating configurable hardware support for a variety of numerical algorithms that form the core of scientific computation libraries. These efforts are toward the goal of making the accelerator democratically accessible by computational scientists.
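The "higher order of numerical convergence" in point 1 can be made concrete by comparing forward Euler (order 1) with classical RK4 (order 4) on y' = -y: halving the step size cuts the Euler error by about 2x but the RK4 error by about 16x. A minimal sketch, not the accelerator's datapath:

```python
import math

def euler(f, y0, t_end, n):
    """Forward Euler: first-order accurate time stepping."""
    h, y, t = t_end / n, y0, 0.0
    for _ in range(n):
        y += h * f(t, y)
        t += h
    return y

def rk4(f, y0, t_end, n):
    """Classical Runge-Kutta: fourth-order accurate time stepping."""
    h, y, t = t_end / n, y0, 0.0
    for _ in range(n):
        k1 = f(t, y)
        k2 = f(t + h/2, y + h/2 * k1)
        k3 = f(t + h/2, y + h/2 * k2)
        k4 = f(t + h, y + h * k3)
        y += h/6 * (k1 + 2*k2 + 2*k3 + k4)
        t += h
    return y

f = lambda t, y: -y            # y' = -y, exact solution e^{-t}
exact = math.exp(-1.0)
err = lambda step, n: abs(step(f, 1.0, 1.0, n) - exact)
# Halving h: Euler error shrinks ~2x (order 1), RK4 error ~16x (order 4).
assert 1.8 < err(euler, 100) / err(euler, 200) < 2.2
assert 14 < err(rk4, 100) / err(rk4, 200) < 18
```

Hardware that only supports first-order stepping cannot run the RK4-style methods that scientific libraries actually use, which is why demonstrating higher-order convergence is listed first.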
IEEE Computer Architecture Letters, vol. 22, no. 2, pp. 185-188.
Cited by: 0
gem5-accel: A Pre-RTL Simulation Toolchain for Accelerator Architecture Validation
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-11-01 | DOI: 10.1109/LCA.2023.3329443
João Vieira;Nuno Roma;Gabriel Falcao;Pedro Tomás
Attaining the performance and efficiency levels required by modern applications often requires the use of application-specific accelerators. However, writing synthesizable Register-Transfer Level code for such accelerators is a complex, expensive, and time-consuming process, which is cumbersome for early architecture development phases. To tackle this issue, a pre-synthesis simulation toolchain is herein proposed that facilitates the early architectural evaluation of complex accelerators aggregated to multi-level memory hierarchies. To demonstrate its usefulness, the proposed gem5-accel is used to model a tensor accelerator based on Gemmini, showing that it can successfully anticipate the results of complex hardware accelerators executing deep Neural Networks.
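A pre-RTL model trades accuracy for speed by replacing synthesizable logic with analytic timing. The toy roofline-style estimator below (invented parameters, far simpler than gem5-accel's simulation model) illustrates the kind of question such a toolchain answers early, before any Register-Transfer Level code exists:

```python
def pre_rtl_cycles(macs, bytes_moved, lanes=256, bytes_per_cycle=64,
                   mem_latency=100):
    """Back-of-envelope pre-RTL timing: the accelerator is either
    compute-bound or bandwidth-bound, plus a fixed DRAM access latency.
    All parameters are illustrative, not gem5-accel's."""
    compute = macs / lanes                  # one MAC per lane per cycle
    memory = bytes_moved / bytes_per_cycle  # streaming transfer time
    return mem_latency + max(compute, memory)

# A 1M-MAC GEMM tile streaming 64 KiB is compute-bound at these parameters.
gemm_cycles = pre_rtl_cycles(macs=1 << 20, bytes_moved=1 << 16)
```

Sweeping `lanes` or `bytes_per_cycle` in such a model is how early architecture exploration prunes designs before committing to RTL.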
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 1-4.
Cited by: 0
Reducing the Silicon Area Overhead of Counter-Based Rowhammer Mitigations
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-10-31 | DOI: 10.1109/LCA.2023.3328824
Loïc France;Florent Bruguier;David Novo;Maria Mushtaq;Pascal Benoit
Modern computer memories have been shown to have reliability issues. Main memory is the target of a security threat called Rowhammer, in which repeated accesses to aggressor rows cause bit flips in adjacent victim cells. Numerous countermeasures have been proposed; some of the most efficient rely on row access counters, with different techniques to reduce the impact on performance, energy consumption, and silicon area. In these proposals, the number of counters is calculated from the maximum number of row activations that can be issued to the protected bank. Since reducing the number of counters lowers the silicon area and energy overheads, it has a direct impact on production and usage costs. In this work, we demonstrate that two of the most efficient countermeasures can have their silicon area overhead reduced by approximately 50%, without affecting the protection level, by changing their counting granularity.
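The granularity trade-off can be illustrated with a toy area model. This is our own illustrative model, not the letter's: each counter entry stores a row tag plus a count, so tracking groups of rows shrinks every tag. The letter's ~50% savings also involve effects this sketch does not capture.

```python
import math

def counter_area_bits(num_rows, max_activations, threshold, granularity=1):
    """Illustrative area model for a counter-based Rowhammer mitigation:
    the counter table is sized from the worst-case number of rows that
    can reach the activation threshold, and each entry holds a
    row-group tag plus a saturating count. Coarser granularity (more
    rows per counter) shortens every tag."""
    num_counters = math.ceil(max_activations / threshold)
    tag_bits = math.ceil(math.log2(num_rows // granularity))
    counter_bits = math.ceil(math.log2(threshold))
    return num_counters * (tag_bits + counter_bits)

fine = counter_area_bits(num_rows=1 << 17, max_activations=1_300_000,
                         threshold=4800)
coarse = counter_area_bits(num_rows=1 << 17, max_activations=1_300_000,
                           threshold=4800, granularity=4)
assert coarse < fine  # coarser tracking trades precision for area
```

The numbers (bank size, activation budget, threshold) are placeholders chosen only to exercise the model.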
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 61-64.
Cited by: 0
Architectural Security Regulation
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-10-31 | DOI: 10.1109/LCA.2023.3327952
Adam Hastings;Ryan Piersma;Simha Sethumadhavan
Across the world, governments are instituting regulations with the goal of improving the state of computer security. In this paper, we propose how security regulation can be formulated and implemented at the architectural level. Our proposal, called FAIRSHARE, requires architects to spend a pre-determined fraction of system resources (e.g., execution cycles) on security but leaves the decision of how and where to spend this budget up to the architects of these systems. We discuss how this can elevate security and outline the key architectural support necessary to implement such a solution. Our work is the first at the intersection of architecture and regulation.
IEEE Computer Architecture Letters, vol. 22, no. 2, pp. 173-176.
Cited by: 0
A Quantum Computer Trusted Execution Environment
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-10-19 | DOI: 10.1109/LCA.2023.3325852
Theodoros Trochatos;Chuanqi Xu;Sanjay Deshpande;Yao Lu;Yongshan Ding;Jakub Szefer
We present the first architecture for a trusted execution environment for quantum computers. In this architecture, the user's circuits are protected by obfuscating them with decoy control pulses added during circuit transpilation. The decoy pulses are removed, i.e., attenuated, by trusted hardware inside the superconducting quantum computer's fridge before they reach the qubits. This preliminary work demonstrates that protection from possibly malicious cloud providers is feasible with minimal hardware cost.
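The decoy-pulse scheme can be sketched end to end: the user inserts flagged decoys at transpilation time, and the trusted attenuator zeroes exactly those entries before the pulses reach the qubits. All names and the amplitude encoding below are illustrative, not the paper's pulse format:

```python
import random

def obfuscate(pulses, num_decoys, rng):
    """User side: interleave decoy control pulses at random positions,
    recording which entries are decoys in a secret key. A provider
    inspecting the schedule sees real and decoy pulses alike."""
    schedule = [(amp, False) for amp in pulses]
    for _ in range(num_decoys):
        schedule.insert(rng.randrange(len(schedule) + 1),
                        (rng.uniform(0.1, 1.0), True))
    key = [is_decoy for _, is_decoy in schedule]
    return [amp for amp, _ in schedule], key

def attenuate(schedule, key):
    """Trusted hardware inside the fridge: attenuate the flagged
    pulses to zero so only the real circuit reaches the qubits."""
    return [0.0 if is_decoy else amp for amp, is_decoy in zip(schedule, key)]

rng = random.Random(7)
real = [0.5, 0.25, 0.8]
obfuscated, key = obfuscate(real, num_decoys=4, rng=rng)
assert [a for a in attenuate(obfuscated, key) if a != 0.0] == real
```

Security rests on the key staying inside the trusted hardware; the cloud provider only ever handles the obfuscated schedule.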
IEEE Computer Architecture Letters, vol. 22, no. 2, pp. 177-180.
Cited by: 0
Architectural Implications of GNN Aggregation Programming Abstractions
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-10-19 | DOI: 10.1109/LCA.2023.3326170
Yingjie Qi;Jianlei Yang;Ao Zhou;Tong Qiao;Chunming Hu
Graph neural networks (GNNs) have gained significant popularity due to their powerful capability to extract useful representations from graph data. As the need for efficient GNN computation intensifies, a variety of programming abstractions designed to optimize GNN aggregation have emerged to facilitate acceleration. However, there is no comprehensive evaluation and analysis of existing abstractions, and thus no clear consensus on which approach is better. In this letter, we classify existing programming abstractions for GNN aggregation along the dimensions of data organization and propagation method. By constructing these abstractions on a state-of-the-art GNN library, we perform a thorough and detailed characterization study comparing their performance and efficiency, and based on our analysis we provide several insights for future GNN acceleration.
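Two common points in the design space being classified can be sketched as an edge-centric gather/scatter abstraction versus a matrix (SpMM-style) abstraction; both compute the same sum aggregation but stress the memory system very differently. A NumPy sketch on a toy 3-node graph (dense adjacency for brevity; real systems use sparse formats):

```python
import numpy as np

def aggregate_scatter(edges, x):
    """Edge-centric abstraction: walk the edge list and scatter-add
    each source node's features into its destination node."""
    out = np.zeros_like(x)
    for src, dst in edges:
        out[dst] += x[src]
    return out

def aggregate_spmm(edges, x, n):
    """Matrix abstraction: build the adjacency matrix and perform the
    whole aggregation as one matrix multiply."""
    a = np.zeros((n, n))
    for src, dst in edges:
        a[dst, src] = 1.0
    return a @ x

edges = [(0, 1), (2, 1), (1, 0)]
x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
assert np.allclose(aggregate_scatter(edges, x), aggregate_spmm(edges, x, 3))
```

The scatter form has irregular, data-dependent writes; the SpMM form has regular compute but must materialize the adjacency structure, which is the kind of trade-off a characterization study measures.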
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 125-128.
Cited by: 0
A Hardware-Friendly Tiled Singular-Value Decomposition-Based Matrix Multiplication for Transformer-Based Models
IF 2.3 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-10-13 | DOI: 10.1109/LCA.2023.3323482
Hailong Li;Jaewan Choi;Yongsuk Kwon;Jung Ho Ahn
Transformer-based models have become the backbone of numerous state-of-the-art natural language processing (NLP) tasks, including large language models. Matrix multiplication, a fundamental operation in Transformer-based models, accounts for most of their execution time. While singular value decomposition (SVD) can accelerate this operation by reducing the amount of computation and the memory footprint through rank reduction, it degrades model quality because important information is difficult to preserve. Moreover, this method does not effectively utilize the resources of modern GPUs. In this paper, we propose a hardware-friendly approach: matrix multiplication based on tiled singular value decomposition (TSVD). TSVD divides a matrix into multiple tiles and performs matrix factorization on each tile using SVD. By breaking the process down into smaller regions, TSVD mitigates the loss of important data. We apply the matrices decomposed by TSVD to matrix multiplication, and our TSVD-based matrix multiplication (TSVD-matmul) leverages GPU resources more efficiently than the SVD approach. As a result, TSVD-matmul achieves a speedup of 1.03× to 3.24× over the SVD approach at compression ratios ranging from 2 to 8. When deployed to GPT-2, TSVD not only performs competitively with full fine-tuning on the E2E NLG task but also achieves a speedup of 1.06× to 1.24× at compression ratios of 2 to 8 while increasing accuracy by up to 1.5 BLEU score.
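The TSVD idea can be sketched in NumPy: factor each tile with a truncated SVD, then replace each tile's matmul with two skinny matmuls. The tile size and rank below are illustrative, not the paper's configuration; with rank equal to the tile size the factorization is exact, and smaller ranks trade accuracy for compute:

```python
import numpy as np

def tsvd_factors(w, tile, rank):
    """Factor each tile x tile block of W with a rank-`rank` SVD,
    folding the singular values into the left factor."""
    factors = {}
    for i in range(0, w.shape[0], tile):
        for j in range(0, w.shape[1], tile):
            u, s, vt = np.linalg.svd(w[i:i+tile, j:j+tile],
                                     full_matrices=False)
            factors[(i, j)] = (u[:, :rank] * s[:rank], vt[:rank, :])
    return factors

def tsvd_matmul(x, factors, out_cols, tile):
    """Compute x @ W from the per-tile low-rank factors: two skinny
    matmuls per tile instead of one full-rank tile matmul."""
    out = np.zeros((x.shape[0], out_cols))
    for (i, j), (us, vt) in factors.items():
        out[:, j:j+tile] += (x[:, i:i+tile] @ us) @ vt
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))
x = rng.standard_normal((4, 8))
exact = x @ w
approx = tsvd_matmul(x, tsvd_factors(w, tile=4, rank=4), 8, tile=4)
assert np.allclose(approx, exact)  # rank == tile size: factorization is exact
```

Tiling keeps each SVD local, so a large singular value in one tile cannot crowd out important directions in another, which is the information-preservation argument in the abstract.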
IEEE Computer Architecture Letters, vol. 22, no. 2, pp. 169-172.
Cited by: 0