
IEEE Computer Architecture Letters: Latest Articles

Direct-Coding DNA With Multilevel Parallelism
IF 2.3 | Zone 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-01-17 | DOI: 10.1109/LCA.2024.3355109
Caden Corontzos;Eitan Frachtenberg
The cost and time to sequence entire genomes have been on a steady and rapid decline since the early 2000s, leading to an explosion of genomic data. In contrast, the growth rates for digital storage device capacity, CPU clock speed, and networking bandwidth have been much more moderate. This gap means that the need for storing, transmitting, and processing sequenced genomic data is outpacing the capacities of the underlying technologies. Compounding the problem is the fact that traditional data compression techniques used for natural language or images are not optimal for genomic data. To address this challenge, many data-compression techniques have been developed, offering a range of tradeoffs between compression ratio, computation time, memory requirements, and complexity. This paper focuses on a specific technique on one extreme of this tradeoff, namely two-bit coding, wherein every base in a genomic sequence is compressed from its original 8-bit ASCII representation to a unique two-bit binary representation. Even for this simple direct-coding scheme, current implementations leave room for significant performance improvements. Here, we show that this encoding can exploit multiple levels of parallelism in modern computer architectures to maximize encoding and decoding efficiency. Our open-source implementation achieves encoding and decoding rates of billions of bases per second, which are much higher than previously reported results. In fact, our measured throughput is typically limited only by the speed of the underlying storage media.
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 21-24.
Citations: 0
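The two-bit coding described in the abstract above can be illustrated with a minimal scalar sketch. The bit assignment (A=00, C=01, G=10, T=11), the little-endian packing order within each byte, and the function names `encode` and `decode` are assumptions made for illustration; the abstract does not specify them, and this sketch does not attempt the multilevel parallelism that the paper is about.

```python
# Hypothetical bit assignment; the abstract does not state the actual mapping.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"


def encode(seq: str) -> bytes:
    """Pack an A/C/G/T string into two bits per base, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Raises KeyError on anything other than A, C, G, T (e.g. 'N').
        out[i // 4] |= BASE_TO_BITS[base] << (2 * (i % 4))
    return bytes(out)


def decode(packed: bytes, n_bases: int) -> str:
    """Unpack n_bases bases; the length is needed because the last byte may be padded."""
    return "".join(
        BITS_TO_BASE[(packed[i // 4] >> (2 * (i % 4))) & 0b11]
        for i in range(n_bases)
    )
```

Seven bases occupy two bytes here instead of seven ASCII bytes, which is the 4:1 reduction the abstract refers to.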
UDIR: Towards a Unified Compiler Framework for Reconfigurable Dataflow Architectures
IF 2.3 | Zone 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-12-13 | DOI: 10.1109/LCA.2023.3342130
Nikhil Agarwal;Mitchell Fream;Souradip Ghosh;Brian C. Schwedock;Nathan Beckmann
Specialized hardware accelerators have gained traction as a means to improve energy efficiency over inefficient von Neumann cores. However, as specialized hardware is limited to a few applications, there is increasing interest in programmable, non-von Neumann architectures to improve efficiency on a wider range of programs. Reconfigurable dataflow architectures (RDAs) are a promising design, but the design space is fragmented and, in particular, existing compiler and software stacks are ad hoc and hard to use. Without a robust, mature software ecosystem, RDAs lose much of their advantage over specialized hardware. This letter proposes a unifying dataflow intermediate representation (UDIR) for RDA compilers. Popular von Neumann compiler representations are inadequate for dataflow architectures because they do not represent the dataflow control paradigm, which is the target of many common compiler analyses and optimizations. UDIR introduces contexts to break regions of instruction reuse in programs. Contexts generalize prior dataflow control paradigms, representing where in the program tokens must be synchronized. We evaluate UDIR on four prior dataflow architectures, providing simple rewrite rules to lower UDIR to their respective machine-specific representations, and demonstrate a case study of using UDIR to optimize memory ordering.
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 99-103.
Citations: 0
DRAMA: Commodity DRAM Based Content Addressable Memory
IF 2.3 | Zone 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-12-12 | DOI: 10.1109/LCA.2023.3341830
L. Yavits
Fast parallel search capabilities on large datasets provided by content addressable memories (CAM) are required across multiple application domains. However, compared to RAM, CAMs feature high area overhead and power consumption, and as a result, they scale poorly. The proposed solution, DRAMA, enables CAM, ternary CAM (TCAM) and approximate (similarity) search CAM functionalities in unmodified commodity DRAM. DRAMA performs the compare operation in a bit-serial fashion, where the search pattern (query) is coded in DRAM addresses. A single-bit compare (XNOR) in DRAMA is identical to a regular DRAM read. The AND and OR operations required for NAND CAM and NOR CAM, respectively, are implemented using nonstandard DRAM timing. We evaluate DRAMA on bacterial DNA classification and show that DRAMA can achieve 3.6× higher performance and 19.6× lower power consumption compared to a state-of-the-art CMOS CAM-based genome classification accelerator.
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 65-68.
Citations: 0
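The bit-serial compare described in the DRAMA abstract above (a per-bit XNOR, combined across bit positions with AND) can be modeled in software as follows. This is only a functional sketch of the matching logic, not of the DRAM mechanism: the function name `cam_search`, the list-of-integers storage, and the query-side `care_mask` standing in for ternary matching are all assumptions made for illustration.

```python
def cam_search(words, query, width, care_mask=None):
    """Return indices of stored words equal to query on every cared-about bit."""
    if care_mask is None:
        care_mask = (1 << width) - 1  # binary CAM: every bit matters
    match = [True] * len(words)
    for bit in range(width):  # one bit position per step (bit-serial)
        if not (care_mask >> bit) & 1:
            continue  # don't-care position, skipped as in a ternary match
        q = (query >> bit) & 1
        for row, word in enumerate(words):
            xnor = ((word >> bit) & 1) == q  # single-bit compare
            match[row] = match[row] and xnor  # AND across bit positions
    return [row for row, hit in enumerate(match) if hit]
```

With `words = [0b1010, 0b1000, 0b0110]`, an exact search for `0b1010` matches only row 0, while masking out bit 1 with `care_mask=0b1101` also matches row 1.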
Accelerating Deep Reinforcement Learning via Phase-Level Parallelism for Robotics Applications
IF 2.3 | Zone 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-12-11 | DOI: 10.1109/LCA.2023.3341152
Yang-Gon Kim;Yun-Ki Han;Jae-Kang Shin;Jun-Kyum Kim;Lee-Sup Kim
Deep Reinforcement Learning (DRL) plays a critical role in controlling future intelligent machines like robots and drones. Constantly retrained by newly arriving real-world data, DRL provides optimal autonomous control solutions for adapting to ever-changing environments. However, DRL repeats inference and training, both of which are computationally expensive on resource-constrained mobile/embedded platforms. Even worse, DRL produces a severe hardware underutilization problem due to its unique execution pattern. To overcome the inefficiency of DRL, we propose Train Early Start, a new execution pattern for building efficient DRL algorithms. Train Early Start parallelizes the inference and training execution, hiding the serialized performance bottleneck and improving the hardware utilization dramatically. Compared to the state-of-the-art mobile SoC, Train Early Start achieves a 1.42× speedup and 1.13× higher energy efficiency.
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 41-44.
Citations: 0
Supporting a Virtual Vector Instruction Set on a Commercial Compute-in-SRAM Accelerator
IF 2.3 | Zone 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-12-11 | DOI: 10.1109/LCA.2023.3341389
Courtney Golden;Dan Ilan;Caroline Huang;Niansong Zhang;Zhiru Zhang;Christopher Batten
Recent work has explored compute-in-SRAM as a promising approach to overcome the traditional processor-memory performance gap. The recently released Associative Processing Unit (APU) from GSI Technology is, to our knowledge, the first commercial compute-in-SRAM accelerator. Prior work on this platform has focused on domain-specific acceleration using direct microcode programming and/or specialized libraries. In this letter, we demonstrate the potential for supporting a more general-purpose vector abstraction on the APU. We implement a virtual vector instruction set based on the recently proposed RISC-V Vector (RVV) extensions, analyze tradeoffs in instruction implementations, and perform detailed instruction microbenchmarking to identify performance benefits and overheads. This work is a first step towards general-purpose computing on domain-specific compute-in-SRAM accelerators.
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 29-32.
Citations: 0
Exploiting Intrinsic Redundancies in Dynamic Graph Neural Networks for Processing Efficiency
IF 1.4 | Zone 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-12-07 | DOI: 10.1109/LCA.2023.3340504
Deniz Gurevin;Caiwen Ding;Omer Khan
Modern dynamical systems are rapidly incorporating artificial intelligence to improve the efficiency and quality of complex predictive analytics. To efficiently operate on increasingly large datasets and intrinsically dynamic non-Euclidean data structures, the computing community has turned to Graph Neural Networks (GNNs). We make a key observation that existing GNN processing frameworks do not efficiently handle the intrinsic dynamics in modern GNNs. Dynamic GNN processing operates on the complete static graph at each time step, leading to repetitive redundant computations that severely under-utilize system resources. We propose a novel dynamic graph neural network (DGNN) processing framework that captures the dynamically evolving dataflow of the GNN semantics, i.e., graph embeddings and sparse connections between graph nodes. The framework identifies intrinsic redundancies in node connections and captures representative node-sparse graph information that is readily ingested for processing by the system. Our evaluation on an NVIDIA GPU shows up to 3.5× speedup over the baseline setup that processes all nodes at each time step.
IEEE Computer Architecture Letters, vol. 23, no. 2, pp. 170-174.
Citations: 0
Enhancing the Reach and Reliability of Quantum Annealers by Pruning Longer Chains
IF 2.3 | Zone 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-12-06 | DOI: 10.1109/LCA.2023.3340030
Ramin Ayanzadeh;Moinuddin Qureshi
Analog Quantum Computers (QCs), such as D-Wave's Quantum Annealers (QAs) and QuEra's neutral atom platform, rival their digital counterparts in computing power. Existing QAs boast over 5,700 qubits, but their single-instruction operation model prevents using SWAP operations for making physically distant qubits adjacent. Instead, QAs use an embedding process to chain multiple physical qubits together, representing a program qubit with higher connectivity and reducing effective QA capacity by up to 33×. We observe that, post-embedding, nearly 25% of physical qubits remain unused, becoming trapped between chains. Additionally, we observe a "Power-Law" distribution in the chain lengths, where a few dominant chains possess significantly more qubits and therefore have a considerably larger impact on both qubit utilization and isolation. Leveraging these insights, we propose Skipper, a software technique designed to enhance the capacity and fidelity of QAs by skipping dominant chains and substituting their program qubit with two measurement outcomes. Using a 5761-qubit QA, we observed that by skipping up to eleven chains, the capacity increased by up to 59% (avg 28%), and the error decreased by up to 44% (avg 33%).
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 25-28.
Citations: 0
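One way to read "substituting their program qubit with two measurement outcomes" in the Skipper abstract above is as fixing the skipped variable to each of its two possible values and solving the two smaller problems. The sketch below shows that substitution on a toy Ising problem. This reading, the dictionary-based problem format, and the names `fix_spin` and `brute_force_min` are assumptions for illustration; the brute-force solver merely stands in for the annealer and is not part of the paper.

```python
from itertools import product


def fix_spin(h, J, k, value):
    """Fix spin k to value (+1 or -1); return (h', J', energy offset)."""
    new_h = {i: bias for i, bias in h.items() if i != k}
    new_J = {}
    for (i, j), w in J.items():
        if i == k:
            new_h[j] = new_h.get(j, 0.0) + w * value  # coupling folds into a bias
        elif j == k:
            new_h[i] = new_h.get(i, 0.0) + w * value
        else:
            new_J[(i, j)] = w
    return new_h, new_J, h.get(k, 0.0) * value


def brute_force_min(h, J):
    """Minimum of sum(h_i s_i) + sum(J_ij s_i s_j) over all spin assignments."""
    spins = sorted(set(h) | {i for edge in J for i in edge})
    best = float("inf")
    for assignment in product((-1, 1), repeat=len(spins)):
        s = dict(zip(spins, assignment))
        energy = sum(h.get(i, 0.0) * s[i] for i in spins)
        energy += sum(w * s[i] * s[j] for (i, j), w in J.items())
        best = min(best, energy)
    return best
```

Taking the better of the two sub-problem minima (each plus its offset) recovers the minimum of the original problem, while each sub-problem has one fewer variable to embed.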
Tulip: Turn-Free Low-Power Network-on-Chip
IF 2.3 | Zone 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-12-05 | DOI: 10.1109/LCA.2023.3339646
Atiyeh Gheibi-Fetrat;Negar Akbarzadeh;Shaahin Hessabi;Hamid Sarbazi-Azad
The semiconductor industry has seen significant technological advancements, leading to an increase in the number of processing cores in a system-on-chip (SoC). To facilitate communication among the numerous on-chip cores, a network-on-chip (NoC) is employed. One of the main challenges of designing NoCs is power management since the NoC consumes a significant portion of the total power of the SoC. Among the power-intensive components of the NoC, routers stand out. We observe that some power-intensive components of routers, responsible for implementing turns in the mesh topology, are underutilized compared to others. Therefore, we propose Tulip, a turn-free low-power network-on-chip, that avoids within-router turns by removing the corresponding components from the router structure. On a turn (e.g., at the end of the current dimension), Tulip forces the packet to be ejected and then reinjects it to the next dimension channel (i.e., the beginning of the path along the next dimension). Due to its deadlock-free nature, Tulip's scheme may be used orthogonally with any deterministic, partially-adaptive, and fully-adaptive routing algorithms, and can easily be extended for any n-dimensional mesh topology. Our analysis reveals that Tulip can reduce the static power and area by 24%–50% and 25%–55%, respectively, for 2D-5D mesh routers.
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 5-8.
Citations: 0
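The eject-and-reinject behavior described in the Tulip abstract above can be traced with a small sketch. It assumes deterministic dimension-order routing on an n-dimensional mesh, which is only one of the routing schemes the abstract says Tulip is compatible with; the function name `dimension_order_route` and the event labels are hypothetical and chosen for illustration.

```python
def dimension_order_route(src, dst):
    """Trace a packet from src to dst, one dimension at a time.

    Returns a list of (event, node) pairs. Wherever a conventional router
    would turn into the next dimension, the trace records an eject followed
    by a reinject at the same node instead.
    """
    pos = list(src)
    steps = [("inject", tuple(pos))]
    moved_before = False
    for dim in range(len(src)):
        if pos[dim] == dst[dim]:
            continue  # nothing to travel along this dimension
        if moved_before:
            # Dimension change: leave the network, then re-enter it.
            steps.append(("eject", tuple(pos)))
            steps.append(("reinject", tuple(pos)))
        direction = 1 if dst[dim] > pos[dim] else -1
        while pos[dim] != dst[dim]:
            pos[dim] += direction
            steps.append(("hop", tuple(pos)))
        moved_before = True
    steps.append(("eject", tuple(pos)))  # final delivery at the destination
    return steps
```

A route from (0, 0) to (2, 1) contains one eject/reinject pair at (2, 0), whereas a straight route such as (0, 0) to (3, 0) contains none.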
FPGA-Accelerated Data Preprocessing for Personalized Recommendation Systems
IF 2.3 | Zone 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-11-28 | DOI: 10.1109/LCA.2023.3336841
Hyeseong Kim;Yunjae Lee;Minsoo Rhu
Deep neural network (DNN)-based recommendation systems (RecSys) are one of the most successfully deployed machine learning applications in commercial services for predicting ad click-through rates or rankings. While numerous prior works have explored hardware and software solutions to reduce the training time of RecSys, its end-to-end training pipeline, including the data preprocessing stage, has received little attention. In this work, we provide a comprehensive analysis of RecSys data preprocessing, identifying the feature generation and normalization steps as the root cause of a major performance bottleneck. Based on our characterization, we explore the efficacy of an FPGA-accelerated RecSys preprocessing system that achieves a significant 3.4–12.1× end-to-end speedup compared to the baseline CPU-based RecSys preprocessing system.
Citations: 0
Redundant Array of Independent Memory Devices
IF 2.3 | Zone 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-11-20 | DOI: 10.1109/LCA.2023.3334989
Peiyun Wu;Trung Le;Zhichun Zhu;Zhao Zhang
Recent studies have found that DRAM reliability is an increasing concern. In this letter, we propose RAIMD (Redundant Array of Independent Memory Devices), an energy-efficient memory organization with RAID-like error protection. In this organization, each memory device works as an independent memory module that serves a whole memory request and supports error detection and error recovery. It relies on the high data rate of modern memory devices to minimize the performance impact of the increased data transfer time. RAIMD provides chip-level error protection similar to Chipkill but with significant energy savings. Our simulation results indicate that RAIMD reduces memory energy by 26.3% on average, with a small performance overhead of 5.3%, on DDR5-4800 memory systems for SPEC2017 multi-core workloads.
Citations: 0