Latest Publications in IEEE Computer Architecture Letters

Containerized In-Storage Processing Model and Hardware Acceleration for Fully-Flexible Computational SSDs
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-06-27 | DOI: 10.1109/lca.2023.3289828
Donghyun Gouk, Miryeong Kwon, Hanyeoreum Bae, Myoungsoo Jung
In-storage processing (ISP) efficiently examines large datasets but faces performance and security challenges. We introduce DockerSSD, a flexible ISP model that runs various applications near flash without modification. It employs lightweight OS-level virtualization in modern SSDs for faster ISP and better storage intelligence with high flexibility. DockerSSD reuses existing Docker container images for real-time data processing without altering the storage interface or runtime. Our design includes a new communication method and virtual firmware, alongside automated container-related network and I/O handling hardware. DockerSSD achieves a 2× speed improvement and reduces system-level power by 35.7% on average.
{"title":"Containerized In-Storage Processing Model and Hardware Acceleration for Fully-Flexible Computational SSDs","authors":"Donghyun Gouk, Miryeong Kwon, Hanyeoreum Bae, Myoungsoo Jung","doi":"10.1109/lca.2023.3289828","DOIUrl":"https://doi.org/10.1109/lca.2023.3289828","url":null,"abstract":"In-storage processing (ISP) efficiently examines large datasets but faces performance and security challenges. We introduce DockerSSD, a flexible ISP model that runs various applications near flash without modification. It employs lightweight OS-level virtualization in modern SSDs for faster ISP and better storage intelligence with a high flexiblity. DockerSSD reuses existing Docker container images for real-time data processing without altering the storage interface or runtime. Our design includes a new communication method and virtual firmware, alongside automated container-related network and I/O handling hardware. DockerSSD achieves a 2× speed improvement and reduces system-level power by 35.7%, on average.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"26 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142183968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Guard Cache: Creating Noisy Side-Channels
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-06-27 | DOI: 10.1109/LCA.2023.3289710
Fernando Mosquera;Krishna Kavi;Gayatri Mehta;Lizy John
Microarchitectural innovations such as deep cache hierarchies, out-of-order execution, branch prediction and speculative execution have made possible the design of processors that meet ever-increasing demands for performance. However, these innovations have inadvertently introduced vulnerabilities, which are exploited by side-channel attacks and attacks relying on speculative execution. Mitigating the attacks while preserving performance has been a challenge. In this letter we present an approach to obfuscate cache timing, making it more difficult for side-channel attacks to succeed. We create false cache hits using a small Guard Cache with randomization, and false cache misses by randomly evicting cache lines. We show that our false hits and false misses cause very minimal performance penalties, and our obfuscation can make it difficult for common side-channel attacks such as Prime & Probe, Flush & Reload, or Evict & Time to succeed.
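The false-hit/false-miss idea can be sketched in a few lines: a toy direct-mapped cache that randomly evicts lines, so a Prime & Probe attacker observes misses the victim never caused. The cache geometry, eviction probability, and code below are illustrative assumptions, not the letter's implementation.

```python
import random

class NoisyCache:
    """Toy direct-mapped cache that injects false misses by randomly
    evicting lines, in the spirit of the Guard Cache obfuscation.
    All parameters are illustrative, not taken from the letter."""

    def __init__(self, num_sets=8, evict_prob=0.1, seed=0):
        self.lines = [None] * num_sets
        self.evict_prob = evict_prob
        self.rng = random.Random(seed)

    def access(self, addr):
        s = addr % len(self.lines)
        hit = self.lines[s] == addr
        self.lines[s] = addr             # install / refresh the line
        # Obfuscation: occasionally evict a random line, so a later
        # probe may observe a miss that no victim access caused.
        if self.rng.random() < self.evict_prob:
            self.lines[self.rng.randrange(len(self.lines))] = None
        return hit

cache = NoisyCache()
cache.access(3)                          # cold miss installs the line
hits = sum(cache.access(3) for _ in range(100))
print(hits)                              # may be < 100: injected misses add noise
```

A side-channel attacker measuring these probes can no longer attribute every miss to the victim, which is exactly the ambiguity the obfuscation aims for.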
{"title":"Guard Cache: Creating Noisy Side-Channels","authors":"Fernando Mosquera;Krishna Kavi;Gayatri Mehta;Lizy John","doi":"10.1109/LCA.2023.3289710","DOIUrl":"10.1109/LCA.2023.3289710","url":null,"abstract":"Microarchitectural innovations such as deep cache hierarchies, out-of-order execution, branch prediction and speculative execution have made possible the design of processors that meet ever-increasing demands for performance. However, these innovations have inadvertently introduced vulnerabilities, which are exploited by side-channel attacks and attacks relying on speculative executions. Mitigating the attacks while preserving the performance has been a challenge. In this letter we present an approach to obfuscate cache timing, making it more difficult for side-channel attacks to succeed. We create \u0000<italic>false cache hits</i>\u0000 using a small \u0000<italic>Guard Cache</i>\u0000 with randomization, and \u0000<italic>false cache misses</i>\u0000 by randomly evicting cache lines. We show that our \u0000<italic>false hits</i>\u0000 and \u0000<italic>false misses</i>\u0000 cause very minimal performance penalties and our obfuscation can make it difficult for common side-channel attacks such as Prime &Probe, Flush &Reload or Evict &Time to succeed.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"97-100"},"PeriodicalIF":2.3,"publicationDate":"2023-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41685733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
TURBULENCE: Complexity-Effective Out-of-Order Execution on GPU With Distance-Based ISA
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-06-26 | DOI: 10.1109/LCA.2023.3289317
Reoma Matsuo;Toru Koizumi;Hidetsugu Irie;Shuichi Sakai;Ryota Shioya
A graphics processing unit (GPU) is a processor that achieves high throughput by exploiting data parallelism. We found that many GPU workloads also contain instruction-level parallelism that can be extracted through out-of-order execution to provide additional performance improvement opportunities. We propose the TURBULENCE architecture for very low-cost out-of-order execution on GPUs. TURBULENCE consists of a novel ISA that introduces the concept of referencing operands by inter-instruction distance instead of register numbers, and a novel microarchitecture that executes this ISA. The distance-based operand has the property of not causing false dependencies. By exploiting this property, we achieve cost-effective out-of-order execution on GPUs without introducing expensive hardware such as rename logic and a load-store queue. Simulation results show that TURBULENCE improves performance by 17.6% without increasing energy consumption over an existing GPU.
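The distance-based operand idea can be illustrated by rewriting a register-based instruction stream so that each source names how many instructions back its producer is. The tiny encoder and sample program below are illustrative assumptions, not the TURBULENCE ISA itself.

```python
def to_distance_form(instrs):
    """Rewrite (dest_reg, src_regs) instructions so each source is the
    distance back to its producing instruction (1 = previous one).
    Because values are named by position rather than register number,
    re-using a register name cannot create a false dependency."""
    last_writer = {}                 # reg -> index of its last producer
    out = []
    for i, (dst, srcs) in enumerate(instrs):
        out.append(tuple(i - last_writer[s] for s in srcs))
        last_writer[dst] = i
    return out

prog = [
    ("r1", ()),              # i0: r1 = ...
    ("r2", ()),              # i1: r2 = ...
    ("r3", ("r1", "r2")),    # i2: r3 = r1 op r2
    ("r4", ("r3", "r1")),    # i3: r4 = r3 op r1
    ("r1", ("r4", "r2")),    # i4: r1 = r4 op r2 (overwrites r1)
]
print(to_distance_form(prog))  # [(), (), (2, 1), (1, 3), (1, 3)]
```

Note that i4 overwriting r1 needs no renaming: later consumers would reference i4 by distance, while i3's reference to the old r1 (distance 3) is unambiguous.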
{"title":"TURBULENCE: Complexity-Effective Out-of-Order Execution on GPU With Distance-Based ISA","authors":"Reoma Matsuo;Toru Koizumi;Hidetsugu Irie;Shuichi Sakai;Ryota Shioya","doi":"10.1109/LCA.2023.3289317","DOIUrl":"10.1109/LCA.2023.3289317","url":null,"abstract":"A graphics processing unit (GPU) is a processor that achieves high throughput by exploiting data parallelism. We found that many GPU workloads also contain instruction-level parallelism that can be extracted through out-of-order execution to provide additional performance improvement opportunities. We propose the TURBULENCE architecture for very low-cost out-of-order execution on GPUs. TURBULENCE consists of a novel ISA that introduces the concept of referencing operands by inter-instruction distance instead of register numbers, and a novel microarchitecture that executes the novel ISA. This distance-based operand has the property of not causing false dependencies. By exploiting this property, we achieve cost-effective out-of-order execution on GPUs without introducing expensive hardware such as a rename logic and a load-store queue. Simulation results show that TURBULENCE improves performance by 17.6% without increasing energy consumption over an existing GPU.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"175-178"},"PeriodicalIF":1.4,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142183933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Toward Practical 128-Bit General Purpose Microarchitectures
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-06-20 | DOI: 10.1109/LCA.2023.3287762
Chandana S. Deshpande;Arthur Perais;Frédéric Pétrot
Intel introduced 5-level paging mode to support a 57-bit virtual address space in 2017. This, coupled with paradigms where backup storage can be accessed through load and store instructions (e.g., non-volatile memories), lets us envision a future in which a 64-bit address space has become insufficient. In that event, the straightforward solution would be to adopt a flat 128-bit address space. In this early stage letter, we conduct high-level experiments that lead us to suggest a possible general-purpose processor microarchitecture providing 128-bit support with limited hardware cost.
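For scale: x86-64's 5-level paging covers 57 bits using 9 index bits per level over 4 KiB pages. Extending that same radix-512 page walk to a flat 128-bit space is a quick back-of-the-envelope exercise — a hypothetical extrapolation, not the microarchitecture the letter proposes.

```python
PAGE_OFFSET_BITS = 12    # 4 KiB pages
BITS_PER_LEVEL = 9       # 512-entry tables, one page each

def split_vaddr(vaddr, levels):
    """Decompose a virtual address into page-walk indices + offset,
    top level first, mirroring a radix-tree page walk."""
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    v = vaddr >> PAGE_OFFSET_BITS
    idx = []
    for _ in range(levels):
        idx.append(v & ((1 << BITS_PER_LEVEL) - 1))
        v >>= BITS_PER_LEVEL
    return idx[::-1], offset

# 5 levels cover 12 + 5*9 = 57 bits (Intel's LA57 mode);
# a full 128-bit space would need ceil((128 - 12) / 9) = 13 levels.
levels_128 = -(-(128 - PAGE_OFFSET_BITS) // BITS_PER_LEVEL)
print(levels_128)  # 13
```

A 13-level walk on every TLB miss would be prohibitive, which hints at why 128-bit support calls for microarchitectural rethinking rather than naive extension.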
{"title":"Toward Practical 128-Bit General Purpose Microarchitectures","authors":"Chandana S. Deshpande;Arthur Perais;Frédéric Pétrot","doi":"10.1109/LCA.2023.3287762","DOIUrl":"10.1109/LCA.2023.3287762","url":null,"abstract":"Intel introduced 5-level paging mode to support 57-bit virtual address space in 2017. This, coupled to paradigms where backup storage can be accessed through load and store instructions (e.g., non volatile memories), lets us envision a future in which a 64-bit address space has become insufficient. In that event, the straightforward solution would be to adopt a flat 128-bit address space. In this early stage letter, we conduct high-level experiments that lead us to suggest a possible general-purpose processor micro-architecture providing 128-bit support with limited hardware cost.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"81-84"},"PeriodicalIF":2.3,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48523174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
DVFaaS: Leveraging DVFS for FaaS Workflows
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-06-20 | DOI: 10.1109/LCA.2023.3288089
Achilleas Tzenetopoulos;Dimosthenis Masouros;Dimitrios Soudris;Sotirios Xydis
In this letter, we propose DVFaaS, a per-core DVFS framework that utilizes control systems theory to assign just-enough frequency for the purpose of addressing the QoS requirements on serverless workflows comprising unseen functions. DVFaaS exploits the intermittent nature of serverless workflows, which enables staged control on distinguishable functions, which jointly contribute to the end-to-end latency. Our results show that DVFaaS considerably outperforms related work, reducing power consumption by up to 22%, with 2x fewer QoS violations.
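The "just-enough frequency" idea amounts to closed-loop control per function stage. The sketch below uses a plain proportional controller and a toy latency model; the gain, frequency range, and workload are assumptions for illustration, not the DVFaaS controller design.

```python
F_MIN, F_MAX = 1.0, 3.0        # available frequency range, GHz (illustrative)
GAIN = 0.2                     # proportional gain (illustrative)

def next_freq(freq, latency, target):
    """Raise frequency when measured latency exceeds its QoS target,
    lower it when there is slack -- 'just enough' frequency."""
    error = (latency - target) / target
    return min(F_MAX, max(F_MIN, freq + GAIN * error))

work = 2.0                     # toy function: latency = work / freq
freq, target = F_MIN, 1.0      # meeting the target requires 2.0 GHz
for _ in range(50):
    freq = next_freq(freq, work / freq, target)
print(round(freq, 2))          # settles near 2.0 GHz
```

Running such a loop per core lets each stage of a workflow sit at the lowest frequency that still meets the end-to-end deadline, which is where the power savings come from.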
{"title":"DVFaaS: Leveraging DVFS for FaaS Workflows","authors":"Achilleas Tzenetopoulos;Dimosthenis Masouros;Dimitrios Soudris;Sotirios Xydis","doi":"10.1109/LCA.2023.3288089","DOIUrl":"10.1109/LCA.2023.3288089","url":null,"abstract":"In this letter, we propose \u0000<italic>DVFaaS</i>\u0000, a per-core DVFS framework that utilizes control systems theory to assign \u0000<italic>just-enough</i>\u0000 frequency for the purpose of addressing the QoS requirements on serverless workflows comprising unseen functions. \u0000<italic>DVFaaS</i>\u0000 exploits the intermittent nature of serverless workflows, which enables staged control on distinguishable functions, which jointly contribute to the end-to-end latency. Our results show that \u0000<italic>DVFaaS</i>\u0000 considerably outperforms related work, reducing power consumption by up to 22%, with 2x fewer QoS violations.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"85-88"},"PeriodicalIF":2.3,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46172046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Design of a High-Performance, High-Endurance Key-Value SSD for Large-Key Workloads
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-06-02 | DOI: 10.1109/LCA.2023.3282276
Chanyoung Park;Chun-Yi Liu;Kyungtae Kang;Mahmut Kandemir;Wonil Choi
Current KV-SSD design assumes a specific range of typical workloads, where values are quite large while keys are relatively small. However, we find that (i) there exists another spectrum of workloads whose key sizes are relatively large compared to their value sizes, and (ii) the current KV-SSD design suffers from long tail latencies and low storage utilization under such large-key workloads. To this end, we present a novel KV-SSD design (called LK-SSD) that reduces tail latencies and increases storage utilization under large-key workloads, and add an enhancement to it for longer device lifetime. Through extensive experiments, we show that LK-SSD is a more suitable design for large-key workloads while remaining suitable for typical workloads.
{"title":"Design of a High-Performance, High-Endurance Key-Value SSD for Large-Key Workloads","authors":"Chanyoung Park;Chun-Yi Liu;Kyungtae Kang;Mahmut Kandemir;Wonil Choi","doi":"10.1109/LCA.2023.3282276","DOIUrl":"https://doi.org/10.1109/LCA.2023.3282276","url":null,"abstract":"Current KV-SSD design assumes a specific range of typical workloads, where the size of values is quite large while that of keys is relatively small. However, we find that (i) there exist another spectrum of workloads, whose key sizes are relatively large, compared to their value sizes, and (ii) the current KV-SSD design suffers from long tail latencies and low storage utilization under such large-key workloads. To this end, we present novel design of a KV-SSD (called LK-SSD), which can reduce tail latences and increase storage utilization under large-key workloads, and add an enhancement to it for longer device lifetime. Through extensive experiments, we show that LK-SSD is more suitable design for the large-key workloads, and also available for the typical workloads.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"149-152"},"PeriodicalIF":2.3,"publicationDate":"2023-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49962230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Kobold: Simplified Cache Coherence for Cache-Attached Accelerators
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-04-21 | DOI: 10.1109/LCA.2023.3269399
Jennifer Brana;Brian C. Schwedock;Yatin A. Manerkar;Nathan Beckmann
The ever-increasing cost of data movement in computer systems is driving a new era of data-centric computing. One of the most common data-centric paradigms is near-data computing (NDC), where accelerators are placed inside the memory hierarchy to avoid the costly transfer of data to the core. NDC systems show immense potential to improve performance and energy efficiency. Unfortunately, adding accelerators into the memory hierarchy incurs significant complexity for system integration because accelerators often require cache-coherent access to memory. The complex coherence protocols required to handle both cores and cache-attached accelerators result in significantly higher verification costs as well as an increase in directory state and on-chip network traffic. Furthermore, these mechanisms can cause cache pollution and worsen baseline processor performance. To simplify the integration of cache-attached accelerators, we present Kobold, a new coherence protocol and implementation which restricts the added complexity of an accelerator to its local tile. Kobold introduces a new directory structure within the L2 cache to track the accelerator's private cache and maintain coherence between the core and accelerator. A minor modification to the LLC protocol also enables accelerators to improve performance by bypassing the local L2. We verified Kobold's stable-state coherence protocols using the Murphi model checker and estimated area overhead using Cacti 7. Kobold simplifies integration of cache-attached accelerators, adds only 0.09% area over the baseline caches, and provides clear performance advantages versus naïve extensions of existing directory coherence protocols.
{"title":"Kobold: Simplified Cache Coherence for Cache-Attached Accelerators","authors":"Jennifer Brana;Brian C. Schwedock;Yatin A. Manerkar;Nathan Beckmann","doi":"10.1109/LCA.2023.3269399","DOIUrl":"10.1109/LCA.2023.3269399","url":null,"abstract":"The ever-increasing cost of data movement in computer systems is driving a new era of data-centric computing. One of the most common data-centric paradigms is near-data computing (NDC), where accelerators are placed \u0000<italic>inside</i>\u0000 the memory hierarchy to avoid the costly transfer of data to the core. NDC systems show immense potential to improve performance and energy efficiency. Unfortunately, adding accelerators into the memory hierarchy incurs significant complexity for system integration because accelerators often require cache-coherent access to memory. The complex coherence protocols required to handle both cores and cache-attached accelerators result in significantly higher verification costs as well as an increase in directory state and on-chip network traffic. Furthermore, these mechanisms can cause cache pollution and worsen baseline processor performance. To simplify the integration of cache-attached accelerators, we present Kobold, a new coherence protocol and implementation which restricts the added complexity of an accelerator to its local tile. Kobold introduces a new directory structure within the L2 cache to track the accelerator's private cache and maintain coherence between the core and accelerator. A minor modification to the LLC protocol also enables accelerators to improve performance by bypassing the local L2. We verified Kobold's stable-state coherence protocols using the Murphi model checker and estimated area overhead using Cacti 7. 
Kobold simplifies integration of cache-attached accelerators, adds only 0.09% area over the baseline caches, and provides clear performance advantages versus naïve extensions of existing directory coherence protocols.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 1","pages":"41-44"},"PeriodicalIF":2.3,"publicationDate":"2023-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43340299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Canal: A Flexible Interconnect Generator for Coarse-Grained Reconfigurable Arrays
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-04-19 | DOI: 10.1109/LCA.2023.3268126
Jackson Melchert;Keyi Zhang;Yuchen Mei;Mark Horowitz;Christopher Torng;Priyanka Raina
The architecture of a coarse-grained reconfigurable array (CGRA) interconnect has a significant effect on not only the flexibility of the resulting accelerator, but also its power, performance, and area. Design decisions that have complex trade-offs need to be explored to maintain efficiency and performance across a variety of evolving applications. This paper presents Canal, a Python-embedded domain-specific language (eDSL) and compiler for specifying and generating reconfigurable interconnects for CGRAs. Canal uses a graph-based intermediate representation (IR) that allows for easy hardware generation and tight integration with place and route tools. We evaluate Canal by constructing both a fully static interconnect and a hybrid interconnect with ready-valid signaling, and by conducting design space exploration of the interconnect architecture by modifying the switch box topology, the number of routing tracks, and the interconnect tile connections. Through the use of a graph-based IR for CGRA interconnects, the eDSL, and the interconnect generation system, Canal enables fast design space exploration and creation of CGRA interconnects.
{"title":"Canal: A Flexible Interconnect Generator for Coarse-Grained Reconfigurable Arrays","authors":"Jackson Melchert;Keyi Zhang;Yuchen Mei;Mark Horowitz;Christopher Torng;Priyanka Raina","doi":"10.1109/LCA.2023.3268126","DOIUrl":"10.1109/LCA.2023.3268126","url":null,"abstract":"The architecture of a coarse-grained reconfigurable array (CGRA) interconnect has a significant effect on not only the flexibility of the resulting accelerator, but also its power, performance, and area. Design decisions that have complex trade-offs need to be explored to maintain efficiency and performance across a variety of evolving applications. This paper presents Canal, a Python-embedded domain-specific language (eDSL) and compiler for specifying and generating reconfigurable interconnects for CGRAs. Canal uses a graph-based intermediate representation (IR) that allows for easy hardware generation and tight integration with place and route tools. We evaluate Canal by constructing both a fully static interconnect and a hybrid interconnect with ready-valid signaling, and by conducting design space exploration of the interconnect architecture by modifying the switch box topology, the number of routing tracks, and the interconnect tile connections. 
Through the use of a graph-based IR for CGRA interconnects, the eDSL, and the interconnect generation system, Canal enables fast design space exploration and creation of CGRA interconnects.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 1","pages":"45-48"},"PeriodicalIF":2.3,"publicationDate":"2023-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43724888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
SmartIndex: Learning to Index Caches to Improve Performance
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-04-05 | DOI: 10.1109/LCA.2023.3264478
Kevin Weston;Farabi Mahmud;Vahid Janfaza;Abdullah Muzahid
Modern computers rely heavily on caches to achieve higher performance. Unfortunately, a cache indexing scheme can often cause an uneven distribution of addresses across cache sets resulting in many evictions of useful cache blocks. To address this issue, we propose SmartIndex, a self-optimized indexing scheme that leverages machine learning to actively learn the memory access pattern and dynamically adjust indexes to evenly distribute the cache lines across all sets in the cache, thereby reducing cache misses. Experimental results on a set of 26 memory-intensive applications show that for non-uniform applications, SmartIndex can reduce the misses per kilo instructions (MPKI) of a direct mapped cache by up to 39%, translating into an IPC speedup of 7.23% compared to the conventional power-of-two indexing scheme. Our experiments also show that SmartIndex can work with any cache associativity.
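The uneven-distribution problem is easy to reproduce: a strided access pattern can collapse onto a single set under conventional power-of-two (modulo) indexing, while even a simple XOR-folded index — used here purely as a stand-in for SmartIndex's learned indexing, which it does not reproduce — spreads the same pattern across many sets.

```python
NUM_SETS = 64

def mod_index(addr):
    # conventional power-of-two indexing: low-order set bits only
    return addr % NUM_SETS

def xor_index(addr):
    # fold higher address bits into the index (illustrative stand-in,
    # not the learned index SmartIndex actually trains)
    return (addr ^ (addr >> 6)) % NUM_SETS

# stride-128 pattern: every address lands in set 0 under mod_index
addrs = [i * 128 for i in range(4096)]

def sets_used(index_fn):
    return len({index_fn(a) for a in addrs})

print(sets_used(mod_index), sets_used(xor_index))  # 1 32
```

With modulo indexing all 4096 accesses fight over one set and evict each other; the folded index uses 32 sets, which is the kind of redistribution a learned index can push further by adapting to the observed access pattern.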
{"title":"SmartIndex: Learning to Index Caches to Improve Performance","authors":"Kevin Weston;Farabi Mahmud;Vahid Janfaza;Abdullah Muzahid","doi":"10.1109/LCA.2023.3264478","DOIUrl":"10.1109/LCA.2023.3264478","url":null,"abstract":"Modern computers rely heavily on caches to achieve higher performance. Unfortunately, a cache indexing scheme can often cause an uneven distribution of addresses across cache sets resulting in many evictions of useful cache blocks. To address this issue, we propose \u0000<sc>SmartIndex</small>\u0000, a self-optimized indexing scheme that leverages machine learning to actively learn the memory access pattern and dynamically adjust indexes to evenly distribute the cache lines across all sets in the cache, thereby reducing cache misses. Experimental results on a set of 26 memory-intensive applications show that for non-uniform applications, \u0000<sc>SmartIndex</small>\u0000 can reduce the misses per kilo instructions (MPKI) of a direct mapped cache by up to 39%, translating into an IPC speedup of 7.23% compared to the conventional power-of-two indexing scheme. Our experiments also show that \u0000<sc>SmartIndex</small>\u0000 can work with any cache associativity.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 1","pages":"33-36"},"PeriodicalIF":2.3,"publicationDate":"2023-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48921816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
An Intermediate Language for General Sparse Format Customization
IF 2.3 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-03-28 | DOI: 10.1109/LCA.2023.3262610
Jie Liu;Zhongyuan Zhao;Zijian Ding;Benjamin Brock;Hongbo Rong;Zhiru Zhang
The inevitable trend of hardware specialization drives an increasing use of custom data formats in processing sparse workloads, which are typically memory-bound. These formats facilitate the automated generation of target-aware data layouts to improve memory access latency and bandwidth utilization. However, existing sparse tensor programming models and compilers offer little or no support for productively customizing the sparse formats. Moreover, since these frameworks adopt an attribute-based approach for format abstraction, they cannot easily be extended to support general format customization. To overcome this deficiency, we propose UniSparse, an intermediate language that provides a unified abstraction for representing and customizing sparse formats. We also develop a compiler leveraging the MLIR infrastructure, which supports adaptive customization of formats. We demonstrate the efficacy of our approach through experiments running commonly-used sparse linear algebra operations with hybrid formats on multiple different hardware targets, including an Intel CPU, an NVIDIA GPU, and a simulated processing-in-memory (PIM) device.
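As a concrete reference point for what a sparse format specification must capture, here is plain CSR built as a per-dimension composition: the row dimension stays dense, the column dimension is compressed. The encoding below is standard CSR, not UniSparse's IR syntax.

```python
def to_csr(dense):
    """Compress a dense matrix into CSR: one indptr slot per row
    (dense outer dimension), and only nonzero column coordinates
    stored (compressed inner dimension)."""
    indptr, indices, data = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data

m = [[0, 2, 0],
     [1, 0, 3],
     [0, 0, 0]]
print(to_csr(m))  # ([0, 1, 3, 3], [1, 0, 2], [2, 1, 3])
```

Swapping which dimensions are dense versus compressed yields CSC, COO, and friends — the family of choices a format-customization IR has to express uniformly.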
{"title":"An Intermediate Language for General Sparse Format Customization","authors":"Jie Liu;Zhongyuan Zhao;Zijian Ding;Benjamin Brock;Hongbo Rong;Zhiru Zhang","doi":"10.1109/LCA.2023.3262610","DOIUrl":"https://doi.org/10.1109/LCA.2023.3262610","url":null,"abstract":"The inevitable trend of hardware specialization drives an increasing use of custom data formats in processing sparse workloads, which are typically memory-bound. These formats facilitate the automated generation of target-aware data layouts to improve memory access latency and bandwidth utilization. However, existing sparse tensor programming models and compilers offer little or no support for productively customizing the sparse formats. Moreover, since these frameworks adopt an attribute-based approach for format abstraction, they cannot easily be extended to support general format customization. To overcome this deficiency, we propose UniSparse, an intermediate language that provides a unified abstraction for representing and customizing sparse formats. We also develop a compiler leveraging the MLIR infrastructure, which supports adaptive customization of formats. 
We demonstrate the efficacy of our approach through experiments running commonly-used sparse linear algebra operations with hybrid formats on multiple different hardware targets, including an Intel CPU, an NVIDIA GPU, and a simulated processing-in-memory (PIM) device.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"153-156"},"PeriodicalIF":2.3,"publicationDate":"2023-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49962233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0