
2021 IEEE 39th International Conference on Computer Design (ICCD): Latest Publications

Understanding and Optimizing Hybrid SSD with High-Density and Low-Cost Flash Memory
Pub Date: 2021-10-01 DOI: 10.1109/ICCD53106.2021.00046
Liang Shi, Longfei Luo, Yina Lv, Shicheng Li, Changlong Li, E. Sha
With the development of NAND flash technology, hybrid SSDs with high-density and low-cost flash memory have become the mainstream of existing SSD architectures. In this architecture, the flash memory can be dynamically switched between two modes, such as single-level cell (SLC) mode and quad-level cell (QLC) mode. Based on evaluations and analysis of multiple real devices, this paper presents two interesting findings, which demonstrate that the coordination between the two flash modes is not well designed in existing architectures. This paper proposes HyFlex, which redesigns the data placement and flash-mode management strategies of hybrid SSDs in a flexible manner. Specifically, two novel optimization strategies are proposed: velocity-based I/O scheduling (VIS) and garbage collection (GC)-aware capacity tuning (GCT). Experimental results show that HyFlex achieves encouraging performance and endurance improvements.
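To make the two strategies concrete, here is a minimal Python sketch of velocity-based placement and GC-aware capacity tuning. The class, thresholds, and tuning rule are illustrative assumptions, not the HyFlex implementation.

    class HybridPlacer:
        """Toy placer: hot pages to SLC mode, cold pages to QLC mode."""

        def __init__(self, velocity_threshold=4, slc_low_watermark=0.1):
            self.write_counts = {}              # logical page -> writes seen
            self.velocity_threshold = velocity_threshold
            self.slc_low_watermark = slc_low_watermark

        def place(self, lpn, slc_free_ratio):
            """Route one write to SLC or QLC based on update velocity."""
            self.write_counts[lpn] = self.write_counts.get(lpn, 0) + 1
            hot = self.write_counts[lpn] >= self.velocity_threshold
            if hot and slc_free_ratio > self.slc_low_watermark:
                return "SLC"                    # fast, endurance-friendly
            return "QLC"                        # dense, low-cost

        def tune_slc_capacity(self, gc_pressure, slc_ratio):
            """GC-aware tuning: shrink the SLC region under heavy GC so
            fewer QLC blocks must be reclaimed to back it."""
            if gc_pressure > 0.8:
                return max(slc_ratio - 0.05, 0.05)
            return slc_ratio

    placer = HybridPlacer()
    for _ in range(5):
        mode = placer.place(lpn=42, slc_free_ratio=0.5)
    print(mode)                                 # "SLC": page 42 became hot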
Citations: 5
HyperData: A Data Transfer Accelerator for Software Data Planes Based on Targeted Prefetching
Pub Date: 2021-10-01 DOI: 10.1109/ICCD53106.2021.00059
Hossein Golestani, T. Wenisch
Datacenter systems rely on fast, efficient I/O software stacks, called Software Data Planes (SDPs), to coordinate frequent interaction among myriad processes (or VMs) and I/O devices (NICs, SSDs, etc.). Given the impressive and ever-growing speed of today’s I/O devices and the μs-scale computations brought by hyper-tenancy and microservice-based applications, SDPs play a crucial role in overall system performance and efficiency. In this work, we aim to enhance data transfer among the SDP, I/O devices, and applications/VMs by designing the HyperData accelerator. Data items in SDP systems, such as network packets or storage blocks, are transferred through shared-memory queues. Consumer cores typically access the data from DRAM or, thanks to technologies like Intel DDIO, from the (shared) last-level cache. Today, consumers cannot effectively prefetch such data to nearer caches due to the lack of a proper arrival-notification mechanism and the complex access pattern of data buffers. HyperData is designed to perform targeted prefetching, wherein the exact data items (or a required subset) are prefetched to the L1 cache of the consumer core. Furthermore, HyperData is applicable to both core-device and core-core data communication, and it supports complex queue formats like Virtio and multi-consumer queues. HyperData is realized with a per-core programmable prefetcher, which issues the prefetch requests, and a system-level monitoring set, which watches queues for data arrival and triggers prefetch operations. We show that HyperData improves processing latency by 1.20-2.42× in a simulation of a state-of-the-art SDP, with only a few hundred bytes of per-core overhead.
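A minimal sketch of the targeted-prefetching idea on a shared-memory ring, with assumed names; prefetch() is a stub for the hardware cache-line prefetch the accelerator would issue, and the queue layout is illustrative rather than HyperData's actual format.

    CACHE_LINE = 64

    def prefetch(addr):
        """Stub: hardware would pull this line into the consumer's L1."""
        pass

    class QueueMonitor:
        def __init__(self, ring_base, slot_size, num_slots):
            self.ring_base = ring_base
            self.slot_size = slot_size
            self.num_slots = num_slots
            self.seen_tail = 0                  # last producer index handled

        def poll(self, tail):
            """On new arrivals, prefetch exactly the slots that were
            produced (targeted), rather than sweeping the whole buffer."""
            while self.seen_tail != tail:
                slot = self.seen_tail % self.num_slots
                addr = self.ring_base + slot * self.slot_size
                for off in range(0, self.slot_size, CACHE_LINE):
                    prefetch(addr + off)        # only the arrived item
                self.seen_tail += 1

    mon = QueueMonitor(ring_base=0x10000, slot_size=256, num_slots=64)
    mon.poll(tail=3)                            # prefetch slots 0..2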
Citations: 1
Accelerating Sub-Block Erase in 3D NAND Flash Memory
Pub Date: 2021-10-01 DOI: 10.1109/ICCD53106.2021.00045
H. Gong, Zhirong Shen, J. Shu
3D flash memory removes the scaling limitations of planar flash memory, yet it is still plagued by a tedious garbage collection (GC) process due to the “big block problem”. In this paper, we propose SpeedupGC, a framework that incorporates the characteristics of data updates into existing sub-block erase designs. The main idea of SpeedupGC is to steer hotly-updated data to the blocks that are about to be erased, so as to speculatively produce more invalid pages and suppress the relocation overhead. We conduct extensive trace-driven experiments, showing that, compared to state-of-the-art designs, SpeedupGC reduces GC latency by 64.7%, read latency by 21.8%, write latency by 17.7%, and write amplification by 11.5% on average.
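A minimal sketch of the placement idea, assuming a toy block model; the structures and the hot/cold policy are illustrative stand-ins, not the SpeedupGC implementation.

    class Block:
        def __init__(self, pages=256):
            self.pages = pages
            self.free = list(range(pages))
            self.valid = set()

        def write(self):
            page = self.free.pop()
            self.valid.add(page)
            return page

        def invalidate(self, page):
            self.valid.discard(page)

        def invalid_ratio(self):
            used = self.pages - len(self.free)
            return 0.0 if used == 0 else 1.0 - len(self.valid) / used

    def pick_write_block(blocks, is_hot):
        candidates = [b for b in blocks if b.free]
        if is_hot:
            # Hot update: target the block closest to erasure (highest
            # invalid ratio), so the new copy likely turns invalid
            # before GC has to relocate it.
            return max(candidates, key=Block.invalid_ratio)
        # Cold data: keep it in the emptiest block, away from GC victims.
        return max(candidates, key=lambda b: len(b.free))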
Citations: 1
Special Session: When Dataflows Converge: Reconfigurable and Approximate Computing for Emerging Neural Networks
Pub Date: 2021-10-01 DOI: 10.1109/ICCD53106.2021.00014
Di Wu, Joshua San Miguel
Deep Neural Networks (DNNs) have gained significant attention in both academia and industry due to their superior application-level accuracy. As DNNs rely on compute- or memory-intensive general matrix multiply (GEMM) operations, approximate computing has been widely explored across the computing stack to mitigate the hardware overheads. However, better-performing DNNs are emerging with growing complexity in their use of nonlinear operations, which incurs even more hardware cost. In this work, we address this challenge by proposing a reconfigurable systolic array that executes both GEMM and nonlinear operations via approximation, using distinct dataflows. Experiments demonstrate that such converging of dataflows significantly reduces the hardware cost of emerging DNN inference.
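As an illustration of serving both GEMM and an approximated nonlinearity with one multiply-accumulate primitive, here is a Python sketch; the piecewise-linear sigmoid is a generic approximation technique standing in for the paper's specific design, and the segment values are illustrative.

    def gemm(A, B):
        """Output-stationary GEMM: the inner loop is the MAC chain a
        systolic processing element performs."""
        n, k, m = len(A), len(B), len(B[0])
        C = [[0.0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                for p in range(k):
                    C[i][j] += A[i][p] * B[p][j]
        return C

    # Piecewise-linear segments (x_lo, slope, intercept) for sigmoid(x),
    # chosen so each evaluation is one multiply-add -- the same primitive
    # the GEMM datapath already provides.
    SEGMENTS = [(-6.0, 0.029, 0.177), (-2.0, 0.190, 0.5), (2.0, 0.029, 0.823)]

    def sigmoid_pwl(x):
        if x < -6.0:
            return 0.0
        if x >= 6.0:
            return 1.0
        for lo, slope, intercept in reversed(SEGMENTS):
            if x >= lo:
                return slope * x + intercept

    print(gemm([[1.0, 2.0]], [[3.0], [4.0]]))   # [[11.0]]
    print(sigmoid_pwl(0.0))                     # 0.5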
Citations: 4
POMI: Polling-Based Memory Interface for Hybrid Memory System
Pub Date: 2021-10-01 DOI: 10.1109/ICCD53106.2021.00076
Trung Le, Zhao Zhang, Zhichun Zhu
Modern conventional DRAM main memory systems no longer satisfy the growing capacity and bandwidth demands of today’s data-intensive applications. Non-Volatile Memory (NVM) has been extensively researched as an alternative to DRAM-based systems due to its higher density and non-volatility. Hybrid memory systems benefit from both DRAM and NVM technologies; however, a traditional Memory Controller (MC) cannot efficiently track and schedule operations for all the memory devices in a heterogeneous system, because the various memory technologies have different timing requirements and complex architectural support. To address this issue, we propose a hybrid memory architecture framework called POMI. It inserts a small buffer chip on each DIMM to decouple operation scheduling from the controller, enabling support for diverse memory technologies in the same system. Unlike a conventional DRAM-based system, which relies on the main MC to govern all DIMMs, POMI uses a polling-based memory bus protocol for communication and to resolve bus conflicts between memory modules. The buffer chip on each DIMM provides feedback information to the main MC, so the polling overhead is trivial. This yields several benefits: a technology-independent memory system, higher parallelism, and better scalability. Our experimental results with octa-core workloads show that POMI efficiently supports heterogeneous systems and outperforms an existing interface for hybrid memory systems by 22.0% on average for memory-intensive workloads.
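A minimal software model of the polling idea, with assumed names; device timing, arbitration details, and the feedback format are simplified away, so this only illustrates how per-DIMM buffering plus polling avoids bus conflicts.

    class BufferChip:
        """Per-DIMM buffer: handles device-specific timing locally so the
        main controller stays technology-independent."""

        def __init__(self):
            self.ready = []                # completed ops awaiting transfer

        def schedule(self, op):
            self.ready.append(op)          # device timing details elided

        def poll(self):
            """Feedback to the controller: next completed op, if any."""
            return self.ready.pop(0) if self.ready else None

    def bus_cycle(dimms, start):
        """One polling round. Only the polled chip may drive the shared
        bus, so conflicts between modules cannot occur."""
        for i in range(len(dimms)):
            chip = dimms[(start + i) % len(dimms)]
            data = chip.poll()
            if data is not None:
                return data                # transfer over the shared bus
        return None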
Citations: 1
NRHI: A Concurrent Non-Rehashing Hash Index for Persistent Memory
Pub Date: 2021-10-01 DOI: 10.1109/ICCD53106.2021.00033
Xinyu Li, Huimin Cui, Lei Liu
Persistent memory (PM), featuring data persistence, byte-addressability, and DRAM-like performance, has become commercially available with the advent of Intel® Optane™ DC persistent memory. The DRAM-like performance and disk-like persistence invite shifting hashing-based index schemes, which are important building blocks of today’s internet service infrastructures for fast queries, from DRAM onto persistent memory. Numerous hash indexes for persistent memory have been proposed to optimize writes and crash consistency, but they scale poorly under resizing. Generally, resizing consists of allocating a new hash table and rehashing items from the old table into the new one. We argue that resizing with rehashing, performed in either a blocking or a non-blocking way, degrades overall performance and limits scalability. To mitigate this limitation, this paper proposes a Non-Rehashing Hash Index (NRHI) scheme that resizes without rehashing items. NRHI leverages a layered structure to link hash tables without moving key-value pairs across layers, thus eliminating the time spent on blocking rehashing and alleviating the slot contention that occurs in non-blocking rehashing. Furthermore, the compare-and-swap primitive is utilized to support concurrent lock-free hashing operations. Experimental results on real PM hardware show that NRHI outperforms state-of-the-art PM hash indexes by 1.7× to 3.59×, and scales linearly with the number of threads.
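A minimal sketch of the layered, non-rehashing structure. Collision handling, persistence ordering, and stale-entry cleanup are omitted, and the compare-and-swap is emulated in Python, so this only illustrates how resizing avoids moving key-value pairs.

    class LayeredHash:
        def __init__(self, initial_size=8):
            self.layers = [[None] * initial_size]    # newest layer first

        def _slot(self, table, key):
            return hash(key) % len(table)

        def resize(self):
            # No rehashing: old layers stay in place and stay readable;
            # a larger empty table is simply stacked on top.
            self.layers.insert(0, [None] * (2 * len(self.layers[0])))

        def insert(self, key, value):
            table = self.layers[0]
            slot = self._slot(table, key)
            if table[slot] is None:                  # emulated CAS(None -> item)
                table[slot] = (key, value)
                return True
            return False                             # caller resizes and retries

        def lookup(self, key):
            for table in self.layers:                # newest layer wins
                item = table[self._slot(table, key)]
                if item is not None and item[0] == key:
                    return item[1]
            return None

    h = LayeredHash()
    while not h.insert("key1", 1):
        h.resize()
    print(h.lookup("key1"))                          # 1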
Citations: 0
Improved GPU Implementations of the Pair-HMM Forward Algorithm for DNA Sequence Alignment
Pub Date: 2021-10-01 DOI: 10.1109/ICCD53106.2021.00055
Enliang Li, Subho Sankar Banerjee, Sitao Huang, R. Iyer, Deming Chen
With the rise of Next-Generation Sequencing (NGS) technology, clinical sequencing services have become more accessible but are also facing new challenges. The surging demand motivates the development of more efficient algorithms for computational genomics and their hardware acceleration. In this work, we use GPUs to accelerate DNA variant calling and its related alignment problem. The Pair-Hidden Markov Model (Pair-HMM) is one of the most popular and compute-intensive models used in variant calling. As a critical part of the Pair-HMM, the forward algorithm is not only compute-intensive but also data-intensive. Multiple previous works have sought to accelerate the forward algorithm through massive parallelization of the workload. In this paper, we present advanced GPU implementations with various optimizations, such as efficient host-device communication, task parallelization, pipelining, and memory management, to tackle this challenging task. Our design shows a speedup of 783X compared to the Java baseline on an Intel single-core CPU, 31.88X compared to the C++ baseline on an IBM Power8 multicore CPU, and 1.53X to 2.21X compared to previous state-of-the-art GPU implementations over various genomics datasets.
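For reference, the Pair-HMM forward recurrence being accelerated looks like the following sketch; the transition and emission values here are illustrative, whereas real variant callers derive them from per-base quality scores.

    # Transition probabilities between Match/Insert/Delete states
    # (outgoing probabilities from each state sum to 1).
    T = {"MM": 0.9, "MI": 0.05, "MD": 0.05,
         "IM": 0.9, "II": 0.1,
         "DM": 0.9, "DD": 0.1}

    def emit(r, h, p_match=0.99):
        return p_match if r == h else (1.0 - p_match) / 3.0

    def forward(read, hap):
        n, m = len(read), len(hap)
        fM = [[0.0] * (m + 1) for _ in range(n + 1)]
        fI = [[0.0] * (m + 1) for _ in range(n + 1)]
        fD = [[0.0] * (m + 1) for _ in range(n + 1)]
        for j in range(m + 1):
            fD[0][j] = 1.0 / m          # read may start anywhere on hap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                e = emit(read[i - 1], hap[j - 1])
                fM[i][j] = e * (T["MM"] * fM[i - 1][j - 1]
                                + T["IM"] * fI[i - 1][j - 1]
                                + T["DM"] * fD[i - 1][j - 1])
                fI[i][j] = T["MI"] * fM[i - 1][j] + T["II"] * fI[i - 1][j]
                fD[i][j] = T["MD"] * fM[i][j - 1] + T["DD"] * fD[i][j - 1]
        return sum(fM[n][j] + fI[n][j] for j in range(1, m + 1))

    print(forward("ACGT", "ACGTT"))     # P(read | haplotype)

Each cell depends only on its left, upper, and upper-left neighbors, so cells on the same anti-diagonal are independent; that anti-diagonal independence is the parallelism GPU implementations exploit.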
Citations: 3
Functional Locking through Omission: From HLS to Obfuscated Design
Pub Date: 2021-10-01 DOI: 10.1109/ICCD53106.2021.00095
Z. Wang, S. Mohammed, Y. Makris, Benjamin Carrión Schäfer
VLSI design companies are now mainly fabless and spend large amounts of resources developing their Intellectual Property (IP). It is therefore paramount to protect their IPs from being stolen and illegally reverse engineered. The main approach so far has been to add locking logic such that the circuit does not meet its specifications unless the user applies the correct key. The main problem with this approach is that the fabless company has to submit the entire design, including the locking circuitry, to the fab. Moreover, these companies often subcontract the VLSI design back-end to a third party, which implies that the third-party company or the fab could potentially tamper with the locking mechanism. An alternative approach is to lock through omission. The main idea is to judiciously select a portion of the design and map it onto an embedded FPGA (eFPGA). In this case, the bitstream acts as the logic key. Neither the third-party company nor the fab has access to the locking mechanism, because the eFPGA is left unprogrammed. This is obviously a more secure way to lock the circuit. The main problem with this approach is the associated area, power, and delay overhead. To address this, we present a framework that takes as input an untimed behavioral description for High-Level Synthesis (HLS) and automatically extracts a portion of the circuit onto the eFPGA such that the area overhead is minimized while the original timing constraint is not violated. The main advantage of starting at the behavioral level is that partitioning the design at this stage allows the HLS process to fully re-optimize the circuit, thus reducing the overhead introduced by this obfuscation mechanism. We also developed a framework to test the proposed approach and plan to release it to the community, encouraging new techniques to break the proposed obfuscation method.
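A toy sketch of the partition-selection step, assuming the HLS flow can report a per-candidate eFPGA area cost, added delay, and an obfuscation score; the cost model, candidate names, and selection rule are invented for illustration only.

    def select_partition(candidates, timing_budget_ns):
        """candidates: (name, efpga_area, added_delay_ns, obfuscation).
        Keep only candidates whose slower eFPGA timing still meets the
        constraint, then prefer high obfuscation value and low area."""
        feasible = [c for c in candidates if c[2] <= timing_budget_ns]
        if not feasible:
            return None
        return max(feasible, key=lambda c: (c[3], -c[1]))

    candidates = [
        ("fsm_ctrl", 120, 0.4, 9),    # small, timing-safe, hides control flow
        ("mac_unit", 800, 1.2, 4),    # too slow for the budget below
        ("addr_gen", 200, 0.9, 6),
    ]
    print(select_partition(candidates, timing_budget_ns=1.0))   # fsm_ctrl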
Citations: 1
Compiling and Optimizing Real-world Programs for STRAIGHT ISA
Pub Date: 2021-10-01 DOI: 10.1109/ICCD53106.2021.00070
Toru Koizumi, Shu Sugita, Ryota Shioya, J. Kadomoto, H. Irie, S. Sakai
The renaming unit of a superscalar processor is a very expensive module; it consumes a large amount of power and limits front-end bandwidth. To overcome this problem, an instruction set architecture called STRAIGHT has been proposed. Owing to its unique manner of referencing operands, STRAIGHT does not cause false dependencies and allows out-of-order execution without register renaming. However, compiler optimization techniques for STRAIGHT are still immature, and we found that the naive code generators currently available produce inefficient code containing additional instructions. In this paper, we propose two novel compiler optimization techniques and a novel calling convention for STRAIGHT that reduce the number of instructions. We compiled real-world programs with a compiler implementing these techniques and measured their performance through simulation. The evaluation results show that the proposed methods reduce the number of executed instructions by 15% and improve performance by 17%.
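STRAIGHT names each source operand by its distance to the producing instruction, and every result is write-once, which is why no renaming is needed. A minimal sketch of that lowering, assuming a toy three-address IR as input:

    def to_straight(instrs):
        """instrs: list of (dst, op, srcs) in program order.
        Returns (op, distances); each distance says how many
        instructions earlier the source operand was produced."""
        last_def = {}                   # register name -> producer index
        out = []
        for idx, (dst, op, srcs) in enumerate(instrs):
            out.append((op, [idx - last_def[s] for s in srcs]))
            last_def[dst] = idx         # every result is write-once
        return out

    prog = [
        ("r1", "load", []),
        ("r2", "load", []),
        ("r3", "add", ["r1", "r2"]),
        ("r4", "mul", ["r3", "r1"]),
    ]
    for op, dists in to_straight(prog):
        print(op, dists)                # add [2, 1]; mul [1, 3]

One typical source of the extra instructions mentioned above is that a value needed again much later must be refreshed with additional move instructions; reducing such instruction overhead is what the proposed optimizations and calling convention target.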
Citations: 3
Improving the Heavy Re-encryption Overhead of Split Counter Mode Encryption for NVM
Pub Date: 2021-10-01 DOI: 10.1109/ICCD53106.2021.00073
Qianqian Pei, Seunghee Shin
Emerging non-volatile memory technology enables non-volatile main memory (NVMM), which can provide larger capacity and better energy-saving opportunities than DRAM. However, its non-volatility raises security concerns: the data in an NVMM can be extracted if the memory is stolen. Therefore, the data must stay encrypted outside the processor boundary. Such encryption requires decryption before the data is used by the processor, adding extra latency to performance-critical read operations. Split counter mode encryption hides this latency but, as a trade-off, introduces frequent page re-encryptions. We find that this re-encryption overhead worsens on NVMM, whose slow latency negates prior optimizations. To mitigate the overhead, we re-design the encryption scheme based on two key observations. First, we observe that an NVMM only needs counters that can count up to twice its write lifetime. Second, we observe diminishing returns on the counter size: increasing the counter size further does not necessarily decrease the re-encryption frequency. Our new designs re-arrange those inefficiently used bits to reduce the re-encryption overhead. In our tests, the two designs, 3-level split counter mode encryption and 8-block split counter mode encryption, effectively reduce the re-encryption overhead by 63% and 66%, improving performance over the original split counter scheme by up to 26% and 30%, and by 8% and 9% on average.
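A minimal sketch of the baseline split-counter bookkeeping the paper improves, for one 4 KB page of 64-byte blocks; the counter widths are illustrative and the AES-CTR pad is replaced by a hash stub.

    import hashlib

    BLOCKS_PER_PAGE = 64                # 4 KB page / 64 B blocks
    MINOR_BITS = 7                      # illustrative minor-counter width

    def pad(addr, major, minor):
        """Stand-in for the AES-CTR pad; the seed binds address and both
        counters, so bumping the major counter changes every block's pad."""
        return hashlib.sha256(f"{addr}:{major}:{minor}".encode()).digest()

    class PageCounters:
        def __init__(self):
            self.major = 0              # one large counter per page
            self.minor = [0] * BLOCKS_PER_PAGE   # one small counter per block

        def on_write(self, block):
            self.minor[block] += 1
            if self.minor[block] >= (1 << MINOR_BITS):
                # Minor-counter overflow: bump the shared major counter
                # and re-encrypt EVERY block of the page under it. This
                # is the re-encryption overhead the paper attacks.
                self.major += 1
                self.minor = [0] * BLOCKS_PER_PAGE
                return "reencrypt_page"
            return "ok"

    pc = PageCounters()
    for _ in range(1 << MINOR_BITS):
        status = pc.on_write(block=3)
    print(status)                       # "reencrypt_page" on write 128

The paper's 3-level and 8-block designs re-arrange these counter bits so that the overflow path above, and hence the whole-page re-encryption, triggers less often.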
Citations: 1