
Latest publications: 2021 IEEE 39th International Conference on Computer Design (ICCD)

Understanding and Optimizing Hybrid SSD with High-Density and Low-Cost Flash Memory
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00046
Liang Shi, Longfei Luo, Yina Lv, Shicheng Li, Changlong Li, E. Sha
With the development of NAND flash technology, hybrid SSDs with high-density and low-cost flash memory have become the mainstream of existing SSD architectures. In this architecture, two flash modes can be dynamically switched, such as single-level cell (SLC) mode and quad-level cell (QLC) mode. Based on evaluations and analysis of multiple real devices, this paper presents two interesting findings. They demonstrate that the coordination between the two flash modes is not well designed in existing architectures. This paper proposes HyFlex, which redesigns the data-placement and flash-mode-management strategies of hybrid SSDs in a flexible manner. Specifically, two novel optimization strategies are proposed: velocity-based I/O scheduling (VIS) and garbage collection (GC)-aware capacity tuning (GCT). Experimental results show that HyFlex achieves encouraging performance and endurance improvements.
Citations: 5
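The SLC/QLC data-placement trade-off the abstract describes can be illustrated with a minimal sketch. The thresholds, capacity, and names below are hypothetical assumptions for illustration, not details taken from HyFlex:

```python
# Hypothetical data-placement sketch for a hybrid SLC/QLC SSD; the
# thresholds, capacity, and names are illustrative, not from HyFlex.

SLC_CAPACITY_PAGES = 4        # small, fast, high-endurance SLC-mode region
HOT_UPDATE_THRESHOLD = 3      # updates before a logical page counts as hot

def place_page(update_count, slc_used_pages):
    """Pick the flash mode for a write: hot pages go to SLC while it has room."""
    if update_count >= HOT_UPDATE_THRESHOLD and slc_used_pages < SLC_CAPACITY_PAGES:
        return "SLC"          # absorb hot writes in the fast region
    return "QLC"              # cold data goes to the dense, low-cost region
```

Steering hot writes to SLC reduces QLC program/erase stress, which is the kind of coordination between the two modes the paper targets.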
HyperData: A Data Transfer Accelerator for Software Data Planes Based on Targeted Prefetching
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00059
Hossein Golestani, T. Wenisch
Datacenter systems rely on fast, efficient I/O software stacks—Software Data Planes (SDPs)—to coordinate frequent interaction among myriad processes (or VMs) and I/O devices (NICs, SSDs, etc.). Given the impressive and ever-growing speed of today’s I/O devices, and the μs-scale computations driven by hyper-tenancy and microservice-based applications, SDPs play a crucial role in overall system performance and efficiency. In this work, we aim to enhance data transfer among the SDP, I/O devices, and applications/VMs by designing the HyperData accelerator. Data items in SDP systems, such as network packets or storage blocks, are transferred through shared-memory queues. Consumer cores typically access the data from DRAM or, thanks to technologies like Intel DDIO, from the (shared) last-level cache. Today, consumers cannot effectively prefetch such data to nearer caches due to the lack of a proper arrival-notification mechanism and the complex access pattern of data buffers. HyperData is designed to perform targeted prefetching, wherein the exact data items (or a required subset) are prefetched to the L1 cache of the consumer core. Furthermore, HyperData is applicable to both core–device and core–core data communication, and it supports complex queue formats like Virtio and multi-consumer queues. HyperData is realized with a per-core programmable prefetcher, which issues the prefetch requests, and a system-level monitoring set, which monitors queues for data arrival and triggers prefetch operations. We show that HyperData improves processing latency by 1.20-2.42× in a simulation of a state-of-the-art SDP, with only a few hundred bytes of per-core overhead.
Citations: 1
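The arrival-monitoring and targeted-prefetch idea can be modeled abstractly. This toy Python queue (all names invented here, not HyperData's interface) records which items a monitor would have prefetched into the consumer core's L1 before the consumer dequeues them:

```python
from collections import deque

class MonitoredQueue:
    """Toy model of targeted prefetching (names invented here): a monitor
    watches producer arrivals and records which items would be prefetched
    into the consumer core's L1 before the consumer dequeues them."""

    def __init__(self):
        self.queue = deque()
        self.prefetched = set()

    def produce(self, item):
        self.queue.append(item)
        self.prefetched.add(item)   # monitor sees the arrival, issues prefetch

    def consume(self):
        item = self.queue.popleft()
        return item, item in self.prefetched   # True would be an L1 hit
```

An arrival the monitor misses is consumed without a prefetch, which models today's situation where the consumer lacks a proper arrival-notification mechanism.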
Chopin: Composing Cost-Effective Custom Chips with Algorithmic Chiplets
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00069
Pete Ehrett, Todd M. Austin, V. Bertacco
As computational demands rise, the need for specialized hardware has grown acute. However, the immense cost of fully-custom chips has forced many developers to rely on suboptimal solutions like FPGAs, especially for low- to mid-volume applications, in which multi-million-dollar non-recurring engineering (NRE) costs cannot be amortized effectively. We propose to address this problem by composing custom chips out of small, algorithmic chiplets, reusable across diverse designs, such that high NRE costs may be amortized across many different designs. This work models the economics of this paradigm and identifies a cost-optimal granularity for algorithmic chiplets, then demonstrates how those guidelines may be applied to design high-performance, algorithmically-composable hardware components – which may be reused, without modification, across many different processing pipelines. For an example phased-array radar accelerator, our chiplet-centric paradigm improves perf-per-$ by 9.3× over an FPGA, and ∼4× over a conventional ASIC.
Citations: 3
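The NRE-amortization argument reduces to simple arithmetic. The dollar figures below are illustrative assumptions, not numbers from the paper:

```python
def unit_cost(nre_dollars, per_unit_dollars, volume):
    """Cost per chip once non-recurring engineering is amortized over volume."""
    return nre_dollars / volume + per_unit_dollars

# Illustrative assumptions only (not the paper's numbers): a full-custom
# design with $5M NRE versus a chiplet assembly with $0.5M integration NRE
# but a higher per-unit cost, both at a 10k-unit mid-volume run.
full_custom = unit_cost(5_000_000, 20, 10_000)   # 520.0 $/unit
chiplet = unit_cost(500_000, 30, 10_000)         # 80.0 $/unit
```

Even with a higher per-unit silicon cost, amortizing a much smaller NRE over the same volume dominates the unit economics at low to mid volumes, which is exactly where the paper positions chiplet reuse.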
Improved GPU Implementations of the Pair-HMM Forward Algorithm for DNA Sequence Alignment
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00055
Enliang Li, Subho Sankar Banerjee, Sitao Huang, R. Iyer, Deming Chen
With the rise of Next-Generation Sequencing (NGS) technology, clinical sequencing services have become more accessible but are also facing new challenges. The surging demand motivates the development of more efficient algorithms for computational genomics and their hardware acceleration. In this work, we use GPUs to accelerate DNA variant calling and its related alignment problem. The Pair-Hidden Markov Model (Pair-HMM) is one of the most popular and compute-intensive models used in variant calling. As a critical part of the Pair-HMM, the forward algorithm is both compute- and data-intensive. Multiple prior works have sought to accelerate the forward algorithm through massive parallelization of the workload. In this paper, we bring advanced GPU implementations with various optimizations, such as efficient host-device communication, task parallelization, pipelining, and memory management, to tackle this challenging task. Our design shows a speedup of 783X compared to the Java baseline on an Intel single-core CPU, 31.88X compared to the C++ baseline on an IBM Power8 multicore CPU, and 1.53X - 2.21X compared to previous state-of-the-art GPU implementations across various genomics datasets.
Citations: 3
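The forward algorithm at the heart of the Pair-HMM is the standard HMM dynamic-programming recurrence. Below is a single-sequence sketch with a toy two-state model and made-up probabilities; the Pair-HMM extends the same recurrence over a read/haplotype pair:

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """HMM forward algorithm: P(observation sequence) summed over all state
    paths. The Pair-HMM in variant calling uses the same dynamic-programming
    recurrence, extended over two sequences (read and haplotype)."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * trans_p[p][s] for p in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

# toy two-state model (match / mismatch), purely illustrative numbers
p = forward("AC", ("M", "X"),
            start_p={"M": 0.6, "X": 0.4},
            trans_p={"M": {"M": 0.7, "X": 0.3}, "X": {"M": 0.4, "X": 0.6}},
            emit_p={"M": {"A": 0.9, "C": 0.1}, "X": {"A": 0.2, "C": 0.8}})
```

Each cell of the table depends only on the previous column, which is what makes the wavefront/anti-diagonal GPU parallelizations the paper builds on possible.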
Efficient Methods for SoC Trust Validation Using Information Flow Verification
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00098
Khitam M. Alatoun, Shanmukha Murali Achyutha, R. Vemuri
Information flow properties are essential to identify security vulnerabilities in System-on-Chip (SoC) designs. Verifying information flow properties, such as integrity and confidentiality, is challenging as these properties cannot be handled using traditional assertion-based verification techniques. This paper proposes two novel approaches, a universal method and a property-driven method, to verify and monitor information flow properties. Both methods can be used for formal verification, dynamic verification during simulation, post-fabrication validation, and run-time monitoring. The universal method expedites implementing the information flow model and has less complexity than the most recently published technique. The property-driven method reduces the overhead of the security model, which helps speed up the verification process and create an efficient run-time hardware monitor. More than 20 information flow properties from 5 different designs were verified and several bugs were identified. We show that the method is scalable for large systems by applying it to an SoC design based on an OpenRISC-1200 processor.
Citations: 2
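Information-flow properties such as confidentiality are commonly checked by label propagation. This toy sketch is not the paper's method; it propagates a "secret" label through assignments and asks whether it reaches a public sink:

```python
def propagate_taint(ops, sources):
    """Toy information-flow check (not the paper's method): propagate a
    'secret' label through assignment ops, assumed to be in program order,
    and return the full tainted set."""
    tainted = set(sources)
    for dst, srcs in ops:                # each op models dst = f(*srcs)
        if any(s in tainted for s in srcs):
            tainted.add(dst)
    return tainted

# confidentiality violation: a key byte reaches a debug port via a temporary
ops = [("tmp", ["key"]), ("debug_out", ["tmp"])]
leaked = "debug_out" in propagate_taint(ops, {"key"})
```

Traditional assertions check values at a point in time; a property like "key never reaches debug_out" instead constrains all paths data can take, which is why it needs a flow model like this rather than a plain assertion.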
Special Session: How much quality is enough quality? A case for acceptability in approximate designs
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00013
Isaías B. Felzmann, João Fabrício Filho, Juliane Regina de Oliveira, L. Wanner
Approximate systems are designed to offer improved efficiency with potentially reduced quality of results. Quality of output in these systems is typically quantified in comparison to a precise result using metrics such as RMSE, MAE, PSNR, or application-specific metrics such as structural similarity of images (SSIM). Furthermore, systems are typically designed to maximize efficiency for a given minimum quality requirement. It is often difficult to determine what this quality requirement should be for an application, let alone a system. Thus, a fixed quality requirement may be overly conservative, and leave optimization opportunities on the table. In this work, we present a different approach to evaluate approximate systems based on the usefulness of results instead of quality. Our method qualitatively determines the acceptability of approximate results within different processing pipelines. To demonstrate the method, we implement three image and signal processing applications featuring scenarios of image classification, image recognition, and frequency estimation. Our results show that designing approximate systems to guarantee acceptability can produce up to 20% more valid results than the conservative quality thresholds commonly adopted in the literature, allowing for higher error rates and, consequently, lower energy cost.
Citations: 1
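The distinction between a fixed quality threshold and acceptability can be made concrete. A minimal sketch with invented data, where an output fails a strict RMSE gate but still yields the correct downstream decision:

```python
import math

def rmse(ref, out):
    """Root-mean-square error between a precise and an approximate output."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref))

def acceptable(out, consumer):
    """Acceptability: the approximate output is useful if the downstream
    consumer still reaches the correct decision, regardless of error metrics."""
    return consumer(out)

ref = [10, 10, 10, 10]
approx = [12, 9, 11, 8]       # RMSE ~ 1.58: rejected by a strict 1.0 quality gate
is_bright = lambda img: sum(img) / len(img) > 5   # toy downstream classifier
```

Here a conservative RMSE threshold would discard `approx`, yet the classification pipeline it feeds is unaffected, which is the optimization opportunity the paper argues fixed quality requirements leave on the table.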
Functional Locking through Omission: From HLS to Obfuscated Design
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00095
Z. Wang, S. Mohammed, Y. Makris, Benjamin Carrión Schäfer
VLSI design companies are now mainly fabless and spend large amounts of resources to develop their Intellectual Property (IP). It is therefore paramount to protect their IPs from being stolen and illegally reverse engineered. The main approach so far to protect the IP has been to add locking logic such that the circuit does not meet the given specifications unless the user applies the correct key. The main problem with this approach is that the fabless company has to submit the entire design, including the locking circuitry, to the fab. Moreover, these companies often subcontract the VLSI design back-end to a third party. This implies that the third-party company or the fab could potentially tamper with the locking mechanism. One alternative approach is to lock through omission. The main idea is to judiciously select a portion of the design and map it onto an embedded FPGA (eFPGA). In this case, the bitstream acts as the logic key. Neither the third-party company nor the fab will, in this case, have access to the locking mechanism, as the eFPGA is left un-programmed. This is obviously a more secure way to lock the circuit. The main problem with this approach is the area, power, and delay overhead associated with it. To address this, in this work, we present a framework that takes as input an untimed behavioral description for High-Level Synthesis (HLS) and automatically extracts a portion of the circuit to the eFPGA such that the area overhead is minimized while the original timing constraint is not violated.
Citations: 1
Accelerating Sub-Block Erase in 3D NAND Flash Memory
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00045
H. Gong, Zhirong Shen, J. Shu
3D flash memory removes the scaling limitations of planar flash memory, yet it is still plagued by the tedious GC process due to the “big block problem”. In this paper, we propose SpeedupGC, a framework that incorporates the characteristics of data updates into existing sub-block erase designs. The main idea of SpeedupGC is to direct hotly-updated data to the blocks that are about to be erased, so as to speculatively produce more invalid pages and suppress the relocation overhead. We conduct extensive trace-driven experiments, showing that SpeedupGC reduces GC latency by 64.7%, read latency by 21.8%, write latency by 17.7%, and write amplification by 11.5% on average, when compared to state-of-the-art designs.
引用次数: 1
Improving the Heavy Re-encryption Overhead of Split Counter Mode Encryption for NVM
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00073
Qianqian Pei, Seunghee Shin
Emerging non-volatile memory technology enables non-volatile main memory (NVMM), which can provide larger capacity and better energy-saving opportunities than DRAM. However, its non-volatility raises security concerns: the data in an NVMM can be extracted if the memory module is stolen. Therefore, the data must stay encrypted outside the processor boundary. Such encryption requires decryption before the data can be used by the processor, adding extra latency to performance-critical read operations. Split counter mode encryption hides this latency but introduces frequent page re-encryptions as a trade-off. We find that this re-encryption overhead worsens on NVMM, whose high latency negates prior optimizations. To mitigate the overhead, we re-design the encryption scheme based on two key observations. First, we observe that NVMMs only need counters that can count up to twice their lifetime. Second, we observe diminishing returns on counter size: increasing the counter size further does not necessarily decrease the re-encryption frequency. Our new designs re-arrange the inefficiently used bits to reduce the re-encryption overhead. In our tests, the two designs, 3-level split counter mode encryption and 8-block split counter mode encryption, effectively reduce the re-encryption overhead by 63% and 66%, improving performance by up to 26% and 30%, and by 8% and 9% on average, compared to the original split counter scheme.
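The overflow-driven re-encryption described above can be sketched with a minimal model of the baseline split counter scheme: one major counter shared by a page plus a small minor counter per cache line, where a minor-counter overflow forces the major counter to advance and the whole page to be re-encrypted. The 64-line/7-bit parameters below are common illustrative values and this models the baseline, not the paper's proposed 3-level or 8-block designs.

```python
# Minimal model of baseline split counter mode: a page shares one major
# counter, each cache line gets a small minor counter, and a line's
# encryption counter is the pair (major, minor). When any minor counter
# would wrap, the major counter advances, all minors reset, and the
# entire page must be re-encrypted. Parameters are illustrative.

class SplitCounterPage:
    def __init__(self, lines=64, minor_bits=7):
        self.major = 0
        self.limit = (1 << minor_bits) - 1  # max minor value before wrap
        self.minors = [0] * lines
        self.reencryptions = 0

    def write(self, line):
        if self.minors[line] == self.limit:
            # Minor overflow: advance the major counter, reset all
            # minors, and count one whole-page re-encryption.
            self.major += 1
            self.minors = [0] * len(self.minors)
            self.reencryptions += 1
        self.minors[line] += 1
```

Repeatedly writing a single line forces a whole-page re-encryption roughly every 2^7 writes while the other 63 minor counters sit far from overflow; these are the inefficiently used bits that the abstract's redesigns re-arrange.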
Citations: 1
Compiling and Optimizing Real-world Programs for STRAIGHT ISA
Pub Date : 2021-10-01 DOI: 10.1109/ICCD53106.2021.00070
Toru Koizumi, Shu Sugita, Ryota Shioya, J. Kadomoto, H. Irie, S. Sakai
The renaming unit of a superscalar processor is a very expensive module. It consumes large amounts of power and limits the front-end bandwidth. To overcome this problem, an instruction set architecture called STRAIGHT has been proposed. Owing to its unique manner of referencing operands, STRAIGHT does not cause false dependencies and allows out-of-order execution without register renaming. However, the compiler optimization techniques for STRAIGHT are still immature, and we found that the naive code generators currently available can generate inefficient code with additional instructions. In this paper, we propose two novel compiler optimization techniques and a novel calling convention for STRAIGHT to reduce the number of instructions. We compiled real-world programs with a compiler that implemented these techniques and measured their performance through simulation. The evaluation results show that the proposed methods reduced the number of executed instructions by 15% and improved the performance by 17%.
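The distance-based operand referencing that makes renaming unnecessary can be illustrated with a toy interpreter. The three-operand mini-ISA below is invented for illustration and is not STRAIGHT's actual encoding; the point is only that each instruction writes exactly one new result and an operand names "the result produced k instructions ago", so no result is ever overwritten and no false dependency can arise.

```python
# Toy interpreter for distance-based operand referencing: every
# instruction appends exactly one result, and an operand "d" means "the
# result produced d instructions ago". Because results are written once
# and never overwritten, there are no false (WAR/WAW) dependencies to
# rename away. This mini-ISA is invented for illustration.

def run(program, inputs):
    results = list(inputs)  # seed values, as if produced by earlier instructions
    for op, d1, d2 in program:
        a = results[-d1]  # value produced d1 instructions ago
        b = results[-d2]  # value produced d2 instructions ago
        results.append(a + b if op == "add" else a - b)
    return results[-1]
```

For example, `run([("add", 2, 1), ("sub", 1, 3)], [3, 4])` computes (3 + 4) - 3 = 4: the `sub` names the `add`'s result by distance 1 and the first seed value by distance 3.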
Citations: 3
Journal
2021 IEEE 39th International Conference on Computer Design (ICCD)