AGQFL: Communication-efficient Federated Learning via Automatic Gradient Quantization in Edge Heterogeneous Systems
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00089
Zirui Lian, Jing Cao, Yanru Zuo, Weihong Liu, Zongwei Zhu
With the widespread use of artificial intelligence (AI) applications and the dramatic growth in data volumes from edge devices, many recent works place the training of AI models onto edge devices. The state-of-the-art edge training framework, federated learning (FL), requires a large amount of data to be transferred between edge devices and the central server, causing heavy communication overhead. Gradient compression techniques are widely used to alleviate this overhead. However, edge devices usually differ in bandwidth, causing communication heterogeneity. Existing gradient compression techniques usually adopt a fixed compression rate and do not account for the straggler problem caused by this heterogeneity. To address these issues, we propose AGQFL, an automatic gradient quantization method consisting of three modules: a quantization indicator module, a quantization strategy module, and a quantization optimizer module. The quantization indicator module automatically determines the adjustment direction of quantization precision by measuring the convergence ability of the current model. Following the indicator and the physical bandwidth of each node, the quantization strategy module adjusts the quantization precision at run-time. Furthermore, the quantization optimizer module introduces a new optimizer to reduce training bias and eliminate instability during the training process. Experimental results show that AGQFL can greatly speed up training in edge AI systems while maintaining or even improving model accuracy.
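As a concrete illustration of indicator-driven precision adjustment, here is a minimal Python sketch: gradients are uniformly quantized to a bit-width that is raised when the loss stops improving and lowered while convergence is still fast. The indicator (loss improvement), the threshold, and the bit-width bounds are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch of adaptive gradient quantization in the spirit of AGQFL.
# Indicator, thresholds, and bounds are invented for illustration.
import numpy as np

def quantize(grad, bits):
    """Uniform quantization of a gradient tensor to `bits` bits."""
    levels = 2 ** bits - 1
    g_min, g_max = grad.min(), grad.max()
    scale = (g_max - g_min) / levels if g_max > g_min else 1.0
    q = np.round((grad - g_min) / scale)
    return q * scale + g_min  # dequantized view the server would reconstruct

def adjust_precision(bits, loss_history, lo=2, hi=8, eps=1e-3):
    """Raise precision when the loss plateaus (model needs finer gradients),
    lower it while the loss is still dropping fast (coarse gradients suffice)."""
    if len(loss_history) < 2:
        return bits
    improvement = loss_history[-2] - loss_history[-1]
    if improvement < eps:        # convergence stalling -> more precision
        return min(bits + 1, hi)
    return max(bits - 1, lo)     # still converging -> save bandwidth

bits, losses = 4, [2.31, 1.80, 1.42, 1.41]
grad = np.random.randn(1000).astype(np.float32)
bits = adjust_precision(bits, losses)
print(bits, np.abs(grad - quantize(grad, bits)).mean())
```

In a full system, the strategy module would further cap each node's bit-width by its measured bandwidth so slow links transmit fewer bits per round.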
{"title":"AGQFL: Communication-efficient Federated Learning via Automatic Gradient Quantization in Edge Heterogeneous Systems","authors":"Zirui Lian, Jing Cao, Yanru Zuo, Weihong Liu, Zongwei Zhu","doi":"10.1109/ICCD53106.2021.00089","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00089","url":null,"abstract":"With the widespread use of artificial intelligent (AI) applications and dramatic growth in data volumes from edge devices, there are currently many works that place the training of AI models onto edge devices. The state-of-the-art edge training framework, federated learning (FL), requires to transfer of a large amount of data between edge devices and the central server, which causes heavy communication overhead. To alleviate the communication overhead, gradient compression techniques are widely used. However, the bandwidth of the edge devices is usually different, causing communication heterogeneity. Existing gradient compression techniques usually adopt a fixed compression rate and do not take the straggler problem caused by the communication heterogeneity into account. To address these issues, we propose AGQFL, an automatic gradient quantization method consisting of three modules: quantization indicator module, quantization strategy module and quantization optimizer module. The quantization indicator module automatically determines the adjustment direction of quantization precision by measuring the convergence ability of the current model. Following the indicator and the physical bandwidth of each node, the quantization strategy module adjusts the quantization precision at run-time. Furthermore, the quantization optimizer module designs a new optimizer to reduce the training bias and eliminate the instability during the training process. Experimental results show that AGQFL can greatly speed up the training process in edge AI systems while maintaining or even improving model accuracy.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115637111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Energy-Efficient MAC Units for Fused Posit Arithmetic
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00032
Raul Murillo, David Mallasén, Alberto A. Del Barrio, Guillermo Botella Juan
Posit arithmetic is an alternative to the standard IEEE 754 floating-point format that claims to provide compelling advantages over floats, including higher accuracy, a larger dynamic range, and bitwise compatibility across systems. Interest in the design of arithmetic units for this novel format has increased in the last few years. However, while multiple designs for posit adders and multipliers have recently been developed in the literature, fused units for posit arithmetic are still in the early stages of research. Moreover, due to the large accumulators needed for fused operations, the few fused posit units proposed so far still require many hardware resources. In order to contribute to the development of the posit number format and facilitate its use in applications such as deep learning, this paper presents several designs of energy-efficient posit multiply-accumulate (MAC) units with support for the standard quire format. Concretely, the proposed designs are capable of computing fused dot products of large vectors without accuracy loss, while consuming less energy than previous implementations. Experiments show that, compared to previous implementations, the proposed designs consume up to 75.49%, 88.45% and 83.43% less energy and are 73.18%, 87.36% and 83.00% faster for 8-, 16- and 32-bit widths, with an additional area of only 4.97%, 7.44% and 4.24%, respectively.
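The benefit of a quire, an exact wide accumulator that defers rounding to the end of a fused dot product, can be illustrated without posit hardware. The sketch below uses Python's Fraction as a stand-in for the quire and float16 as a stand-in for a low-precision posit; both substitutions are assumptions for illustration only (real quire hardware uses a fixed-point accumulator).

```python
# Illustrative-only: accumulate a dot product exactly and round once at the end,
# versus rounding after every multiply-accumulate step.
from fractions import Fraction
import numpy as np

def dot_round_each_step(a, b):
    acc = np.float16(0)
    for x, y in zip(a, b):
        acc = np.float16(acc + np.float16(np.float16(x) * np.float16(y)))
    return float(acc)

def dot_quire_style(a, b):
    acc = Fraction(0)                      # "quire": exact, no intermediate rounding
    for x, y in zip(a, b):
        acc += Fraction(float(np.float16(x))) * Fraction(float(np.float16(y)))
    return float(np.float16(float(acc)))   # single final rounding

rng = np.random.default_rng(0)
a, b = rng.normal(size=512), rng.normal(size=512)
print(dot_round_each_step(a, b), dot_quire_style(a, b))
```

For long vectors the two results diverge noticeably, which is exactly the accuracy drop that quire-backed fused MAC units are designed to avoid.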
{"title":"Energy-Efficient MAC Units for Fused Posit Arithmetic","authors":"Raul Murillo, David Mallasén, Alberto A. Del Barrio, Guillermo Botella Juan","doi":"10.1109/ICCD53106.2021.00032","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00032","url":null,"abstract":"Posit arithmetic is an alternative format to the standard IEEE 754 for floating-point numbers that claims to provide compelling advantages over floats, including higher accuracy, larger dynamic range, or bitwise compatibility across systems. The interest in the design of arithmetic units for this novel format has increased in the last few years. However, while multiple designs for posit adder and multiplier have been developed recently in the literature, fused units for posit arithmetic are still in the early stages of research. Moreover, due to the large size of accumulators needed in fused operations, the few fused posit units proposed so far still require many hardware resources. In order to contribute to the development of the posit number format, and facilitate its use in applications such as deep learning, this paper presents several designs of energy-efficient posit multiply- accumulate (MAC) units with support for standard quire format. Concretely, the proposed designs are capable of computing fused dot products of large vectors without accuracy drop, while consuming less energy than previous implementations. Experiments show that, compared to previous implementations, the proposed designs consume up to 75.49%, 88.45% and 83.43% less energy and are 73.18%, 87.36% and 83.00% faster for 8, 16 and 32 bitwidths, with an additional area of only 4.97%, 7.44% and 4.24%, respectively.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122560820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accurate and Fast Performance Modeling of Processors with Decoupled Front-end
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00025
Yuya Degawa, Toru Koizumi, Tomoki Nakamura, Ryota Shioya, J. Kadomoto, H. Irie, S. Sakai
Various techniques, such as cache replacement algorithms and prefetching, have been studied to prevent instruction cache misses from becoming a bottleneck in the processor front-end. In such studies, the design goal has been to reduce the number of instruction cache misses. However, owing to the increasing complexity of modern processors, the correlation between reducing instruction cache misses and reducing the number of executed cycles has become weaker than it once was. In this paper, we propose a new guideline for improving the performance of modern processors. In addition, we propose a method for estimating the approximate performance of a design two orders of magnitude faster than a full simulation, each time designers modify their design.
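For intuition, a first-order analytical model of this kind combines event counts with per-event penalties and an overlap factor. The coefficients and the single overlap factor below are invented for illustration and are far simpler than the paper's actual model.

```python
# Crude first-order cycle estimate from event counts. In a decoupled front-end,
# part of each miss penalty hides behind the fetch-target queue, so only a
# fraction (1 - overlap) of it reaches the back-end. All numbers are assumptions.
def estimate_cycles(n_insts, ipc_ideal, icache_misses, miss_penalty, overlap=0.4):
    base = n_insts / ipc_ideal
    stall = icache_misses * miss_penalty * (1.0 - overlap)
    return base + stall

print(estimate_cycles(n_insts=1_000_000, ipc_ideal=4.0,
                      icache_misses=20_000, miss_penalty=30))
```

Because the overlap term scales the miss penalty, two designs with the same miss count can differ substantially in executed cycles, which is why miss count alone correlates weakly with performance.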
{"title":"Accurate and Fast Performance Modeling of Processors with Decoupled Front-end","authors":"Yuya Degawa, Toru Koizumi, Tomoki Nakamura, Ryota Shioya, J. Kadomoto, H. Irie, S. Sakai","doi":"10.1109/ICCD53106.2021.00025","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00025","url":null,"abstract":"Various techniques, such as cache replacement algorithms and prefetching, have been studied to prevent instruction cache misses from becoming a bottleneck in the processor frontend. In such studies, the goal of the design has been to reduce the number of instruction cache misses. However, owing to the increasing complexity of modern processors, the correlation between reducing instruction cache misses and reducing the number of executed cycles has become smaller than in previous cases. In this paper, we propose a new guideline for improving the performance of modern processors. In addition, we propose a method for estimating the approximate performance of a design two orders of magnitude faster than a full simulation each time the designers modify their design.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114268750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Novel Ultra-Low-Voltage Flip-Flops: Near-Vth Modeling and VLSI Integration
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00021
A. Ranasinghe, S. H. Gerez
This paper presents two novel ultra-low-voltage (ULV) Single-Edge-Triggered flip-flops (SET-FFs) based on the True-Single-Phase-Clocking (TSPC) scheme. By exploiting the TSPC principle, the overall energy efficiency is improved compared to traditional flip-flop designs while providing fully static, contention-free functionality suitable for ULV operation. At the 0.5V near-Vth level in 65nm bulk CMOS technology, the proposed SET-FFs demonstrate energy-efficiency improvements of 11-45% and 7-20% at 0% and 100% data activity rates, respectively, compared to the best-known SET-FFs. The proposed SET-FF can safely operate down to a 0.24V supply voltage without corrupting rail-to-rail voltage levels at its internal nodes. Integrating the proposed SET-FFs into a 320-bit parallel shift register demonstrated clock-network power reductions of up to 33% and register power reductions of 17-39% compared to state-of-the-art and commercial standard cells at the near-Vth level. In addition to these merits, with the aid of parasitic modeling, this paper re-evaluates the vital performance metrics of SET-FFs in the near-Vth voltage domain, improving their characterization accuracy and enabling VLSI integration for commercial end-use.
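For context, the quadratic dependence of switching energy on supply voltage is what makes near-Vth operation attractive; the back-of-the-envelope sketch below works through that arithmetic, with the node capacitance chosen purely as an assumption.

```python
# Dynamic switching energy scales as activity * C * Vdd^2, so dropping from a
# nominal 1.2 V to a near-Vth 0.5 V cuts it to (0.5/1.2)^2 ~ 0.17x.
def switching_energy(c_farads, vdd, activity=1.0):
    return activity * c_farads * vdd ** 2

c = 2e-15  # 2 fF illustrative node capacitance (assumption)
for vdd in (1.2, 0.5, 0.24):
    print(f"Vdd={vdd} V -> {switching_energy(c, vdd) * 1e15:.2f} fJ per toggle")
```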
{"title":"Novel Ultra-Low-Voltage Flip-Flops: Near-Vth Modeling and VLSI Integration","authors":"A. Ranasinghe, S. H. Gerez","doi":"10.1109/ICCD53106.2021.00021","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00021","url":null,"abstract":"This paper presents two novel ultra-low-voltage (ULV) Single-Edge-Triggered flip-flops (SET-FF) based on the True-Single-Phase-Clocking (TSPC) scheme. By exploiting the TSPC principle, the overall energy efficiency has been improved compared to the traditional flip-flop designs while providing fully static, contention-free functionality to satisfy ULV operation. At 0.5V near-Vth level in 65nm bulk CMOS technology, the proposed SET-FFs demonstrate up to 11-45% and 7-20% of energy efficiency at 0% and 100% data activity rates compared to the best known SET-FFs. The proposed SET-FF can safely operate down to 0.24V of supply voltage without corrupting rail-to-rail voltage levels at its internal nodes. The integration of proposed SET-FFs in a 320-bit parallel shift register demonstrated up to 33% of clock network power, 17-39% of register power reductions compared to the state-of-the-art and commercial standard-cells at near-Vth level. In addition to these merits, with the aid of parasitic modeling, this paper re-evaluates the vital performance metrics of SET-FFs at near-Vth voltage domain, improving their characterization accuracy and enabling the VLSI integration for commercial end-use.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129272061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Seer-SSD: Bridging Semantic Gap between Log-Structured File Systems and SSDs to Reduce SSD Write Amplification
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00020
You Zhou, Ke Wang, Fei Wu, Changsheng Xie, Hao Lv
Log-structured file systems (LS-FSs) sequentialize writes, so they are expected to perform well on flash-based SSDs. However, we observe a semantic gap between the LS-FS and the SSD that causes a stale-LBA problem. When data are updated, the LS-FS allocates new logical block addresses (LBAs). The relevant stale LBAs are invalidated and then trimmed or reused by the LS-FS only after a delay. During this interval, stale LBAs are temporarily regarded as valid and migrated unnecessarily by garbage collection in the SSD. Our experimental study of real-world traces reveals that stale-LBA migrations amount to 59%-150% of host data writes. To solve this serious problem, we propose Seer-SSD, which delivers stale-LBA metadata along with written data from the LS-FS to the SSD. Stale LBAs are then invalidated actively and selectively in the SSD without compromising file system consistency. Seer-SSD can be implemented easily on top of existing block interfaces and maintains compatibility with non-LS-FSs. We perform a case study on an emulated NVMe SSD hosting F2FS (a state-of-the-art LS-FS). Experimental results with popular databases show that Seer-SSD improves throughput by 99.8% and reduces write amplification by 53.6%, on average, compared to a traditional SSD unaware of stale LBAs.
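A toy model makes the stale-LBA effect visible: baseline garbage collection migrates every untrimmed page of a victim block, while a stale-LBA-aware collector can skip pages the file system has already invalidated. The data structures and values below are invented for illustration; they are not Seer-SSD's actual interface.

```python
# Toy model of the stale-LBA problem: GC migrates any page whose LBA has not
# been trimmed, even if the file system already invalidated it. Passing the
# file system's stale set down to GC (the Seer-SSD idea, sketched with invented
# structures) lets those migrations be skipped.
def gc_migrate(victim_pages, trimmed, fs_stale, stale_aware):
    migrated = []
    for lba in victim_pages:
        if lba in trimmed:
            continue                 # already invalid to the SSD
        if stale_aware and lba in fs_stale:
            continue                 # stale-LBA-aware: skip known-stale data
        migrated.append(lba)         # baseline migrates it unnecessarily
    return migrated

victim = list(range(10))
trimmed = {0, 1}                     # LBAs the FS already trimmed
fs_stale = {2, 3, 4, 5}              # invalidated in the LS-FS, trim still pending
print(len(gc_migrate(victim, trimmed, fs_stale, stale_aware=False)))  # 8 migrations
print(len(gc_migrate(victim, trimmed, fs_stale, stale_aware=True)))   # 4 migrations
```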
{"title":"Seer-SSD: Bridging Semantic Gap between Log-Structured File Systems and SSDs to Reduce SSD Write Amplification","authors":"You Zhou, Ke Wang, Fei Wu, Changsheng Xie, Hao Lv","doi":"10.1109/ICCD53106.2021.00020","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00020","url":null,"abstract":"Log-structured file systems (LS-FSs) sequentialize writes, so they are expected to perform well on flash-based SSDs. However, we observe a semantic gap between the LS- FS and SSD that causes a stale-LBA problem. When data are updated, the LS-FS allocates new logical block addresses (LBAs). The relevant stale LBAs are invalidated and then trimmed or reused with a delay by the LS-FS. During the time interval, stale LBAs are regarded temporarily as valid and migrated unnecessarily by garbage collection in the SSD. Our experimental study of real-world traces reveals that stale-LBA migrations amount to 59%-150% of host data writes. To solve this serious problem, we propose Seer-SSD to deliver stale-LBA metadata along with written data from the LS-FS to the SSD. Then, stale LBAs are invalidated actively and selectively in the SSD without compromising file system consistency. Seer-SSD can be implemented easily based on existing block interfaces and maintain compatibility with non-LS-FSs. We perform a case study on an emulated NVMe SSD hosting F2FS (a state-of-the- art LS-FS). Experimental results with popular databases show that Seer-SSD improves the throughput by 99.8% and reduces the write amplification by 53.6%, on average, compared to a traditional SSD unaware of stale LBAs.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"185 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120899315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CoRe-ECO: Concurrent Refinement of Detailed Place-and-Route for an Efficient ECO Automation
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00065
Chung-Kuan Cheng, A. Kahng, Ilgweon Kang, Minsoo Kim, Daeyeal Lee, Bill Lin, Dongwon Park, M. Woo
With the relentless scaling of technology nodes, physical design engineers encounter non-trivial challenges caused by rapidly increasing design complexity, particularly in the routing stage. Back-end designers must manually stitch/modify all of the design rule violations (DRVs) that remain after automatic place-and-route (P&R), during the implementation of engineering change orders (ECOs). In this paper, we propose CoRe-ECO, a concurrent refinement framework for efficient automation of the ECO process. Our framework efficiently resolves pin accessibility-induced DRVs by simultaneously performing detailed placement, detailed routing, and cell replacement. In addition to perturbation-minimized solutions, our proposed SMT-based optimization framework also suggests the adoption of alternative master cells to better achieve DRV-clean layouts. We demonstrate that our framework successfully resolves from 33.3% to 100.0% (58.6% on average) of remaining DRVs on M1-M3 layers, across a range of benchmark circuits with various cell architectures, while also providing average total wirelength reduction of 0.003%.
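To give a flavor of the constraint formulation, the sketch below picks perturbation-minimal x-shifts for a row of pins so that a minimum-spacing rule (a stand-in for a pin-accessibility DRV) holds. It uses brute-force search rather than an SMT solver, and the one-dimensional geometry is an illustrative assumption; CoRe-ECO's actual formulation jointly covers placement, routing, and master-cell selection.

```python
# Minimal stand-in for an SMT-style ECO fix: choose small x-shifts so pin
# positions satisfy a min-spacing rule, minimizing total perturbation.
from itertools import product

def fix_pin_spacing(pin_x, min_space=2, max_shift=3):
    best = None
    for shifts in product(range(-max_shift, max_shift + 1), repeat=len(pin_x)):
        xs = sorted(p + s for p, s in zip(pin_x, shifts))
        if all(b - a >= min_space for a, b in zip(xs, xs[1:])):  # DRV-clean
            cost = sum(abs(s) for s in shifts)                   # perturbation
            if best is None or cost < best[0]:
                best = (cost, shifts)
    return best

print(fix_pin_spacing([10, 11, 12]))  # e.g., minimal total shift of 2
```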
{"title":"CoRe-ECO: Concurrent Refinement of Detailed Place-and-Route for an Efficient ECO Automation","authors":"Chung-Kuan Cheng, A. Kahng, Ilgweon Kang, Minsoo Kim, Daeyeal Lee, Bill Lin, Dongwon Park, M. Woo","doi":"10.1109/ICCD53106.2021.00065","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00065","url":null,"abstract":"With the relentless scaling of technology nodes, physical design engineers encounter non-trivial challenges caused by rapidly increasing design complexity, particularly in the routing stage. Back-end designers must manually stitch/modify all of the design rule violations (DRVs) that remain after automatic place-and-route (P&R), during the implementation of engineering change orders (ECOs). In this paper, we propose CoRe-ECO, a concurrent refinement framework for efficient automation of the ECO process. Our framework efficiently resolves pin accessibility-induced DRVs by simultaneously performing detailed placement, detailed routing, and cell replacement. In addition to perturbation-minimized solutions, our proposed SMT-based optimization framework also suggests the adoption of alternative master cells to better achieve DRV-clean layouts. We demonstrate that our framework successfully resolves from 33.3% to 100.0% (58.6% on average) of remaining DRVs on M1-M3 layers, across a range of benchmark circuits with various cell architectures, while also providing average total wirelength reduction of 0.003%.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114201325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Special Session: ADAPT: ANN-ControlleD System-Level Runtime Adaptable APproximate CompuTing
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00012
Prattay Chowdhury, B. C. Schafer
Approximate computing has been shown to be an effective approach to generating smaller and more power-efficient circuits by trading circuit accuracy for area and/or power. So far, most work on approximate computing has focused on specific components within a system. This severely limits the approximation potential, as most Integrated Circuits (ICs) are now complex heterogeneous systems. An additional limitation of current work in this domain is the assumption that the training data matches the actual workload. This is not always true, as these complex Systems-on-Chip (SoCs) are used for a variety of different applications. To address these issues, this work investigates whether lower-power designs can be found by mixing approximations across the different components of the SoC, as opposed to only aggressively approximating a single component. The main hypothesis is that some approximations amplify across the system while others tend to cancel each other out, allowing power savings to be maximized while meeting the given maximum error threshold. To investigate this, we propose a method called ADAPT. ADAPT uses a neural network-based controller to dynamically adjust the supply voltage (Vdd) of different components in the SoC at runtime based on the actual workload.
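The controller idea can be sketched as a small feed-forward network mapping workload features to a discrete Vdd choice per component. The weights below are random stand-ins (ADAPT's controller is trained so the mixed-approximation error stays under the threshold), and the feature set and rail values are assumptions.

```python
# Illustrative-only ANN controller choosing per-component supply voltages from
# workload features. Untrained random weights; structure only.
import numpy as np

rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # 4 workload features -> hidden
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)   # hidden -> 3 SoC components

VDD_LEVELS = np.array([0.7, 0.8, 0.9, 1.0])     # assumed discrete rails

def choose_vdd(features):
    h = np.tanh(features @ W1 + b1)
    scores = h @ W2 + b2                         # one score per component
    idx = np.clip(np.round((np.tanh(scores) + 1) / 2 * 3), 0, 3).astype(int)
    return VDD_LEVELS[idx]

workload = np.array([0.6, 0.1, 0.8, 0.3])        # e.g., utilization statistics
print(choose_vdd(workload))                      # per-component Vdd choices
```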
{"title":"Special Session: ADAPT: ANN-ControlleD System-Level Runtime Adaptable APproximate CompuTing","authors":"Prattay Chowdhury, B. C. Schafer","doi":"10.1109/ICCD53106.2021.00012","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00012","url":null,"abstract":"Approximate computing has shown to be an effective approach to generate smaller and more power-efficient circuits by trading the accuracy of the circuit vs. area and/or power. So far, most work on approximate computing has focused on specific components within a system. It severely limits the approximation potential as most Integrated Circuits (ICs) are now complex heterogeneous systems. One additional limitation of current work in this domain is they assume that the training data matches the actual workload. This is nevertheless not always true as these complex Systems-on-Chip (SoCs) are used for a variety of different applications. To address these issues, this work investigates if lower-power designs can be found through mixing approximations across the different components in the SoC as opposed to only aggressively approximating a single component. The main hypothesis is that some approximations amplify across the system, while others tend to cancel each other out, thus, allowing to maximize the power savings while meeting the given maximum error threshold. To investigate this, we propose a method called ADAPT. ADAPT uses a neural network-based controller to dynamically adjust the supply voltage (Vdd) of different components in SoC at runtime based on the actual workload.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121709522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fault-Aware Prediction-Guided Page Offlining for Uncorrectable Memory Error Prevention
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00077
Xiaoming Du, Cong Li, Shen Zhou, Xian Liu, Xiaohan Xu, Tianjiao Wang, Shi-Lun Ge
Uncorrectable memory errors are a major cause of hardware failures in datacenters, leading to server crashes. Page offlining is an error-prevention mechanism implemented in modern operating systems. Traditional offlining policies are based on the correctable error (CE) rate of a page over a past period. However, CEs are just observations; the underlying causes are memory circuit faults. A single fault, such as a row fault, can impact quite a few pages. Meanwhile, not all faults are equally prone to uncorrectable errors (UEs). In this paper, we propose a fault-aware, prediction-guided policy for page offlining. In the proposed policy, we first identify row faults based on CE observations as preliminary candidates for offlining. Leveraging knowledge of the error correction code, we design a predictor based on error-bit patterns to predict whether a row fault is prone to UEs. Pages impacted by the UE-prone rows are then offlined. Empirical evaluation using the error log from a modern large-scale cluster at ByteDance demonstrates that the proposed policy avoids several times more UEs than the traditional policy at a comparable cost of memory capacity lost to page offlining.
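The two-step policy might look roughly like the following: group CEs that share a physical row to infer row faults, keep only rows whose error-bit patterns look UE-prone, and offline the affected pages. The log fields and the multi-bit-pattern heuristic are invented stand-ins for the paper's ECC-aware predictor.

```python
# Sketch with invented log fields: (1) infer row faults by grouping CEs on the
# same row, (2) flag rows whose error-bit patterns look UE-prone, (3) offline
# the pages those rows map to.
from collections import defaultdict

def ue_prone(ce_records, min_cols=2):
    cols = {c["col"] for c in ce_records}
    multibit = any(bin(c["syndrome_bits"]).count("1") > 1 for c in ce_records)
    return len(cols) >= min_cols and multibit    # illustrative heuristic only

def pages_to_offline(ce_log, page_of):
    rows = defaultdict(list)
    for ce in ce_log:
        rows[(ce["dimm"], ce["bank"], ce["row"])].append(ce)
    offline = set()
    for key, ces in rows.items():
        if len(ces) >= 2 and ue_prone(ces):      # row fault + UE-prone pattern
            offline.update(page_of(key, ce["col"]) for ce in ces)
    return offline

ce_log = [
    {"dimm": 0, "bank": 1, "row": 7, "col": 3, "syndrome_bits": 0b11},
    {"dimm": 0, "bank": 1, "row": 7, "col": 9, "syndrome_bits": 0b1},
]
print(pages_to_offline(ce_log, lambda row_key, col: (row_key, col // 8)))
```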
{"title":"Fault-Aware Prediction-Guided Page Offlining for Uncorrectable Memory Error Prevention","authors":"Xiaoming Du, Cong Li, Shen Zhou, Xian Liu, Xiaohan Xu, Tianjiao Wang, Shi-Lun Ge","doi":"10.1109/ICCD53106.2021.00077","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00077","url":null,"abstract":"Uncorrectable memory errors are the major causes of hardware failures in datacenters leading to server crashes. Page offlining is an error-prevention mechanism implemented in modern operating systems. Traditional offlining policies are based on correctable error (CE) rate of a page in a past period. However, CEs are just the observations while the underlying causes are memory circuit faults. A certain fault such as a row fault can impact quite a few pages. Meanwhile, not all faults are equally prone to uncorrectable errors (UEs). In this paper, we propose a fault-aware prediction-guide policy for page offlining. In the proposed policy, we first identify row faults based on CE observations as the preliminary candidates for offlining. Leveraging the knowledge of the error correction code, we design a predictor based on error-bit patterns to predict whether a row fault is prone to UEs or not. Pages impacted by the UE-prone rows are then offlined. Empirical evaluation using the error log from a modern large-scale cluster in ByteDance demonstrates that the proposed policy avoids several times more UEs than the traditional policy does at a comparable cost of memory capacity loss due to page offlining.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121618723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Legion: Tailoring Grouped Neural Execution Considering Heterogeneity on Multiple Edge Devices
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00067
Kyunghwan Choi, Seongju Lee, Beom Woo Kang, Yongjun Park
Distributing workloads that cannot be handled by a single edge device across multiple edge devices is a promising solution that minimizes the inference latency of deep learning applications by exploiting model parallelism. Several prior solutions have been proposed to partition target models efficiently, but most studies have focused on finding the optimal fused-layer configurations, which minimize the data-transfer overhead between layers. However, as recent deep learning models have become more complex and the ability to deploy them quickly has become a key challenge, searching for the best fused-layer configurations of target models has become a major requirement. To solve this problem, we propose a lightweight model partitioning framework called Legion that finds the optimal fused-layer configurations with minimal profiling trials. By finding the optimal configurations using cost-matrix construction and wild-card selection, Legion achieves performance similar to a full configuration search at a fraction of the search time. Moreover, Legion performs effectively even on a group of heterogeneous target devices by introducing per-device cost-matrix construction. With three popular networks, Legion shows only a 3.4% performance loss compared to a full searching scheme (FSS) on various device configurations consisting of up to six heterogeneous devices, and reduces the profiling overhead by 48.7× on average.
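As a rough sketch of selecting a configuration from a cost matrix: each candidate assigns fused-layer groups to devices, and the cheapest candidate under measured per-device costs plus transfer overhead wins. The costs, groups, and latency model below are invented; Legion additionally avoids profiling every matrix cell through wild-card selection.

```python
# cost[d][g] = measured latency (ms) of layer group g on device d (invented values)
cost = {
    "dev_fast": {"g0": 4.0, "g1": 6.0, "g2": 5.0},
    "dev_slow": {"g0": 9.0, "g1": 14.0, "g2": 11.0},
}
transfer_ms = 1.5  # assumed per-boundary activation transfer cost

def total_latency(assignment):
    """assignment: ordered list of (device, group) stages."""
    stages = sum(cost[d][g] for d, g in assignment)
    hops = sum(transfer_ms
               for (d1, _), (d2, _) in zip(assignment, assignment[1:])
               if d1 != d2)                      # pay only when devices change
    return stages + hops

candidates = [
    [("dev_fast", "g0"), ("dev_fast", "g1"), ("dev_slow", "g2")],
    [("dev_fast", "g0"), ("dev_slow", "g1"), ("dev_fast", "g2")],
]
print(min(candidates, key=total_latency))
```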
{"title":"Legion: Tailoring Grouped Neural Execution Considering Heterogeneity on Multiple Edge Devices","authors":"Kyunghwan Choi, Seongju Lee, Beom Woo Kang, Yongjun Park","doi":"10.1109/ICCD53106.2021.00067","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00067","url":null,"abstract":"Distributing workloads that cannot be handled by a single edge device across multiple edge devices is a promising solution that minimizes the inference latency of deep learning applications by exploiting model parallelism. Several prior solutions have been proposed to partition target models efficiently, but most studies have focused on finding the optimal fused layer configurations, which minimize the data-transfer overhead between layers. However, as recent deep learning network models have become more complex and the ability to deploy them quickly has become a key challenge, the search for the best fused layer configurations of target models has become a major requirement. To solve this problem, we propose a lightweight model partitioning framework called Legion to find the optimal fused layer configurations with minimal profiling execution trials. By finding the optimal configurations using cost matrix construction and wild card selection, the experimental results showed that Legion achieved a similar performance to the full configuration search at a fraction of the search time. Moreover, Legion performed effectively even on a group of heterogeneous target devices by introducing a per-device cost-related matrix construction. With three popular networks, Legion shows only 3.4% performance loss as compared to a full searching scheme (FSS), on various different device configurations consisting of up to six heterogeneous devices, and minimizes the profiling overhead by 48.7× on average.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126043129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}