基于制造后可重构架构平台的自动指令融合提高嵌入式系统效率

9th International Symposium on Quality Electronic Design (isqed 2008) Pub Date : 2008-03-17 DOI:10.1109/ISQED.2008.85

A. Cheng

{"title":"基于制造后可重构架构平台的自动指令融合提高嵌入式系统效率","authors":"A. Cheng","doi":"10.1109/ISQED.2008.85","DOIUrl":null,"url":null,"abstract":"Portable embedded SoC processor architects are constantly challenged by exponentially increasing demand for newer functionality, faster real-time communication, stronger security, and higher reliability; while the constraint on energy, feature size, NRE cost, and time-to-market (TTM) grows tighter than ever. Existing approaches attempting to achieve these mutual conflicting design goals rely heavily on adopting special-purpose accelerators (SPA) to take charge of the heavy lifting in the aimed embedded SoC designs. These SPAs, synthesized from either ASIC or FPGA, are usually augmented to the base processor as co-processors to execute the performance-critical regions of applications. ASIC-based SPAs achieve performance-energy efficiency at the expense of sacrificing post-manufacturing programmability while incurring large NRE and TTM; FPGA-based SPAs retain programmability at the expense of significant energy and area increase. Furthermore, augmenting these SPAs as co-processors adds considerable communication and synchronization overhead severely compromising their initially promised benefits. This paper proposes an innovative design paradigm that moves away from the common scheme of adding co-processing ASIC/FPGA SPAs to an integrated and reconfigurable design. Specifically, we propose a new class of embedded processor by replacing the processor's conventional ALU with a more powerful and flexible versatile processing unit (VPU). VPU enables multiple interdependent instructions to be fused and processed together as a single atomic VPU instruction by exploring dataflow dependencies of the application code. The instruction fusion is automatically performed by a VPU-aware compiler. The optimized VPU code reduces code size and amplifies the effective processor bandwidth and capacity by eliminating transient computation and register spill code. Experimental results show up to 400% and average 150% speedup for MediaBench with negligible area increase.","PeriodicalId":243121,"journal":{"name":"9th International Symposium on Quality Electronic Design (isqed 2008)","volume":"179 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2008-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Amplifying Embedded System Efficiency via Automatic Instruction Fusion on a Post-Manufacturing Reconfigurable Architecture Platform\",\"authors\":\"A. Cheng\",\"doi\":\"10.1109/ISQED.2008.85\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Portable embedded SoC processor architects are constantly challenged by exponentially increasing demand for newer functionality, faster real-time communication, stronger security, and higher reliability; while the constraint on energy, feature size, NRE cost, and time-to-market (TTM) grows tighter than ever. Existing approaches attempting to achieve these mutual conflicting design goals rely heavily on adopting special-purpose accelerators (SPA) to take charge of the heavy lifting in the aimed embedded SoC designs. These SPAs, synthesized from either ASIC or FPGA, are usually augmented to the base processor as co-processors to execute the performance-critical regions of applications. ASIC-based SPAs achieve performance-energy efficiency at the expense of sacrificing post-manufacturing programmability while incurring large NRE and TTM; FPGA-based SPAs retain programmability at the expense of significant energy and area increase. Furthermore, augmenting these SPAs as co-processors adds considerable communication and synchronization overhead severely compromising their initially promised benefits. This paper proposes an innovative design paradigm that moves away from the common scheme of adding co-processing ASIC/FPGA SPAs to an integrated and reconfigurable design. Specifically, we propose a new class of embedded processor by replacing the processor's conventional ALU with a more powerful and flexible versatile processing unit (VPU). VPU enables multiple interdependent instructions to be fused and processed together as a single atomic VPU instruction by exploring dataflow dependencies of the application code. The instruction fusion is automatically performed by a VPU-aware compiler. The optimized VPU code reduces code size and amplifies the effective processor bandwidth and capacity by eliminating transient computation and register spill code. Experimental results show up to 400% and average 150% speedup for MediaBench with negligible area increase.\",\"PeriodicalId\":243121,\"journal\":{\"name\":\"9th International Symposium on Quality Electronic Design (isqed 2008)\",\"volume\":\"179 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2008-03-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"9th International Symposium on Quality Electronic Design (isqed 2008)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISQED.2008.85\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"9th International Symposium on Quality Electronic Design (isqed 2008)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISQED.2008.85","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

摘要

便携式嵌入式SoC处理器架构不断受到更新功能，更快的实时通信，更强的安全性和更高可靠性需求的指数增长的挑战;而对能源、特征尺寸、NRE成本和上市时间(TTM)的限制比以往任何时候都要严格。试图实现这些相互冲突的设计目标的现有方法严重依赖于采用专用加速器(SPA)来承担目标嵌入式SoC设计中的繁重工作。这些spa由ASIC或FPGA合成，通常作为协处理器扩展到基本处理器，以执行应用程序的性能关键区域。基于asic的spa以牺牲制造后可编程性为代价实现了性能-能源效率，同时产生了大量的NRE和TTM;基于fpga的spa以显著的能量和面积增加为代价保持可编程性。此外，将这些spa扩展为协处理器会增加相当大的通信和同步开销，严重损害它们最初承诺的好处。本文提出了一种创新的设计范例，该范例从添加协同处理ASIC/FPGA spa的常见方案转移到集成和可重构设计。具体来说，我们提出了一种新型嵌入式处理器，用更强大、更灵活的多功能处理单元(VPU)取代处理器的传统ALU。VPU通过探索应用程序代码的数据流依赖关系，使多个相互依赖的指令能够融合并作为单个原子VPU指令一起处理。指令融合由vpu感知编译器自动完成。优化后的VPU代码通过消除瞬态计算和寄存器溢出代码，减小了代码大小，提高了有效的处理器带宽和容量。实验结果表明，mediabbench的加速速度最高可达400%，平均可达150%，而面积增加可以忽略不计。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Amplifying Embedded System Efficiency via Automatic Instruction Fusion on a Post-Manufacturing Reconfigurable Architecture Platform

Portable embedded SoC processor architects are constantly challenged by exponentially increasing demand for newer functionality, faster real-time communication, stronger security, and higher reliability; while the constraint on energy, feature size, NRE cost, and time-to-market (TTM) grows tighter than ever. Existing approaches attempting to achieve these mutual conflicting design goals rely heavily on adopting special-purpose accelerators (SPA) to take charge of the heavy lifting in the aimed embedded SoC designs. These SPAs, synthesized from either ASIC or FPGA, are usually augmented to the base processor as co-processors to execute the performance-critical regions of applications. ASIC-based SPAs achieve performance-energy efficiency at the expense of sacrificing post-manufacturing programmability while incurring large NRE and TTM; FPGA-based SPAs retain programmability at the expense of significant energy and area increase. Furthermore, augmenting these SPAs as co-processors adds considerable communication and synchronization overhead severely compromising their initially promised benefits. This paper proposes an innovative design paradigm that moves away from the common scheme of adding co-processing ASIC/FPGA SPAs to an integrated and reconfigurable design. Specifically, we propose a new class of embedded processor by replacing the processor's conventional ALU with a more powerful and flexible versatile processing unit (VPU). VPU enables multiple interdependent instructions to be fused and processed together as a single atomic VPU instruction by exploring dataflow dependencies of the application code. The instruction fusion is automatically performed by a VPU-aware compiler. The optimized VPU code reduces code size and amplifies the effective processor bandwidth and capacity by eliminating transient computation and register spill code. Experimental results show up to 400% and average 150% speedup for MediaBench with negligible area increase.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

9th International Symposium on Quality Electronic Design (isqed 2008)

自引率

0.00%

发文量

期刊最新文献

A Low Energy Two-Step Successive Approximation Algorithm for ADC Design Robust Analog Design for Automotive Applications by Design Centering with Safe Operating Areas Characterization of Standard Cells for Intra-Cell Mismatch Variations Noise Interaction Between Power Distribution Grids and Substrate Error-Tolerant SRAM Design for Ultra-Low Power Standby Operation