International Conference on Hardware/Software Codesign and System Synthesis: Latest Publications

Reliable performance analysis of a multicore multithreaded system-on-chip
Pub Date : 2008-10-19 DOI: 10.1145/1450135.1450172
S. Schliecker, Mircea Negrean, G. Nicolescu, P. Paulin, R. Ernst
Formal performance analysis is now regularly applied in the design of distributed embedded systems such as automotive electronics, where it greatly contributes to improved predictability and platform robustness of complex networked systems. Even though it might be highly beneficial in MpSoC design as well, formal performance analysis could not easily be applied so far, because the classical task communication model does not cover processor-memory traffic, which is an integral part of MpSoC timing. Introducing memory accesses as individual transactions under the classical model has been shown to be inefficient, and previous approaches work well only under strict orthogonalization of different traffic streams. Recent research has presented extensions of the classical task model and a corresponding analysis that covers the performance implications of shared memory traffic. In this paper we present a multithreaded multiprocessor platform and a multimedia application. We conduct performance analysis using the new analysis options and specifically benchmark the quality of the available approach. Our experiments show that corner-case coverage can now be supplied with very high accuracy, allowing architectural alternatives to be investigated quickly.
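The abstract refers to folding processor-memory traffic into formal timing analysis. As an illustrative sketch only (not the authors' actual analysis), a classic fixed-priority busy-window response-time computation can be extended by inflating each task's worst-case execution time with its worst-case number of shared-memory transactions times a worst-case memory latency; all task parameters here are hypothetical.

```python
def wcrt(tasks, mem_latency):
    """Busy-window response-time analysis with memory traffic folded in.
    tasks: list of (period, wcet, mem_accesses), highest priority first.
    Returns a worst-case response time per task, or None on deadline miss
    (deadline assumed equal to period)."""
    results = []
    for i, (p_i, c_i, m_i) in enumerate(tasks):
        c_i += m_i * mem_latency            # inflate WCET by memory transactions
        r = c_i
        while True:
            # interference from all higher-priority tasks over the window r
            interference = sum(
                -(-r // p_j) * (c_j + m_j * mem_latency)   # ceil(r / p_j)
                for p_j, c_j, m_j in tasks[:i]
            )
            r_next = c_i + interference
            if r_next == r:
                results.append(r)
                break
            if r_next > p_i:                # window exceeds the deadline
                results.append(None)
                break
            r = r_next
    return results
```

With two hypothetical tasks, `wcrt([(10, 2, 1), (20, 4, 2)], 1)` converges to response times 3 and 9: the low-priority task pays both its own memory latency and that of the preempting task.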
Citations: 45
Don't forget memories: a case study redesigning a pattern counting ASIC circuit for FPGAs
Pub Date : 2008-10-19 DOI: 10.1145/1450135.1450171
David Sheldon, F. Vahid
Modern embedded compute platforms increasingly contain both microprocessors and field-programmable gate arrays (FPGAs). The FPGAs may implement accelerators or other circuits to speed up performance. Many such circuits have previously been designed for acceleration via application-specific integrated circuits (ASICs). Redesigning an ASIC circuit for FPGA implementation involves several challenges. We describe a case study that highlights a common challenge related to memories. The study involves converting a pattern counting circuit architecture, based on a pipelined binary tree and originally designed for ASIC implementation, into a circuit suitable for FPGAs. The original ASIC-oriented circuit, when mapped to a Spartan 3e FPGA, could process 10 million patterns per second and handle up to 4,096 patterns. The redesigned circuit could instead process 100 million patterns per second and handle up to 32,768 patterns, representing a 10x performance improvement and a 4x utilization improvement. The redesign involved partitioning large memories into smaller ones at the expense of redundant control logic. Through this and other case studies, design patterns may emerge that aid designers in redesigning ASIC circuits for FPGAs as well as in building new high-performance and efficient circuits for FPGAs.
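The partitioning step described above splits one large memory across several smaller block RAMs. A minimal sketch of the addressing idea (hypothetical, not the paper's actual circuit): use the low address bits to select a bank and the remaining bits as the local address, so consecutive table entries land in different banks and independent lookups can proceed in parallel.

```python
def bank_map(addr, n_banks):
    """Map a global table address to (bank, local_address).
    Low bits select the bank; n_banks must be a power of two, matching
    how a wide memory is typically striped across FPGA block RAMs."""
    assert n_banks > 0 and n_banks & (n_banks - 1) == 0, "power of two"
    bank_bits = n_banks.bit_length() - 1
    return addr & (n_banks - 1), addr >> bank_bits
```

For example, with four banks, addresses 4, 5, 6, 7 map to banks 0 through 3 at the same local address, which is what lets four lookups happen in one cycle.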
Citations: 6
Hardware/software partitioning of floating point software applications to fixed-pointed coprocessor circuits
Pub Date : 2008-10-19 DOI: 10.1145/1450135.1450148
L. Saldanha, Roman L. Lysecky
While hardware/software partitioning has been shown to provide significant performance gains, most hardware/software partitioning approaches are limited to partitioning computational kernels utilizing integer or fixed point implementations. Software developers often initially develop an application using built-in floating point representations and later convert the application to a fixed point representation - a potentially time-consuming process. In this paper, we present a hardware/software partitioning approach for floating point applications that eliminates the need for developers to rewrite software applications for fixed point implementations. Instead, the proposed approach incorporates efficient, configurable floating point to fixed point and fixed point to floating point hardware converters at the boundary between the hardware coprocessors and memory. This effectively separates the system into a floating point domain consisting of the microprocessor and memory subsystem and a fixed point domain consisting of the partitioned hardware coprocessors, thereby providing an efficient and rapid method for implementing fixed point hardware coprocessors.
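The converters at the domain boundary perform a float-to-fixed mapping. A software sketch of the standard Q-format conversion they would implement (the format parameters here are illustrative assumptions, not the paper's configuration): scale by 2^frac_bits, round, and saturate to the signed integer range.

```python
def float_to_fixed(x, frac_bits=16, total_bits=32):
    """Convert a float to a signed two's-complement fixed-point value
    (Q format), saturating at the representable range."""
    lo = -(1 << total_bits - 1)
    hi = (1 << total_bits - 1) - 1
    v = int(round(x * (1 << frac_bits)))
    return max(lo, min(hi, v))

def fixed_to_float(v, frac_bits=16):
    """Inverse conversion, as the boundary hardware would apply on the
    way back from the fixed point domain to the floating point domain."""
    return v / (1 << frac_bits)
```

Values exactly representable in the chosen format round-trip losslessly (e.g. 1.5 in Q16), while out-of-range inputs saturate rather than wrap, which is the usual choice for signal-processing coprocessors.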
Citations: 1
Speculative DMA for architecturally visible storage in instruction set extensions
Pub Date : 2008-10-19 DOI: 10.1145/1450135.1450191
Theo Kluter, P. Brisk, P. Ienne, E. Charbon
Instruction set extensions (ISEs) can accelerate embedded processor performance. Many algorithms for ISE generation have shown good potential; some of them have recently been expanded to include Architecturally Visible Storage (AVS) - compiler-controlled memories, similar to scratchpads, that are accessible only to ISEs. To achieve a speedup using AVS, Direct Memory Access (DMA) transfers are required to move data from the main memory to the AVS; unfortunately, this creates coherence problems between the AVS and the cache, which previous methods for ISEs with AVS failed to address; additionally, these methods need to leave many conservative DMA transfers in place, whose execution significantly limits the achievable speedup. This paper presents a memory coherence scheme for ISEs with AVS, which can ensure execution correctness and memory consistency with minimal area overhead. We also present a method that speculatively removes redundant DMA transfers. Cycle-accurate experimental results were obtained using an FPGA-emulation platform. These results show that the application-specific instruction-set extended processors with speculative DMA-enhanced AVS gain significantly over previous techniques, despite the overhead of the coherence mechanism.
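The redundant-DMA problem above can be illustrated with a toy bookkeeping model (a hypothetical sketch, not the paper's hardware mechanism): a transfer into the AVS is elided when the AVS already holds a valid copy of the requested region, and any CPU write into that region invalidates the copy so correctness is preserved.

```python
class AVSManager:
    """Toy model of eliding redundant DMA transfers into an
    architecturally visible storage (AVS)."""

    def __init__(self):
        self.valid_region = None    # (base, size) currently mirrored in AVS
        self.dma_count = 0

    def cpu_write(self, addr):
        """A processor write into the mirrored region dirties it,
        so the AVS copy must be invalidated (the coherence problem)."""
        if self.valid_region:
            base, size = self.valid_region
            if base <= addr < base + size:
                self.valid_region = None

    def dma_in(self, base, size):
        """Transfer a region into the AVS before the ISE runs,
        skipping the transfer when a valid copy is already present."""
        if self.valid_region == (base, size):
            return                  # redundant transfer elided
        self.dma_count += 1
        self.valid_region = (base, size)
```

Two back-to-back ISE invocations on the same region then cost one DMA instead of two, while an intervening CPU write forces the transfer to be re-issued.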
Citations: 18
Symbolic voter placement for dependability-aware system synthesis
Pub Date : 2008-10-19 DOI: 10.1145/1450135.1450190
Felix Reimann, M. Glaß, M. Lukasiewycz, J. Keinert, C. Haubelt, J. Teich
This paper presents a system synthesis approach for dependable embedded systems. The proposed approach significantly extends previous work by automatically inserting fault detection and fault-tolerance mechanisms into an implementation. The main contributions of this paper are 1) a dependability-aware system synthesis approach that automatically performs redundant task binding and placement of voting structures to increase reliability and safety, respectively, 2) an efficient dependability analysis approach to evaluate lifetime reliability and safety, and 3) results from synthesizing a Motion-JPEG decoder for an FPGA platform using the proposed system synthesis approach. As a result, a set of high-quality solutions of the decoder with maximized reliability, safety, and performance, and simultaneously minimized resource requirements, is achieved.
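The voting structures placed by the synthesis flow implement classic majority voting over redundant task replicas. A minimal sketch of the voter logic itself (the N-replica generalization is an assumption; the paper's voters are hardware structures placed symbolically, not software):

```python
from collections import Counter

def majority_vote(replicas):
    """N-modular-redundancy voter: return the value produced by a strict
    majority of the redundant replicas, or None when no majority exists
    (which a safety mechanism would report as a detected fault)."""
    value, count = Counter(replicas).most_common(1)[0]
    return value if count > len(replicas) // 2 else None
```

With triple modular redundancy, one faulty replica is masked (`[1, 1, 2]` votes to `1`), while three disagreeing replicas yield no majority and the fault is detected rather than silently propagated.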
Citations: 27
Application specific non-volatile primary memory for embedded systems
Pub Date : 2008-10-19 DOI: 10.1145/1450135.1450144
Kwangyoon Lee, A. Orailoglu
Memory subsystems are considered among the most critical components in embedded systems and, furthermore, display increasing complexity as application requirements diversify. Modern embedded systems are generally equipped with multiple heterogeneous memory devices to satisfy diverse requirements and constraints. NAND flash memory has been widely adopted for data storage because of its outstanding benefits in cost, power, capacity and non-volatility. However, in NAND flash memory, the intrinsic costs of read and write accesses are highly disproportionate in performance and access granularity. The consequent data management complexity and performance deterioration have precluded the adoption of NAND flash memory. In this paper, we introduce a highly effective non-volatile primary memory architecture which incorporates application-specific information to develop a NAND flash based primary memory. The proposed architecture provides a unified non-volatile primary memory solution which relieves design complications caused by the growing complexity of memory subsystems. Our architecture aggressively minimizes the overhead and redundancy of NAND-based systems by exploiting efficient address space management and dynamic data migration based on accurate application behavioral analysis. We also propose a highly parallelized memory architecture through active and dynamic data redistribution over multiple flash memories based on run-time workload analysis. The experimental results show that our proposed architecture significantly enhances average memory access cycle time, which is comparable to the standard DRAM access cycle time, and also considerably prolongs the device life-cycle by autonomous wear-leveling and minimizing program/erase operations.
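The wear-leveling mentioned in the abstract can be sketched with a toy block-mapped translation layer (a hypothetical illustration, not the paper's architecture): each logical-block rewrite is redirected to the free physical block with the lowest erase count, spreading program/erase cycles evenly across the device.

```python
class WearLevelFTL:
    """Toy flash translation layer with greedy wear leveling."""

    def __init__(self, n_blocks):
        self.erase_count = [0] * n_blocks
        self.free = set(range(n_blocks))
        self.l2p = {}                       # logical block -> physical block

    def write(self, lblock):
        """Write a logical block: erase/recycle its stale physical block,
        then allocate the least-worn free block. Returns the target block."""
        old = self.l2p.get(lblock)
        if old is not None:
            self.erase_count[old] += 1      # stale copy must be erased
            self.free.add(old)
        target = min(self.free, key=lambda b: self.erase_count[b])
        self.free.remove(target)
        self.l2p[lblock] = target
        return target
```

Repeatedly rewriting the same logical block then ping-pongs between physical blocks instead of wearing out a single one, which is the mechanism behind the prolonged device life-cycle claimed above.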
Citations: 26
Distributed flit-buffer flow control for networks-on-chip
Pub Date : 2008-10-19 DOI: 10.1145/1450135.1450183
Nicola Concer, M. Petracca, L. Carloni
The combination of flit-buffer flow control methods and latency-insensitive protocols is an effective solution for networks-on-chip (NoC). Since they both rely on backpressure, the two techniques are easy to combine while offering complementary advantages: low complexity of router design and the ability to cope with long communication channels via automatic wire pipelining. We study various alternative implementations of this idea by considering the combination of three different types of flit-buffer flow control methods and two different classes of channel repeaters (based respectively on flip-flops and relay stations). We characterize the area and performance of the two most promising alternative implementations for NoCs by completing the RTL design and logic synthesis of the repeaters and routers for different channel parallelisms. Finally, we derive high-level abstractions of our circuit designs and use them to perform system-level simulations under various scenarios for two distinct NoC topologies and various applications. Based on our comparative analysis and experimental results, we propose a NoC design approach that combines the reduction of the router queues to a minimum size with the distribution of flit buffering onto the channels. This approach provides precious flexibility during the physical design phase for many NoCs, particularly in those systems-on-chip that must be designed to meet a tight constraint on the target clock frequency.
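The backpressure shared by both techniques can be modeled as credit-based flow control. A minimal behavioral sketch (illustrative only, not the paper's RTL): the sender holds one credit per downstream buffer slot, spends a credit per flit sent, and regains it when the receiver drains a flit, so no flit is ever dropped.

```python
class CreditChannel:
    """Toy credit-based flit channel: credits mirror free buffer slots."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots
        self.fifo = []

    def send(self, flit):
        """Send a flit if a credit is available; False means the sender
        stalls (backpressure) instead of losing the flit."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.fifo.append(flit)
        return True

    def drain(self):
        """Receiver consumes a flit and returns a credit upstream."""
        flit = self.fifo.pop(0)
        self.credits += 1
        return flit
```

With two buffer slots, the third back-to-back send stalls until the receiver drains; shrinking router queues while buffering flits along the channel, as proposed above, amounts to moving these slots out of the router.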
Citations: 22
LOCS: a low overhead profiler-driven design flow for security of MPSoCs
Pub Date : 2008-10-19 DOI: 10.1145/1450135.1450154
K. Patel, S. Parameswaran
Security is a growing concern in processor-based systems and hence requires immediate attention. New paradigms in the design of MPSoCs must be found, with security as one of the primary objectives. Software attacks like code injection attacks exploit vulnerabilities in "trusted" code. Previous countermeasures addressing code injection attacks in MPSoCs have significant performance overheads and do not check every single line of code. The work described in this paper reduces the performance overhead and ensures that all the lines in the program code are checked. We propose an MPSoC system where one processor (which we call a MONITOR processor) is responsible for supervising all other application processors. Our design flow, LOCS, instruments and profiles the execution of basic blocks in the program. LOCS subsequently uses the profiler output to re-instrument the source files to minimize runtime overheads. LOCS also aids in the design of hardware customizations required by the MONITOR. At runtime, the MONITOR checks the validity of the control flow transitions and the execution time of basic blocks. We implemented our system on a commercial extensible processor, Xtensa LX2, and tested it on three multimedia benchmarks. The experiments show that our system has a worst-case performance degradation of about 24% and an area overhead of approximately 40%. LOCS has smaller performance, area and code size overheads than all previous code injection countermeasures for MPSoCs.
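The MONITOR's control-flow check can be sketched as a lookup of each observed basic-block transition against the statically extracted edge set (a software illustration of the idea only; the paper's monitor is a dedicated processor with hardware support, and the block names here are hypothetical):

```python
def check_trace(trace, valid_edges):
    """Toy control-flow monitor: every observed basic-block transition
    must appear in the statically known edge set; any unknown edge is
    treated as a potential code injection attack."""
    return all((src, dst) in valid_edges
               for src, dst in zip(trace, trace[1:]))
```

A trace that follows the program's control-flow graph passes, while a hijacked jump (an edge never present in the compiled program) is flagged immediately.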
Citations: 2
A time-predictable system initialization design for huge-capacity flash-memory storage systems
Pub Date : 2008-10-19 DOI: 10.1145/1450135.1450140
Chin-Hsien Wu
The capacity of flash-memory storage systems grows at a speed similar to many other storage systems. In order to properly manage the product cost, vendors face serious challenges in system designs. How to provide an expected system initialization time for huge-capacity flash-memory storage systems has become an important research topic. In this paper, a time-predictable system initialization design is proposed for huge-capacity flash-memory storage systems. The objective of the design is to provide an expected system initialization time based on a coarse-grained flash translation layer. The time-predictable analysis of the design is provided to discuss the relation between the size of main memory and the system initialization time. The system initialization time can be also estimated and predicted by the time-predictable analysis.
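One plausible reading of why a coarse-grained (block-level) translation layer yields a predictable initialization time: on mount, the mapping table can be rebuilt with one spare-area read per block rather than one read per page. The sketch below is a back-of-the-envelope cost model under that assumption; `PAGE_READ_US` and `PAGES_PER_BLOCK` are invented parameters, not figures from the paper:

```python
# Mount-time scan cost model (illustrative assumption, not the paper's
# analysis): a page-level FTL must touch every page, while a block-level
# (coarse-grained) FTL needs only one read per block.
PAGE_READ_US = 25.0       # time to read one page/spare area, in microseconds
PAGES_PER_BLOCK = 64      # pages grouped into one erase block

def init_time_ms(total_pages, coarse_grained):
    """Estimated mount-time scan cost in milliseconds."""
    if coarse_grained:
        reads = total_pages // PAGES_PER_BLOCK  # one spare-area read per block
    else:
        reads = total_pages                     # page-level: scan everything
    return reads * PAGE_READ_US / 1000.0

# A 32 GiB device with 2 KiB pages -> 16M pages.
pages = 16 * 1024 * 1024
print(init_time_ms(pages, coarse_grained=False))  # 419430.4 ms (~7 minutes)
print(init_time_ms(pages, coarse_grained=True))   # 6553.6 ms (~6.5 seconds)
```

Because the read count is a fixed function of device geometry rather than of the stored data, the mount time can be bounded in advance, which is the sense in which initialization becomes "time-predictable"; the trade-off is that a coarser mapping also needs a smaller in-memory table, tying main-memory size to initialization time.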
{"title":"A time-predictable system initialization design for huge-capacity flash-memory storage systems","authors":"Chin-Hsien Wu","doi":"10.1145/1450135.1450140","DOIUrl":"https://doi.org/10.1145/1450135.1450140","url":null,"abstract":"The capacity of flash-memory storage systems grows at a speed similar to many other storage systems. In order to properly manage the product cost, vendors face serious challenges in system designs. How to provide an expected system initialization time for huge-capacity flash-memory storage systems has become an important research topic. In this paper, a time-predictable system initialization design is proposed for huge-capacity flash-memory storage systems. The objective of the design is to provide an expected system initialization time based on a coarse-grained flash translation layer. The time-predictable analysis of the design is provided to discuss the relation between the size of main memory and the system initialization time. The system initialization time can be also estimated and predicted by the time-predictable analysis.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"10 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131943882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Holistic design and caching in mobile computing
Pub Date : 2008-10-19 DOI: 10.1145/1450135.1450161
Mwaffaq Otoom, J. M. Paul
We utilize application trends analysis, focused on webpage content, in order to examine the design of mobile computers more holistically. We find that both Internet bandwidth and processing local to the computing device is being wasted by re-transmission of formatting data. By taking this broader view, and separating Macromedia Flash content into raw data and its packaging, we show that performance can be increased by 84%, power consumption can be decreased by 71%, and communications bandwidth can be saved by an order of magnitude.
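The separation of Flash content into raw data and its packaging can be illustrated with a toy device-side cache that stores the (largely static) packaging by content hash and re-transmits only the raw data on later visits. The class name, payload sizes, and hashing choice are invented for illustration, not taken from the paper:

```python
import hashlib

class PackagingCache:
    """Toy model: cache formatting/packaging bytes on the device so only
    the small dynamic payload crosses the network on repeat fetches."""

    def __init__(self):
        self.store = {}        # packaging hash -> cached packaging bytes
        self.bytes_sent = 0    # simulated network traffic

    def fetch(self, packaging: bytes, raw_data: bytes) -> bytes:
        key = hashlib.sha256(packaging).hexdigest()
        if key not in self.store:
            self.store[key] = packaging
            self.bytes_sent += len(packaging)  # first visit: send packaging too
        self.bytes_sent += len(raw_data)       # raw data is always sent
        return self.store[key] + raw_data      # reassembled content

cache = PackagingCache()
packaging = b"X" * 100_000        # e.g. a Flash player/template shell
cache.fetch(packaging, b"score: 3-2")
cache.fetch(packaging, b"score: 4-2")
print(cache.bytes_sent)           # 100020: packaging sent once, data twice
```

After the first visit, traffic drops from the size of the full packaged page to the size of the raw update, which is the order-of-magnitude bandwidth saving the abstract describes.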
{"title":"Holistic design and caching in mobile computing","authors":"Mwaffaq Otoom, J. M. Paul","doi":"10.1145/1450135.1450161","DOIUrl":"https://doi.org/10.1145/1450135.1450161","url":null,"abstract":"We utilize application trends analysis, focused on webpage content, in order to examine the design of mobile computers more holistically. We find that both Internet bandwidth and processing local to the computing device is being wasted by re-transmission of formatting data. By taking this broader view, and separating Macromedia Flash content into raw data and its packaging, we show that performance can be increased by 84%, power consumption can be decreased by 71%, and communications bandwidth can be saved by an order of magnitude.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126638163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6