首页 > 最新文献

International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.最新文献

英文 中文
Fast cosimulation of transformative systems with OS support on SMP computer SMP计算机上支持操作系统的转换系统快速协同仿真
Zhengting He, A. Mok
Transformative applications are a class of dataflow computation characterized by iterative behavior. The problem of partitioning a transformative application specification to a set of available hardware (HW) and software (SW) processing elements (PEs) and derivation of a job execution order (scheduling) on them has been quite well studied, but the problem of obtaining fast simulation of these applications poses different constraints. In this paper, we propose an efficient framework for a symmetric multi-processor (SMP) simulation host to achieve fast HW/SW co-simulation for transformative applications, given the partition solutions and the derived schedulers. The framework overcomes the limitations in existing Linux SMP kernel and requires only a reasonable amount of modifications to it. We also present a heuristic algorithm which effectively assigns simulation tasks to the processors on the simulation host, considering both average job simulation time on each processor and other simulation overhead. Our experiments show that the algorithm is able to find satisfactory suboptimal solutions with very little computation time. Based on the task assignment solution, the simulation time can be reduced by 25% to 50% from the obvious but naive approach.
转换应用程序是一类以迭代行为为特征的数据流计算。将转换应用程序规范划分为一组可用的硬件(HW)和软件(SW)处理元素(pe)以及在它们上派生作业执行顺序(调度)的问题已经得到了很好的研究,但是获得这些应用程序的快速模拟的问题提出了不同的约束。在本文中,我们提出了一个对称多处理器(SMP)仿真主机的有效框架,以实现快速的硬件/软件协同仿真,用于变革性应用,给出了分区解决方案和派生的调度程序。该框架克服了现有Linux SMP内核的限制,只需要对其进行合理的修改。我们还提出了一种启发式算法,该算法可以有效地将仿真任务分配给仿真主机上的处理器,同时考虑每个处理器上的平均作业仿真时间和其他仿真开销。实验结果表明,该算法能够在很小的计算时间内找到满意的次优解。基于任务分配方案,仿真时间比明显但幼稚的方法减少了25%到50%。
{"title":"Fast cosimulation of transformative systems with OS support on SMP computer","authors":"Zhengting He, A. Mok","doi":"10.1145/1016720.1016761","DOIUrl":"https://doi.org/10.1145/1016720.1016761","url":null,"abstract":"Transformative applications are a class of dataflow computation characterized by iterative behavior. The problem of partitioning a transformative application specification to a set of available hardware (HW) and software (SW) processing elements (PEs) and derivation of a job execution order (scheduling) on them has been quite well studied, but the problem of obtaining fast simulation of these applications poses different constraints. In this paper, we propose an efficient framework for a symmetric multi-processor (SMP) simulation host to achieve fast HW/SW co-simulation for transformative applications, given the partition solutions and the derived schedulers. The framework overcomes the limitations in existing Linux SMP kernel and requires only a reasonable amount of modifications to it. We also present a heuristic algorithm which effectively assigns simulation tasks to the processors on the simulation host, considering both average job simulation time on each processor and other simulation overhead. Our experiments show that the algorithm is able to find satisfactory suboptimal solutions with very little computation time. Based on the task assignment solution, the simulation time can be reduced by 25% to 50% from the obvious but naive approach.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123328766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Efficient mapping of hierarchical trees on coarse-grain reconfigurable architectures 层次树在粗粒度可重构体系结构上的有效映射
F. Rivera, M. Sanchez-Elez, M. Fernandez, R. Hermida, N. Bagherzadeh
Reconfigurable architectures have become increasingly important in years. We present an approach to the problem of executing 3D graphics interactive applications onto these architectures. The hierarchical trees are usually implemented to reduce the data processed, thereby diminishing the execution time. We have developed a mapping scheme that parallelizes the tree execution onto a SIMD reconfigurable architecture. This mapping scheme considerably reduces the time penalty caused by the possibility of executing different tree nodes in SIMD fashion. We have developed a technique that achieves an efficient hierarchical tree execution taking decisions at execution time. It also promotes the possibility of data coherence in order to reduce the execution time. The experimental results show high performance and efficient resource utilization on tested applications.
近年来,可重构架构变得越来越重要。我们提出了一种解决在这些架构上执行3D图形交互应用程序的方法。通常实现分层树是为了减少处理的数据,从而减少执行时间。我们已经开发了一个映射方案,将树的执行并行化到SIMD可重构架构上。这种映射方案大大减少了以SIMD方式执行不同树节点的可能性所造成的时间损失。我们已经开发了一种技术,可以在执行时做出决策,从而实现高效的分层树执行。它还提高了数据一致性的可能性,以减少执行时间。实验结果表明,该方法在测试应用中具有较高的性能和资源利用率。
{"title":"Efficient mapping of hierarchical trees on coarse-grain reconfigurable architectures","authors":"F. Rivera, M. Sanchez-Elez, M. Fernandez, R. Hermida, N. Bagherzadeh","doi":"10.1145/1016720.1016731","DOIUrl":"https://doi.org/10.1145/1016720.1016731","url":null,"abstract":"Reconfigurable architectures have become increasingly important in years. We present an approach to the problem of executing 3D graphics interactive applications onto these architectures. The hierarchical trees are usually implemented to reduce the data processed, thereby diminishing the execution time. We have developed a mapping scheme that parallelizes the tree execution onto a SIMD reconfigurable architecture. This mapping scheme considerably reduces the time penalty caused by the possibility of executing different tree nodes in SIMD fashion. We have developed a technique that achieves an efficient hierarchical tree execution taking decisions at execution time. It also promotes the possibility of data coherence in order to reduce the execution time. The experimental results show high performance and efficient resource utilization on tested applications.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114015214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
Detecting overflow detection 检测溢出检测
V. Kotlyar, M. Moudgill
Fixed-point saturating arithmetic is widely used in media and digital signal processing applications. A number of processor architectures provide instructions that implement saturating operations. However, standard high-level languages, such as ANSI C, provide no direct support for saturating arithmetic. Applications written in standard languages have to implement saturating operations in terms of basic two's complement operations. In order to provide fast execution of such programs it is important to have an optimizing compiler automatically detect and convert appropriate code fragments to hardware instructions. We present a set of techniques for automatic recognition of saturating arithmetic operations. We show that in most cases the recognition problem is simply one of Boolean circuit equivalence. Given the expense of solving circuit equivalence, we develop a set of practical approximations based on abstract interpretation. Experiments show that our techniques, while reliably recognizing saturating arithmetic, have small compile-time overhead. We also demonstrate that our approach is not limited to saturating arithmetic, but is directly applicable to recognizing other idioms, such as add-with-carry and absolute value.
定点饱和算法广泛应用于媒体和数字信号处理领域。许多处理器体系结构提供了实现饱和操作的指令。然而,标准的高级语言,如ANSI C,并不直接支持饱和算法。用标准语言编写的应用程序必须根据基本的二进制补码运算来实现饱和运算。为了提供这类程序的快速执行,让优化编译器自动检测并将适当的代码片段转换为硬件指令是很重要的。提出了一套饱和算术运算的自动识别技术。结果表明,在大多数情况下,识别问题只是一个布尔电路等价问题。考虑到求解电路等效的费用,我们开发了一套基于抽象解释的实用近似。实验表明,我们的技术在可靠地识别饱和算法的同时,具有较小的编译时开销。我们还证明了我们的方法不仅限于饱和算术,而且直接适用于识别其他习惯用法,例如加法进位和绝对值。
{"title":"Detecting overflow detection","authors":"V. Kotlyar, M. Moudgill","doi":"10.1145/1016720.1016732","DOIUrl":"https://doi.org/10.1145/1016720.1016732","url":null,"abstract":"Fixed-point saturating arithmetic is widely used in media and digital signal processing applications. A number of processor architectures provide instructions that implement saturating operations. However, standard high-level languages, such as ANSI C, provide no direct support for saturating arithmetic. Applications written in standard languages have to implement saturating operations in terms of basic two's complement operations. In order to provide fast execution of such programs it is important to have an optimizing compiler automatically detect and convert appropriate code fragments to hardware instructions. We present a set of techniques for automatic recognition of saturating arithmetic operations. We show that in most cases the recognition problem is simply one of Boolean circuit equivalence. Given the expense of solving circuit equivalence, we develop a set of practical approximations based on abstract interpretation. Experiments show that our techniques, while reliably recognizing saturating arithmetic, have small compile-time overhead. We also demonstrate that our approach is not limited to saturating arithmetic, but is directly applicable to recognizing other idioms, such as add-with-carry and absolute value.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130265077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 12
Power analysis of system-level on-chip communication architectures 系统级片上通信架构的功耗分析
K. Lahiri, A. Raghunathan
For complex system-on-chips (SoCs) fabricated in nanometer technologies, the system-level on-chip communication architecture is emerging as a significant source of power consumption. Managing and optimizing this important component of SoC power requires a detailed understanding of the characteristics of its power consumption. Various power estimation and low-power design techniques have been proposed for the global interconnects that form part of SoC communication architectures (e.g., low-swing buses, bus encoding, etc). While effective, they only address a limited part of communication architecture power consumption. A state-of-the-art communication architecture, viewed in its entirety, is quite complex, comprising several components, such as bus interfaces, arbiters, bridges, decoders, and multiplexers, in addition to the global bus lines. Relatively little research has focused on analyzing and comparing the power consumed by different components of the communication architecture. In this work, we present a systematic evaluation and analysis of the power consumed by a state-of-the-art communication architecture (the AMBA on-chip bus), using a commercial design flow. We focus on developing a quantitative understanding of the relative contributions of different communication architecture components to its power consumption, and the factors on which they depend. We decompose the communication architecture power into power consumed by logic components (such as arbiters, decoders, bus bridges), global bus lines (that carry address, data, and control information), and bus interfaces. We also perform studies that analyze the impact of varying application traffic characteristics, and varying SoC complexity, on communication architecture power. Based on our analyses, we evaluate different techniques for reducing the power consumed by the on-chip communication architecture, and compare their effectiveness in achieving power savings at the system level. In addition to quantitatively reinforcing the view that on-chip communication is an important target for system-level power optimization, our work demonstrates (i) the importance of considering the communication architecture in its entirety, and (ii) the opportunities that exist for power reduction through careful communication architecture design.
对于采用纳米技术制造的复杂片上系统(soc),系统级片上通信架构正在成为一个重要的功耗来源。管理和优化SoC电源的这一重要组成部分需要详细了解其功耗特征。各种功率估计和低功耗设计技术已经被提出用于构成SoC通信架构一部分的全球互连(例如,低摆幅总线,总线编码等)。虽然有效,但它们只能解决通信体系结构功耗的有限部分。一个最先进的通信体系结构,从整体上看,是相当复杂的,除了全球总线线路外,还包括几个组件,如总线接口、仲裁器、桥接器、解码器和多路复用器。相对较少的研究集中于分析和比较通信体系结构的不同组件所消耗的功率。在这项工作中,我们采用商业设计流程,对最先进的通信架构(AMBA片上总线)的功耗进行了系统的评估和分析。我们的重点是对不同通信架构组件对其功耗的相对贡献以及它们所依赖的因素进行定量理解。我们将通信架构功率分解为逻辑组件(如仲裁器、解码器、总线桥接器)、全局总线线路(携带地址、数据和控制信息)和总线接口所消耗的功率。我们还进行了研究,分析了不同应用程序流量特征和不同SoC复杂性对通信架构功率的影响。基于我们的分析,我们评估了用于降低片上通信架构功耗的不同技术,并比较了它们在实现系统级功耗节约方面的有效性。除了定量强化片上通信是系统级功耗优化的重要目标这一观点外,我们的工作还证明了(i)整体考虑通信架构的重要性,以及(ii)通过仔细的通信架构设计存在降低功耗的机会。
{"title":"Power analysis of system-level on-chip communication architectures","authors":"K. Lahiri, A. Raghunathan","doi":"10.1145/1016720.1016777","DOIUrl":"https://doi.org/10.1145/1016720.1016777","url":null,"abstract":"For complex system-on-chips (SoCs) fabricated in nanometer technologies, the system-level on-chip communication architecture is emerging as a significant source of power consumption. Managing and optimizing this important component of SoC power requires a detailed understanding of the characteristics of its power consumption. Various power estimation and low-power design techniques have been proposed for the global interconnects that form part of SoC communication architectures (e.g., low-swing buses, bus encoding, etc). While effective, they only address a limited part of communication architecture power consumption. A state-of-the-art communication architecture, viewed in its entirety, is quite complex, comprising several components, such as bus interfaces, arbiters, bridges, decoders, and multiplexers, in addition to the global bus lines. Relatively little research has focused on analyzing and comparing the power consumed by different components of the communication architecture. In this work, we present a systematic evaluation and analysis of the power consumed by a state-of-the-art communication architecture (the AMBA on-chip bus), using a commercial design flow. We focus on developing a quantitative understanding of the relative contributions of different communication architecture components to its power consumption, and the factors on which they depend. We decompose the communication architecture power into power consumed by logic components (such as arbiters, decoders, bus bridges), global bus lines (that carry address, data, and control information), and bus interfaces. We also perform studies that analyze the impact of varying application traffic characteristics, and varying SoC complexity, on communication architecture power. Based on our analyses, we evaluate different techniques for reducing the power consumed by the on-chip communication architecture, and compare their effectiveness in achieving power savings at the system level. In addition to quantitatively reinforcing the view that on-chip communication is an important target for system-level power optimization, our work demonstrates (i) the importance of considering the communication architecture in its entirety, and (ii) the opportunities that exist for power reduction through careful communication architecture design.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124387532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 66
Memory system design space exploration for low-power, real-time speech recognition 低功耗、实时语音识别的存储系统设计空间探索
R. Krishna, S. Mahlke, T. Austin
The recent proliferation of computing technology has generated new interest natural I/O interface technologies such as speech recognition. Unfortunately, the computational and memory demands of such applications currently prohibit their use on low-power portable devices in anything more than their simplest forms. Previous work has demonstrated that the thread level concurrency inherent in this application domain can be used to dramatically improve performance with minimal impact on overall system energy consumption, but that such benefits are severely constrained by memory system bandwidth. This work presents a design space exploration of potential memory system architectures. A range of low-power memory organizations are considered, from conventional caching to more advanced system-on-chip implementations. We find that, given architectures able to exploit concurrency in this domain, large L2 based cache hierarchies and high bandwidth memory systems employing data stream partitioning and on-chip embedded DRAM and ROM technologies can provide much of the performance of idealized memory systems without violating the power constraints of the low-power domain.
最近计算技术的普及引起了人们对自然I/O接口技术(如语音识别)的新兴趣。不幸的是,这些应用程序的计算和内存需求目前禁止它们以最简单的形式在低功耗便携式设备上使用。以前的工作已经证明,这个应用程序域中固有的线程级并发性可以在对整个系统能耗影响最小的情况下显著提高性能,但是这种好处受到内存系统带宽的严重限制。这项工作提出了潜在的存储系统架构的设计空间探索。考虑了一系列低功耗内存组织,从传统的缓存到更高级的片上系统实现。我们发现,给定的架构能够利用该领域的并发性,大型基于L2的缓存层次结构和采用数据流分区和片上嵌入式DRAM和ROM技术的高带宽内存系统可以在不违反低功耗领域的功率限制的情况下提供理想内存系统的大部分性能。
{"title":"Memory system design space exploration for low-power, real-time speech recognition","authors":"R. Krishna, S. Mahlke, T. Austin","doi":"10.1145/1016720.1016756","DOIUrl":"https://doi.org/10.1145/1016720.1016756","url":null,"abstract":"The recent proliferation of computing technology has generated new interest natural I/O interface technologies such as speech recognition. Unfortunately, the computational and memory demands of such applications currently prohibit their use on low-power portable devices in anything more than their simplest forms. Previous work has demonstrated that the thread level concurrency inherent in this application domain can be used to dramatically improve performance with minimal impact on overall system energy consumption, but that such benefits are severely constrained by memory system bandwidth. This work presents a design space exploration of potential memory system architectures. A range of low-power memory organizations are considered, from conventional caching to more advanced system-on-chip implementations. We find that, given architectures able to exploit concurrency in this domain, large L2 based cache hierarchies and high bandwidth memory systems employing data stream partitioning and on-chip embedded DRAM and ROM technologies can provide much of the performance of idealized memory systems without violating the power constraints of the low-power domain.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120923742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 7
Efficient search space exploration for HW-SW partitioning HW-SW分区的高效搜索空间探索
S. Banerjee, N. Dutt
Hardware/software (HW-SW) partitioning is a key problem in the codesign of embedded systems, studied extensively in the past. One major open challenge for traditional partitioning approaches - as we move to more complex and heterogeneous SoCs - is the lack of efficient exploration of the large space of possible HW/SW configurations, coupled with the inability to efficiently scale up with larger problem sizes. We make two contributions for HW-SW partitioning of applications represented as procedural call-graphs: 1) we prove that during partitioning, the execution time metric for moving a vertex needs to be updated only for the immediate neighbours of the vertex, rather than for all ancestors along paths to the root vertex; consequently, we observe faster run-times for move-based partitioning algorithms such as simulated annealing (SA), allowing call graphs with thousands of vertices to be processed in less than a second, and 2) we devise a new cost function for SA that allows frequent discovery of better partitioning solutions by searching spaces overlooked by traditional SA cost functions. We present experimental results on a very large design space, where several thousand configurations are explored in minutes as compared to several hours or days using a traditional SA formulation. Furthermore, our approach is frequently able to locate better design points with over 10 % improvement in application execution time compared to the solutions generated by a Kernighan-Lin partitioning algorithm starting with an all-SW partitioning.
硬件/软件(HW-SW)划分是嵌入式系统协同设计中的一个关键问题,过去已被广泛研究。当我们转向更复杂和异构的soc时,传统分区方法面临的一个主要挑战是缺乏对可能的硬件/软件配置的大空间的有效探索,以及无法有效地扩展更大的问题规模。我们对以过程调用图表示的应用程序的HW-SW分区做出了两个贡献:1)我们证明了在分区期间,移动顶点的执行时间度量只需要更新顶点的近邻,而不是沿着路径到根顶点的所有祖先;因此,我们观察到基于移动的分区算法(如模拟退火(SA))的运行时间更快,允许在不到一秒的时间内处理具有数千个顶点的调用图。2)我们为SA设计了一个新的代价函数,该函数允许通过搜索传统SA代价函数忽略的空间来频繁发现更好的分区解决方案。我们在一个非常大的设计空间中展示了实验结果,与使用传统SA配方的几个小时或几天相比,在几分钟内探索了数千种配置。此外,与从全sw分区开始的Kernighan-Lin分区算法生成的解决方案相比,我们的方法通常能够找到更好的设计点,在应用程序执行时间上提高了10%以上。
{"title":"Efficient search space exploration for HW-SW partitioning","authors":"S. Banerjee, N. Dutt","doi":"10.1145/1016720.1016752","DOIUrl":"https://doi.org/10.1145/1016720.1016752","url":null,"abstract":"Hardware/software (HW-SW) partitioning is a key problem in the codesign of embedded systems, studied extensively in the past. One major open challenge for traditional partitioning approaches - as we move to more complex and heterogeneous SoCs - is the lack of efficient exploration of the large space of possible HW/SW configurations, coupled with the inability to efficiently scale up with larger problem sizes. We make two contributions for HW-SW partitioning of applications represented as procedural call-graphs: 1) we prove that during partitioning, the execution time metric for moving a vertex needs to be updated only for the immediate neighbours of the vertex, rather than for all ancestors along paths to the root vertex; consequently, we observe faster run-times for move-based partitioning algorithms such as simulated annealing (SA), allowing call graphs with thousands of vertices to be processed in less than a second, and 2) we devise a new cost function for SA that allows frequent discovery of better partitioning solutions by searching spaces overlooked by traditional SA cost functions. We present experimental results on a very large design space, where several thousand configurations are explored in minutes as compared to several hours or days using a traditional SA formulation. Furthermore, our approach is frequently able to locate better design points with over 10 % improvement in application execution time compared to the solutions generated by a Kernighan-Lin partitioning algorithm starting with an all-SW partitioning.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134266771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 45
Dynamic overlay of scratchpad memory for energy minimization 动态覆盖的刮刮板存储器的能量最小化
Manish Verma, L. Wehmeyer, P. Marwedel
The memory subsystem accounts for a significant portion of the aggregate energy budget of contemporary embedded systems. Moreover, there exists a large potential for optimizing the energy consumption of the memory subsystem. Consequently, novel memories as well as novel algorithms for their efficient utilization are being designed. Scratchpads are known to perform better than caches in terms of power, performance, area and predictability. However, unlike caches they depend upon software allocation techniques for their utilization. We present an allocation technique which analyzes the application and inserts instructions to dynamically copy both code segments and variables onto the scratchpad at runtime. We demonstrate that the problem of dynamically overlaying scratchpad is an extension of the global register allocation problem. The overlay problem is solved optimally using ILP formulation techniques. Our approach improves upon the only previously known allocation technique for statically allocating both variables and code segments onto the scratchpad. Experiments report an average reduction of 34% and 18% in the energy consumption and the runtime of the applications, respectively. A minimal increase in code size is also reported.
内存子系统占当代嵌入式系统总能量预算的很大一部分。此外,在优化存储子系统的能耗方面存在很大的潜力。因此,人们正在设计新的存储器以及有效利用它们的新算法。众所周知,在功率、性能、面积和可预测性方面,刮擦板的性能优于缓存。然而,与缓存不同的是,它们的使用依赖于软件分配技术。我们提出了一种分配技术,该技术可以分析应用程序并插入指令,以便在运行时动态地将代码段和变量复制到草稿板上。我们证明了动态覆盖刮擦板问题是全局寄存器分配问题的扩展。利用ILP配方技术最优地解决了叠加问题。我们的方法改进了以前唯一已知的将变量和代码段静态分配到暂存板上的分配技术。实验报告显示,在能耗和应用程序运行时间上分别平均降低了34%和18%。还报告了代码大小的最小增加。
{"title":"Dynamic overlay of scratchpad memory for energy minimization","authors":"Manish Verma, L. Wehmeyer, P. Marwedel","doi":"10.1109/CODES+ISSS.2004.20","DOIUrl":"https://doi.org/10.1109/CODES+ISSS.2004.20","url":null,"abstract":"The memory subsystem accounts for a significant portion of the aggregate energy budget of contemporary embedded systems. Moreover, there exists a large potential for optimizing the energy consumption of the memory subsystem. Consequently, novel memories as well as novel algorithms for their efficient utilization are being designed. Scratchpads are known to perform better than caches in terms of power, performance, area and predictability. However, unlike caches they depend upon software allocation techniques for their utilization. We present an allocation technique which analyzes the application and inserts instructions to dynamically copy both code segments and variables onto the scratchpad at runtime. We demonstrate that the problem of dynamically overlaying scratchpad is an extension of the global register allocation problem. The overlay problem is solved optimally using ILP formulation techniques. Our approach improves upon the only previously known allocation technique for statically allocating both variables and code segments onto the scratchpad. Experiments report an average reduction of 34% and 18% in the energy consumption and the runtime of the applications, respectively. A minimal increase in code size is also reported.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134521550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 150
Fast cycle-accurate simulation and instruction set generation for constraint-based descriptions of programmable architectures 基于约束描述的可编程体系结构的快速周期精确仿真和指令集生成
S. Weber, M. Moskewicz, M. Gries, C. Sauer, K. Keutzer
State-of-the-art architecture description languages have been successfully used to model application-specific programmable architectures limited to particular control schemes. We introduce a language and methodology that provide a framework for constructing and simulating a wider range of architectures. The framework exploits the fact that designers are often only concerned with data paths, not the instruction set and control. In the framework, each processing element is described in a structural language that only requires the specification of the data path and constraints on how it can be used. From such a description, the supported operations of the processing clement are automatically extracted and a controller is generated. Various architectures are then realized by composing the processing elements. Furthermore, hardware descriptions and bit-true cycle-accurate simulators are automatically generated. Results show that our simulators are up to an order of magnitude faster than other reported simulators of this type and two orders of magnitude faster than equivalent Verilog simulations.
最先进的体系结构描述语言已经成功地用于对限于特定控制方案的特定应用程序可编程体系结构进行建模。我们介绍了一种语言和方法,为构建和模拟更广泛的体系结构提供了框架。该框架利用了设计人员通常只关心数据路径而不是指令集和控制的事实。在框架中,每个处理元素都是用结构化语言描述的,这种语言只需要说明数据路径和如何使用它的约束。从这样的描述中,自动提取处理元素所支持的操作并生成控制器。然后通过组合处理元素来实现各种体系结构。此外,还自动生成硬件描述和位真周期精确模拟器。结果表明,我们的模拟器比其他同类模拟器快一个数量级,比等效Verilog模拟快两个数量级。
{"title":"Fast cycle-accurate simulation and instruction set generation for constraint-based descriptions of programmable architectures","authors":"S. Weber, M. Moskewicz, M. Gries, C. Sauer, K. Keutzer","doi":"10.1145/1016720.1016728","DOIUrl":"https://doi.org/10.1145/1016720.1016728","url":null,"abstract":"State-of-the-art architecture description languages have been successfully used to model application-specific programmable architectures limited to particular control schemes. We introduce a language and methodology that provide a framework for constructing and simulating a wider range of architectures. The framework exploits the fact that designers are often only concerned with data paths, not the instruction set and control. In the framework, each processing element is described in a structural language that only requires the specification of the data path and constraints on how it can be used. From such a description, the supported operations of the processing clement are automatically extracted and a controller is generated. Various architectures are then realized by composing the processing elements. Furthermore, hardware descriptions and bit-true cycle-accurate simulators are automatically generated. Results show that our simulators are up to an order of magnitude faster than other reported simulators of this type and two orders of magnitude faster than equivalent Verilog simulations.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115911405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13
Design and programming of embedded multiprocessors: an interface-centric approach 嵌入式多处理器的设计和编程:以接口为中心的方法
P. V. D. Wolf, E. Kock, T. Henriksson, W. Kruijtzer, G. Essink
We present design technology for the structured design and programming of embedded multi-processor systems. It comprises a task-level interface that can be used both for developing parallel application models and as a platform interface for implementing applications on multi-processor architectures. Associated mapping technology supports refinement of application models towards implementation. By linking application development and implementation aspects, the technology integrates the specification and design phases in the MPSoC design process. Two design cases demonstrate the efficient implementation of the platform interface on different architectures. Industry-wide standardization of a task-level interface can facilitate reuse of function-specific hardware/software modules across companies.
介绍了嵌入式多处理器系统的结构化设计和程序设计技术。它包含一个任务级接口,既可以用于开发并行应用程序模型,也可以用作在多处理器架构上实现应用程序的平台接口。相关的映射技术支持向实现方向细化应用程序模型。通过连接应用程序开发和实现方面,该技术集成了MPSoC设计过程中的规范和设计阶段。两个设计案例演示了平台接口在不同架构上的有效实现。任务级接口的全行业标准化可以促进跨公司特定功能的硬件/软件模块的重用。
{"title":"Design and programming of embedded multiprocessors: an interface-centric approach","authors":"P. V. D. Wolf, E. Kock, T. Henriksson, W. Kruijtzer, G. Essink","doi":"10.1109/CODES+ISSS.2004.17","DOIUrl":"https://doi.org/10.1109/CODES+ISSS.2004.17","url":null,"abstract":"We present design technology for the structured design and programming of embedded multi-processor systems. It comprises a task-level interface that can be used both for developing parallel application models and as a platform interface for implementing applications on multi-processor architectures. Associated mapping technology supports refinement of application models towards implementation. By linking application development and implementation aspects, the technology integrates the specification and design phases in the MPSoC design process. Two design cases demonstrate the efficient implementation of the platform interface on different architectures. Industry-wide standardization of a task-level interface can facilitate reuse of function-specific hardware/software modules across companies.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121553776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 152
Modeling operation and microarchitecture concurrency for communication architectures with application to retargetable simulation 通信体系结构的操作和微体系结构并发建模及其可重目标仿真应用
Xinping Zhu, W. Qin, S. Malik
In multiprocessor based SoCs, optimizing the communication architecture is often as important as, if not more than, optimizing the computation architecture. While there are mature platforms and techniques for the modeling and evaluation of computation architectures, the same is not true for the communication architectures. A major challenge in modeling the communication architecture is managing the concurrency at multiple levels: at the operation level, multiple communication operations may be active at any time; at the microarchitecture level, several microarchitectural components may be operating in parallel. Further, it is important to be able to clearly specify how the operation level concurrency maps to the microarchitectural level concurrency. This work presents a modeling methodology and a retargetable simulation framework which fill this gap. This framework seeks to facilitate the design space exploration of the communication sub-system through a rigorous modeling approach based on a formal concurrency model, the operation state machine (OSM). We first introduce the basic notions and concepts of OSM and show by example how this model can be used to represent the inherent concurrency in the architecture and microarchitecture of processors. Then we demonstrate the applicability of OSM in modeling on-chip communication architectures (OCAs) by walking though a router based packet switching network example and a bus example. Due to the fact that the OSM model is naturally suited to handle the operation and microarchitecture level concurrencies of OCAs as well, our OSM-based modeling methodology enables the entire system including both the computation and communication architectures to be modeled in a single OSM framework. This allows us to develop a tool set that can synthesize cycle-accurate system simulators for multi-PE SoC prototypes. To demonstrate the flexibility of this methodology, we choose two distinct system configurations with different types of OCA: a 4/spl times/4 mesh network of 16 PEs, and a cluster of 4 PEs connected by a bus. We show that by simulation, critical system information such as timing and communication patterns can be obtained and evaluated. Consequently, system-level design choices regarding the communication architecture can be made with high confidence in early stages of design. In addition to improving design quality, this methodology also results in significantly shortened design-time.
在基于多处理器的soc中,优化通信体系结构通常与优化计算体系结构同样重要,甚至更重要。虽然对于计算体系结构的建模和评估有成熟的平台和技术,但对于通信体系结构来说,情况并非如此。对通信体系结构进行建模的一个主要挑战是管理多个级别的并发性:在操作级别,多个通信操作可能随时处于活动状态;在微体系结构级别,多个微体系结构组件可能并行运行。此外,能够清楚地指定操作级并发性如何映射到微架构级并发性是很重要的。这项工作提出了一种建模方法和一个可重新定位的仿真框架,填补了这一空白。该框架旨在通过基于正式并发模型,即操作状态机(OSM)的严格建模方法促进通信子系统的设计空间探索。我们首先介绍OSM的基本概念和概念,并通过示例展示如何使用该模型来表示处理器体系结构和微体系结构中的固有并发性。然后,我们通过一个基于路由器的分组交换网络示例和一个总线示例来演示OSM在片上通信体系结构(oca)建模中的适用性。由于OSM模型自然适合处理oca的操作和微架构级并发,我们基于OSM的建模方法使整个系统(包括计算和通信架构)能够在单个OSM框架中建模。这使我们能够开发一套工具集,可以为多pe SoC原型合成周期精确的系统模拟器。为了展示这种方法的灵活性,我们选择了两种具有不同类型OCA的不同系统配置:16个pe的4/spl倍/4网状网络,以及通过总线连接的4个pe集群。我们通过仿真证明,关键的系统信息,如时序和通信模式可以获得和评估。因此,有关通信体系结构的系统级设计选择可以在设计的早期阶段以高可信度进行。除了提高设计质量外,这种方法还显著缩短了设计时间。
{"title":"Modeling operation and microarchitecture concurrency for communication architectures with application to retargetable simulation","authors":"Xinping Zhu, W. Qin, S. Malik","doi":"10.1145/1016720.1016738","DOIUrl":"https://doi.org/10.1145/1016720.1016738","url":null,"abstract":"In multiprocessor based SoCs, optimizing the communication architecture is often as important as, if not more than, optimizing the computation architecture. While there are mature platforms and techniques for the modeling and evaluation of computation architectures, the same is not true for the communication architectures. A major challenge in modeling the communication architecture is managing the concurrency at multiple levels: at the operation level, multiple communication operations may be active at any time; at the microarchitecture level, several microarchitectural components may be operating in parallel. Further, it is important to be able to clearly specify how the operation level concurrency maps to the microarchitectural level concurrency. This work presents a modeling methodology and a retargetable simulation framework which fill this gap. This framework seeks to facilitate the design space exploration of the communication sub-system through a rigorous modeling approach based on a formal concurrency model, the operation state machine (OSM). We first introduce the basic notions and concepts of OSM and show by example how this model can be used to represent the inherent concurrency in the architecture and microarchitecture of processors. Then we demonstrate the applicability of OSM in modeling on-chip communication architectures (OCAs) by walking though a router based packet switching network example and a bus example. Due to the fact that the OSM model is naturally suited to handle the operation and microarchitecture level concurrencies of OCAs as well, our OSM-based modeling methodology enables the entire system including both the computation and communication architectures to be modeled in a single OSM framework. This allows us to develop a tool set that can synthesize cycle-accurate system simulators for multi-PE SoC prototypes. To demonstrate the flexibility of this methodology, we choose two distinct system configurations with different types of OCA: a 4/spl times/4 mesh network of 16 PEs, and a cluster of 4 PEs connected by a bus. We show that by simulation, critical system information such as timing and communication patterns can be obtained and evaluated. Consequently, system-level design choices regarding the communication architecture can be made with high confidence in early stages of design. In addition to improving design quality, this methodology also results in significantly shortened design-time.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130963573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 25
期刊
International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.
全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1