International Conference on Hardware/Software Codesign and System Synthesis: Latest Papers

SuSeSim: a fast simulation strategy to find optimal L1 cache configuration for embedded systems

International Conference on Hardware/Software Codesign and System Synthesis

Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629476

M. S. Haque, Andhi Janapsatya, S. Parameswaran

Simulating an application is a popular and reliable way to find the optimal level-one (L1) cache configuration for an application-specific embedded processor. However, long simulation time is one of the main disadvantages of simulation-based approaches. In this paper, we propose a new, fast simulation method, Super Set Simulator (SuSeSim). Whereas previous methods use a top-down search strategy, SuSeSim uses a bottom-up search strategy together with a new, elaborate data structure to reduce the search space needed to determine a cache hit or miss. SuSeSim can simulate hundreds of cache configurations simultaneously while reading an application's memory request trace just once, and it accurately records the total number of cache hits and misses. Depending on the cache block size and benchmark application, SuSeSim can reduce the number of tags to be checked by up to 43% compared to the fastest existing simulation approach (the CRCB algorithm). With a faster search and an easy-to-maintain data structure, SuSeSim can be up to 94% faster at simulating memory requests than the CRCB algorithm.
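
To make the single-pass idea concrete, the sketch below feeds one memory trace to several cache configurations at once and counts hits and misses for each. It is a naive baseline, not SuSeSim: the abstract does not describe the bottom-up search or its data structure, so neither is reproduced here. The direct-mapped cache model and the names `DirectMappedCache` and `simulate` are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DirectMappedCache:
    """One direct-mapped cache configuration (hypothetical helper, not from the paper)."""
    num_sets: int
    block_size: int
    tags: dict = field(default_factory=dict)  # set index -> tag currently stored
    hits: int = 0
    misses: int = 0

    def access(self, address: int) -> None:
        block = address // self.block_size  # block number containing the address
        index = block % self.num_sets       # set the block maps to
        tag = block // self.num_sets        # identifies the block within that set
        if self.tags.get(index) == tag:
            self.hits += 1
        else:
            self.misses += 1
            self.tags[index] = tag          # replace whatever was stored in the set


def simulate(trace, configs):
    """Read the trace once and update every configuration for each request."""
    caches = [DirectMappedCache(num_sets=s, block_size=b) for s, b in configs]
    for address in trace:
        for cache in caches:
            cache.access(address)
    return {(c.num_sets, c.block_size): (c.hits, c.misses) for c in caches}
```

In this baseline every request costs one tag check per configuration. The abstract's claim is that SuSeSim reduces the number of tags checked relative to prior simulators while still reading the trace only once.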

Citations: 25

A high-level virtual platform for early MPSoC software development

International Conference on Hardware/Software Codesign and System Synthesis

Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629438

J. Ceng, Weihua Sheng, J. Castrillón, Anastasia Stulova, R. Leupers, G. Ascheid, H. Meyr

Multiprocessor Systems-on-Chip (MPSoCs) are now widely used, but software development for them remains one of the biggest challenges for developers. Virtual Platforms (VPs) have been introduced to the industry and allow MPSoC software development without a hardware prototype. Nevertheless, for developers in the early design stage, when no VP is available, software programming support is unsatisfactory. This paper introduces a High-level Virtual Platform (HVP) aimed at early MPSoC software development. The framework provides a set of tools for abstract MPSoC simulation and the corresponding application programming support, enabling the development of reusable C code at a high level. A case study performed on several MPSoCs shows that code developed on the HVP can easily be reused on different target platforms. Moreover, the high simulation speed achieved by the HVP improves the design efficiency of software developers.

Citations: 44

ILP optimal scheduling for multi-module memory

International Conference on Hardware/Software Codesign and System Synthesis

Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629473

Meikang Qiu, Lei Zhang, E. Sha

In high-end digital signal processing (DSP) systems, multi-module memory provides high memory bandwidth and a low-power operating mode for energy savings. However, making full use of these architectural features is a challenging code optimization problem. In this paper, we propose an integer linear programming (ILP) model that optimizes the performance and energy consumption of multi-module memories by solving the variable assignment, instruction scheduling, and operating mode setting problems simultaneously. The combined effect of performance and energy-saving requirements is also considered. We develop two optimization techniques to improve the computational efficiency of our ILP model. The experimental results show that the optimal performance and energy solution can be obtained within a reasonable amount of time.

Citations: 8

Synthesis of topology configurations and deadlock free routing algorithms for ReNoC-based systems-on-chip

International Conference on Hardware/Software Codesign and System Synthesis

Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629500

M.B. Stuart, M. B. Stensgaard, J. Sparsø

In the near future, generic System-on-Chip (SoC) platforms will replace custom-designed SoCs. Such generic platforms require a highly flexible interconnect in order to support a wide variety of applications. The ReNoC architecture provides this by allowing power-efficient, application-specific topologies to be configured on top of a fixed but reconfigurable physical architecture through a mixture of packet switching and physical circuit switching. The first contribution of this paper is three novel algorithms that, given an abstract description of the application and the physical architecture, 1) synthesize the application-specific topologies, 2) map them onto the physical architecture, and 3) create deadlock-free, application-specific routing algorithms. The second contribution is a novel physical architecture based on an extended mesh of ReNoC nodes. We apply our algorithms to a mixture of real and synthetic applications and three different physical architectures. Our results show that the performance of the different algorithms is highly dependent on the physical architecture. On average, our novel physical architecture reduces power consumption by 58% compared to a conventional Network-on-Chip.

Citations: 8

Fast model-based test case classification for performance analysis of multimedia MPSoC platforms

International Conference on Hardware/Software Codesign and System Synthesis

Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629492

Deepak Gangadharan, S. Chakraborty, Roger Zimmermann

Currently, performance analysis of multimedia MPSoC platforms relies largely on simulation. The execution of one or more applications on such a platform is simulated for a library of test video clips. If all specified performance constraints are satisfied for this library, then the architecture is assumed to be well designed. This is similar to testing software for functional correctness. However, in contrast to functional testing, simulating a set of video clips for a complex application/architecture is extremely time-consuming. In this paper we propose a technique for clustering a library of video clips such that it is sufficient to simulate only one clip from each cluster rather than the entire library. Our clustering is scalable, i.e., the number of clusters may be determined based on the number of clips that the system designer wishes to simulate (which is independent of the input library size). For each video clip in the library, we perform a fast bitstream analysis from which the workload generated while processing this clip on the given architecture may be estimated. This workload information, in conjunction with a workload model and a performance model of the architecture, is used for the clustering. The entire process does not involve any simulation and is hence extremely fast. We illustrate its utility through a detailed case study using an MPEG-2 decoder application running on an MPSoC platform. As part of the validation of our methodology, we observed that video clips falling into the same cluster exhibit similar worst-case buffer backlogs and worst-case delays for one macroblock. Overall, the results demonstrate that the proposed method provides a very fast and accurate analysis and hence can be of significant benefit to the system designer.

Citations: 1

An MDP-based application oriented optimal policy for wireless sensor networks

International Conference on Hardware/Software Codesign and System Synthesis

Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629461

Arslan Munir, A. Gordon-Ross

Technological advancements due to Moore's law have led to the proliferation of complex wireless sensor network (WSN) domains. One commonality across all WSN domains is the need to meet application requirements (e.g., lifetime and responsiveness) through domain-specific sensor node design. Techniques such as sensor node parameter tuning enable WSN designers to specialize tunable parameters (e.g., processor voltage and frequency, and sensing frequency) to meet these application requirements. However, given WSN domain diversity, varying environmental situations (stimuli), and sensor node complexity, sensor node parameter tuning is a very challenging task. In this paper, we propose an automated Markov Decision Process (MDP)-based methodology to prescribe optimal sensor node operation (selection of values for tunable parameters such as processor voltage, processor frequency, and sensing frequency) to meet application requirements and adapt to changing environmental stimuli. Numerical results confirm the optimality of our proposed methodology and reveal that it meets application requirements more closely than other feasible policies.
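
As a rough illustration of how an MDP yields an operating policy, the sketch below runs standard value iteration on a toy two-state sensor node. Everything specific in it is an assumption: the abstract does not give the paper's states, rewards, transition probabilities, or solution method, so the states `idle`/`event`, the actions `low`/`high` (standing in for a tunable parameter setting), and all numbers are hypothetical.

```python
def value_iteration(states, actions, transition, reward, gamma=0.9, tol=1e-9):
    """Return (values, policy) for a finite MDP.

    transition[s][a] is a list of (probability, next_state) pairs and
    reward[s][a] is the immediate reward for taking action a in state s.
    """
    def q(s, a, values):
        # Expected discounted return of taking action a in state s.
        return reward[s][a] + gamma * sum(p * values[s2] for p, s2 in transition[s][a])

    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(q(s, a, values) for a in actions)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:  # stop once no state value changes noticeably
            break
    policy = {s: max(actions, key=lambda a: q(s, a, values)) for s in states}
    return values, policy


# Hypothetical toy model: the environment is idle or shows an event with equal
# probability at each step, regardless of the chosen setting.
STATES = ["idle", "event"]
ACTIONS = ["low", "high"]
UNIFORM = [(0.5, "idle"), (0.5, "event")]
TRANSITION = {s: {a: UNIFORM for a in ACTIONS} for s in STATES}
# Assumed rewards: the low setting saves energy when idle, while the high
# setting is needed to respond when an event occurs.
REWARD = {
    "idle": {"low": 1.0, "high": 0.0},
    "event": {"low": -1.0, "high": 2.0},
}
```

Calling `value_iteration(STATES, ACTIONS, TRANSITION, REWARD)` on this toy model selects the low setting when idle and the high setting during an event, which is the kind of state-dependent parameter choice the abstract describes.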

Citations: 39

Squashing microcode stores to size in embedded systems while delivering rapid microcode accesses

International Conference on Hardware/Software Codesign and System Synthesis

Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629471

Chengmo Yang, Mingjing Chen, A. Orailoglu

Microcoded customized IPs offer superior performance and direct programmability of micro-architectural structures compared to instruction-based processors, yet at the cost of drastically enlarged code sizes. Code compression can deliver size reductions but necessitates attention to performance issues, so that the performance benefits of microcoded IPs are not squandered in the process. To attain this goal, we propose in this paper a fast code compression technique through exploiting the fact that the microcodes contain a sizable amount of unspecified bits. Although the values and the positions of the specified bits are highly irregular, the proposed technique can still flexibly and precisely fill in these fully specified bits through utilizing a linear network. The linear property inherent in the compression strategy in turn enables the development of an extremely low-overhead decompression engine. At runtime, the decompressed code can be generated in such a way that all the specified bits can be filled as required by a fixed-bandwidth XOR network. The combination of the proposed flexible XOR-based network with a minimum two-level storage for highly specified fields, such as immediate values, offers utmost code compression, attained within a negligible amount of performance and hardware overhead.

Citations: 2

Mapping pipelined applications onto heterogeneous embedded systems: a bayesian optimization algorithm based approach

International Conference on Hardware/Software Codesign and System Synthesis

Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629495

Antonino Tumeo, Marco Branca, L. Camerini, C. Pilato, P. Lanzi, Fabrizio Ferrandi, D. Sciuto

In this paper we propose a flow based on the Bayesian Optimization Algorithm (BOA) for mapping pipelined applications on a heterogeneous multiprocessor platform on Field Programmable Gate Array (FPGA) with customizable processors. BOA is a Probabilistic Model Building Genetic Algorithm (PMBGA) that, substituting the classical mutation and crossover operators with the construction and the sampling of a Bayesian network, is able to identify correlated sub-structures within the problem to be maintained while generating new solutions. The paper introduces the model adopted for pipelined applications and then shows why BOA fits the problem better than other search algorithms, like Genetic Algorithm (GA), Simulated Annealing (SA) and Tabu Search (TS). We also show that our algorithm is able to cope with data parallel pipelined algorithms. We finally validate our flow on realistic applications like JPEG and ADPCM coding by executing the resulting mapping on our platform.

Citations: 6

Statistical physics approaches for network-on-chip traffic characterization

International Conference on Hardware/Software Codesign and System Synthesis

Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629498

P. Bogdan, R. Marculescu

In order to face the growing complexity of embedded applications, we aim to build highly efficient Network-on-Chip (NoC) architectures which can connect in a scalable manner various computational modules of the platform. For such networked platforms, it is increasingly important to accurately model the traffic characteristics as this is intimately related to our ability to determine the optimal buffer size at various routers in the network and thus provide analytical metrics for various power-performance trade-offs. In this paper, we show that the main limitations of queueing theory and Markov chain approaches to solving the buffer sizing problem can be overcome by adopting a statistical physics approach to probability density characterization which incorporates the power law distribution, correlations, and scaling properties exhibited within an NoC architecture due to various network transactions. As experimental results show, this new approach represents a breakthrough in accurate traffic modeling under non-equilibrium conditions. As such, our results can be directly used to solve the buffer sizing problem for multiprocessor systems where communication happens via the NoC approach.

Citations: 62

Using continuous statistical machine learning to enable high-speed performance prediction in hybrid instruction-/cycle-accurate instruction set simulators

International Conference on Hardware/Software Codesign and System Synthesis

Pub Date : 2009-10-11 DOI: 10.1145/1629435.1629478

D. Powell, Björn Franke

Functional instruction set simulators perform instruction-accurate simulation of benchmarks at high instruction rates. Unlike their slower but cycle-accurate counterparts, however, they cannot provide cycle counts because of their higher level of hardware abstraction. In this paper we present a novel approach to performance prediction based on statistical machine learning, utilizing a hybrid instruction- and cycle-accurate simulator. We introduce the concept of continuous machine learning to simulation, whereby new training data points are acquired on demand and used for on-the-fly updates of the performance model. Furthermore, we show how statistical regression can be adapted to reduce the cost of these updates during a performance-critical simulation. For a state-of-the-art simulator modeling the ARC 750D embedded processor, we demonstrate that our approach is highly accurate, with an average error below 2.5%, while achieving a speed-up of approximately 50% over the baseline cycle-accurate simulation.
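
The idea of updating a regression model cheaply as new training points arrive can be sketched with a one-feature least-squares fit maintained through running sums. This is only an illustration of incremental regression in general: the abstract does not state which features or regression technique the paper uses, so the single feature (an instruction count) and the class name `OnlineLinearModel` are assumptions.

```python
class OnlineLinearModel:
    """Least-squares fit of y = a*x + b, updated in O(1) per sample via running sums."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        """Add one training point, e.g. (instruction count, measured cycles)."""
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def predict(self, x):
        """Predict y for x from the closed-form least-squares coefficients."""
        denom = self.n * self.sxx - self.sx ** 2
        if self.n < 2 or denom == 0:
            raise ValueError("need at least two samples with distinct x values")
        a = (self.n * self.sxy - self.sx * self.sy) / denom
        b = (self.sy - a * self.sx) / self.n
        return a * x + b
```

In a hybrid simulator, `update` would be called whenever a cycle-accurate interval produces a fresh measurement, and `predict` would supply cycle estimates during the faster instruction-accurate intervals. Because only five running sums are stored, each update costs constant time regardless of how many samples have been seen.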

Citations: 19
