
International Conference on Hardware/Software Codesign and System Synthesis: Latest Publications

A standby-sparing technique with low energy-overhead for fault-tolerant hard real-time systems
Pub Date: 2009-10-11 DOI: 10.1145/1629435.1629463
A. Ejlali, B. Al-Hashimi, P. Eles
Time redundancy (rollback-recovery) and hardware redundancy are commonly used in real-time systems to achieve fault tolerance. From an energy consumption point of view, time redundancy is generally preferable to hardware redundancy. However, hard real-time systems often use hardware redundancy to meet the high reliability requirements of safety-critical applications. In this paper we propose a hardware-redundancy technique with low energy overhead for hard real-time systems. The proposed technique is based on standby-sparing, where the system is composed of a primary unit and a spare. Through analytical models, we have developed an online energy-management method which uses a slack reclamation scheme to reduce the energy consumption of both the primary and spare units. In this method, dynamic voltage scaling (DVS) is used for the primary unit and dynamic power management (DPM) is used for the spare. We conducted several experiments to compare the proposed system with a fault-tolerant real-time system which uses time redundancy for fault tolerance and DVS with slack reclamation for low energy consumption. The results show that for relaxed time constraints, the proposed system provides up to 24% energy savings compared to the time-redundancy system. For tight deadlines, where the time-redundancy system can tolerate no faults, the proposed system preserves its fault tolerance at the cost of about 32% more energy consumption.
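To make the energy-management idea concrete, here is a minimal C sketch of the standby-sparing policy as we read it from the abstract: the primary stretches its execution into the available slack via DVS, while the spare stays in a sleep state under DPM and only re-executes at full speed on a fault. The task parameters and the cubic power model are illustrative assumptions, not the paper's analytical models.

/* Minimal sketch (not the authors' implementation) of standby-sparing:
 * the primary uses DVS to stretch execution into available slack, while
 * the spare stays power-gated (DPM) and only runs on a fault. */
#include <stdio.h>

typedef struct { double wcet_at_fmax; double deadline; } task_t;

/* DVS: scale speed so the primary finishes just at the deadline,
 * reclaiming the static slack (deadline - WCET). */
static double primary_speed(const task_t *t) {
    double s = t->wcet_at_fmax / t->deadline;   /* normalized speed in (0,1] */
    return s > 1.0 ? 1.0 : s;
}

/* DPM: the spare sleeps unless the primary signals a fault. */
static double spare_energy(const task_t *t, int primary_faulty,
                           double p_active, double p_sleep) {
    return primary_faulty ? p_active * t->wcet_at_fmax   /* re-execute at fmax */
                          : p_sleep * t->deadline;       /* stay in sleep mode */
}

int main(void) {
    task_t t = { .wcet_at_fmax = 4.0, .deadline = 10.0 }; /* assumed numbers */
    double s = primary_speed(&t);
    /* Assumed cubic dynamic-power model: P ~ s^3, energy = P * time. */
    double e_primary = (s * s * s) * (t.wcet_at_fmax / s);
    printf("primary speed %.2f, energy %.2f\n", s, e_primary);
    printf("spare energy (no fault): %.2f\n", spare_energy(&t, 0, 1.0, 0.05));
    return 0;
}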
Citations: 64
LOP: a novel SRAM-based architecture for low power and high throughput packet classification
Pub Date: 2009-10-11 DOI: 10.1145/1629435.1629455
Xin He, Jorgen Peddersen, S. Parameswaran
Packet classification has become an important problem in modern network processors used in networking embedded systems such as routers. Algorithms for matching incoming packets from the network to pre-defined rules have been proposed by a number of researchers. Current software-based packet classification techniques have low performance, prompting many researchers to move their focus to new architectures encompassing both software and hardware components. Some of the newer hardware architectures exclusively utilize Ternary Content Addressable Memory (TCAM) to improve the performance of rule matching. However, this results in systems with high power consumption: TCAM consumes a large amount of power because it reads the entire memory array during every access, much of which is unnecessary. In this paper, we propose LOP, a novel SRAM-based architecture in which incoming packets are compared against parts of all rules simultaneously until a single matching rule is found for the compared bits. LOP significantly reduces power consumption because only a segment of the memory is compared against the incoming packet. Despite the additional time needed to match a single packet, parallel comparison of multiple packets can improve throughput beyond that of the TCAM approaches while consuming significantly less power. Nine different benchmarks were tested in two classification systems, with results showing that LOP architectures provide high lookup rates, high throughput, and low power consumption. Compared with a state-of-the-art TCAM implementation (throughput of 495 million searches per second (Msps)) in 65nm CMOS technology, LOP saves on average 43% of energy consumption with a throughput of 590 Msps.
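The core idea, narrowing the candidate rule set one memory segment at a time instead of reading every rule bit at once, can be sketched as follows. This is a hedged illustration with exact-match rules only (a real classifier also needs ternary wildcards); the rule values, segment width, and table size are made-up examples.

/* Illustrative sketch of segment-at-a-time matching: compare the packet
 * against one 8-bit segment of all rules per step and narrow the
 * candidate set, instead of reading every rule bit as a TCAM does. */
#include <stdint.h>
#include <stdio.h>

#define NRULES 4
#define SEGS   4          /* 32-bit rules split into 4 x 8-bit segments */

static const uint32_t rules[NRULES] = {
    0xC0A80101, 0xC0A80102, 0x0A000001, 0xC0A90101
};

static int classify(uint32_t pkt) {
    uint8_t alive[NRULES];
    int n_alive = NRULES;
    for (int r = 0; r < NRULES; r++) alive[r] = 1;

    for (int s = 0; s < SEGS && n_alive > 1; s++) {
        uint32_t shift = 24 - 8 * s;
        uint8_t pseg = (pkt >> shift) & 0xFF;      /* packet segment */
        for (int r = 0; r < NRULES; r++) {
            if (alive[r] && ((rules[r] >> shift) & 0xFF) != pseg) {
                alive[r] = 0;                      /* rule eliminated */
                n_alive--;
            }
        }
    }
    for (int r = 0; r < NRULES; r++)
        if (alive[r]) return r;                    /* surviving rule, if any */
    return -1;
}

int main(void) {
    printf("matched rule: %d\n", classify(0xC0A80102)); /* expect rule 1 */
    return 0;
}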
Citations: 7
A DP-network for optimal dynamic routing in network-on-chip
Pub Date: 2009-10-11 DOI: 10.1145/1629435.1629452
T. Mak, P. Cheung, W. Luk, K. Lam
Dynamic routing is desirable because of its substantial improvement in communication bandwidth and its intelligent adaptation to faulty links and congested traffic. However, implementing adaptive routing in a network-on-chip (NoC) system is not trivial, and is further complicated by the requirements of deadlock freedom and real-time optimal decision making. In this paper, we present a deadlock-free routing architecture which employs a dynamic programming (DP) network to provide on-the-fly optimal path planning and network monitoring for packet switching. We also introduce a new routing strategy called k-step look-ahead. This new strategy can substantially reduce the size of the routing table while maintaining a high quality of adaptation, which leads to a scalable dynamic routing solution with minimal hardware overhead. Our results, based on a cycle-accurate simulator, demonstrate the effectiveness of the DP-network, which outperforms both deterministic and adaptive routing algorithms by 22.3% in average delay across various traffic scenarios. Moreover, results obtained from hardware implementations show that the hardware overhead of the DP-network is insignificant.
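The dynamic-programming principle behind such a network can be illustrated with a software analogue: each router repeatedly relaxes its cost-to-destination from its neighbours' costs (a Bellman recursion), after which every unit knows the cheapest next hop. The mesh size and unit link costs below are assumptions; the paper's contribution is solving this recursion in parallel hardware alongside the NoC.

/* Software analogue of the DP-network: synchronous value iteration over a
 * 4x4 mesh computing each router's minimum cost to a destination. */
#include <stdio.h>

#define W 4
#define H 4
#define INF 1e9

int main(void) {
    double cost[H][W];
    int dx = 3, dy = 3;                    /* destination router */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            cost[y][x] = INF;
    cost[dy][dx] = 0.0;

    /* Repeat the Bellman relaxation until costs converge. */
    for (int it = 0; it < W + H; it++) {
        for (int y = 0; y < H; y++) {
            for (int x = 0; x < W; x++) {
                double best = cost[y][x];
                /* link cost 1.0 per hop; congestion would raise it */
                if (x > 0     && cost[y][x-1] + 1.0 < best) best = cost[y][x-1] + 1.0;
                if (x < W - 1 && cost[y][x+1] + 1.0 < best) best = cost[y][x+1] + 1.0;
                if (y > 0     && cost[y-1][x] + 1.0 < best) best = cost[y-1][x] + 1.0;
                if (y < H - 1 && cost[y+1][x] + 1.0 < best) best = cost[y+1][x] + 1.0;
                cost[y][x] = best;
            }
        }
    }
    printf("cost from (0,0) to (%d,%d): %.0f hops\n", dx, dy, cost[0][0]);
    return 0;
}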
Citations: 25
A tuneable software cache coherence protocol for heterogeneous MPSoCs
Pub Date: 2009-10-11 DOI: 10.1145/1629435.1629488
Frank E. B. Ophelders, M. Bekooij, H. Corporaal
In a multiprocessor system-on-chip (MPSoC), private caches introduce the cache coherence problem. Here, we target heterogeneous MPSoCs with a network-on-chip (NoC). Existing hardware cache coherence protocols are less suitable for MPSoCs because many off-the-shelf processors used in MPSoCs do not support these protocols. Furthermore, these protocols typically rely on global visibility and serialization of writes, which does not match well with the parallel point-to-point communication provided by a NoC. Therefore, we propose a software cache coherence protocol which can be applied in a heterogeneous MPSoC with a NoC. The software cache coherence protocol relies on explicit synchronization in the software. More specifically, caches are guaranteed to be coherent according to the Release Consistency model, on top of which we have implemented the standard Pthreads communication library. Heterogeneous MPSoCs with off-the-shelf processors can easily be supported, because processors are only required to provide cache control operations, e.g., clean and invalidate. All cache coherence operations are interruptible and do not impact the execution of tasks on other processors; therefore, this protocol is suitable for predictable MPSoCs. Our software cache coherence protocol is implemented on an ARM926EJ-S MPSoC mapped onto an FPGA. From experiments we conclude that the protocol overhead is low for applications taken from the SPLASH-2 benchmark set. For these applications we observed a speedup between 1.89 and 2.01 on the two-processor MPSoC.
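A minimal sketch of the release-consistency discipline such a protocol relies on: software only touches shared data between an acquire and a release, the acquire invalidates possibly stale cached copies, and the release cleans (writes back) dirty lines. The cache-maintenance functions below are empty stubs standing in for processor-specific cache control operations, and the lock is a placeholder for a real hardware synchronization primitive.

/* Sketch of software coherence under Release Consistency: invalidate on
 * acquire, clean (write back) on release; no hardware snooping needed. */
#include <stdio.h>

static void cache_invalidate_range(void *p, unsigned n) { (void)p; (void)n;
    /* platform-specific, e.g. per-line invalidate ops on an ARM926 */ }
static void cache_clean_range(void *p, unsigned n) { (void)p; (void)n;
    /* platform-specific: write dirty lines back to shared memory */ }

static volatile int lock;                 /* stands in for a HW sync primitive */
static int shared_buf[64];

static void acquire(void *data, unsigned n) {
    while (__sync_lock_test_and_set(&lock, 1)) ;  /* spin */
    cache_invalidate_range(data, n);      /* drop stale copies before reading */
}

static void release(void *data, unsigned n) {
    cache_clean_range(data, n);           /* make writes globally visible */
    __sync_lock_release(&lock);
}

int main(void) {
    acquire(shared_buf, sizeof shared_buf);
    shared_buf[0] = 42;                   /* critical section on shared data */
    release(shared_buf, sizeof shared_buf);
    printf("%d\n", shared_buf[0]);
    return 0;
}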
Citations: 7
An on-chip interconnect and protocol stack for multiple communication paradigms and programming models
Pub Date: 2009-10-11 DOI: 10.1145/1629435.1629450
Andreas Hansson, K. Goossens
A growing number of applications, with diverse requirements, are integrated on the same System on Chip (SoC) in the form of hardware and software Intellectual Property (IP). The diverse requirements, coupled with the IPs being developed by unrelated design teams, lead to multiple communication paradigms, programming models, and interface protocols that the on-chip interconnect must accommodate. Traditionally, on-chip buses offer distributed shared memory communication with established memory-consistency models, but are tightly coupled to a specific interface protocol. On-chip networks, on the other hand, offer layering and interface abstraction, but are centred around point-to-point streaming communication, and do not address issues at the higher layers in the protocol stack, such as memory-consistency models and message-dependent deadlock. In this work we introduce an on-chip interconnect and protocol stack that combines streaming and distributed shared memory communication. The proposed interconnect offers an established memory-consistency model and does not restrict any higher-level protocol dependencies. We present the protocol stack and the architectural blocks and quantify the cost, both on the block level and for a complete SoC. For a multi-processor multi-application SoC with multiple communication paradigms and programming models, our proposed interconnect occupies only 4% of the chip area.
Citations: 27
A scalable parallel H.264 decoder on the cell broadband engine architecture
Pub Date: 2009-10-11 DOI: 10.1145/1629435.1629484
Michael A. Baker, Pravin Dalale, Karam S. Chatha, S. Vrudhula
The H.264 video codec provides exceptional video compression while imposing dramatic increases in computational complexity over previous standards. While exploiting parallelism in H.264 is notoriously difficult, successful parallel implementations promise substantial performance gains, particularly as High Definition (HD) content penetrates a widening variety of applications. We present a highly scalable parallelization scheme implemented on IBM's multicore Cell Broadband Engine (CBE) and based on FFmpeg's open source H.264 video decoder. We address resource limitations and complex data dependencies to achieve nearly ideal decoding speedup for the parallelizable portion of the encoded stream. Our decoder achieves better performance than previous implementations, and is deeply scalable for large format video. We discuss architecture and codec specific performance optimizations, code overlays, data structures, memory access scheduling, and vectorization.
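One widely used source of macroblock-level parallelism in H.264, which a scheme like this can exploit across SPEs, is the wavefront: a macroblock typically depends on its left, top, and top-right neighbours, so all macroblocks on the same skewed anti-diagonal are mutually independent. The frame dimensions in this sketch are arbitrary, and the sketch only enumerates the waves rather than decoding anything.

/* Wavefront enumeration: MB (x,y) becomes ready in wave w = x + 2*y, since
 * its left (x-1,y) and top-right (x+1,y-1) neighbours are in wave w-1 and
 * its top (x,y-1) in wave w-2; all MBs in a wave can decode in parallel. */
#include <stdio.h>

#define MB_COLS 8
#define MB_ROWS 4

int main(void) {
    int last_wave = (MB_COLS - 1) + 2 * (MB_ROWS - 1);
    for (int w = 0; w <= last_wave; w++) {
        printf("wave %2d:", w);
        for (int y = 0; y < MB_ROWS; y++) {
            int x = w - 2 * y;
            if (x >= 0 && x < MB_COLS)
                printf(" (%d,%d)", x, y);   /* decodable concurrently */
        }
        printf("\n");
    }
    return 0;
}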
Citations: 22
A variation-tolerant scheduler for better than worst-case behavioral synthesis
Pub Date: 2009-10-11 DOI: 10.1145/1629435.1629467
J. Cong, Albert Liu, B. Liu
There has been a recent shift in design paradigms, with many designers turning towards yield-driven approaches to synthesize and design systems. A major cause of this shift is the continual scaling of transistors, which makes process variation impossible to ignore. Better than worst-case (BTW) designs exploit these variation effects while addressing the performance limits imposed by worst-case analysis. In this paper we first present the variation-tolerant stallable-FSM architecture, which provides fault detection and recovery, allowing circuits to be clocked at better than worst-case delays. Then we propose the BTW scheduler, a 0-1 integer linear programming (ILP) scheduling algorithm whose objective is to minimize the expected latency, providing high-level synthesis support for the stallable-FSM architecture. We implemented the algorithm and ran it on many benchmarks, comparing the results with scheduling algorithms based on worst-case analysis. Our results are promising, showing up to 41% latency reduction for the BTW scheduler alone, and up to 43% latency reduction when combined with the variation-tolerant architecture.
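In schematic form, such a 0-1 ILP might look as follows; this is a hedged reconstruction from the abstract, not the paper's exact formulation. Here x_{i,t} = 1 iff operation i is scheduled at control step t, d_i is its delay, and L is the schedule latency, whose distribution is induced by the variation-aware delay models:

\min \; \mathbb{E}[L] = \sum_{\ell} \ell \cdot \Pr[L = \ell]
\quad \text{s.t.} \quad
\sum_{t} x_{i,t} = 1 \;\; \forall i, \qquad
\sum_{t} t\, x_{j,t} \;\ge\; \sum_{t} t\, x_{i,t} + d_i \;\; \forall (i \to j), \qquad
x_{i,t} \in \{0,1\}.

The first constraint schedules each operation exactly once, and the second enforces data dependences; resource constraints per control step would be added in the same style.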
Citations: 8
On compile-time evaluation of process partitioning transformations for Kahn process networks
Pub Date: 2009-10-11 DOI: 10.1145/1629435.1629441
Sjoerd Meijer, Hristo Nikolov, T. Stefanov
Kahn process networks are an appealing model of computation for programming and mapping applications onto multi-processor platforms. Autonomous processes communicate through unbounded FIFO channels in the absence of a global scheduler. We derive Kahn process networks from sequential applications using the pn compiler, but the derived networks do not necessarily meet the performance requirements. Process partitioning transformations can produce a more balanced network, improving performance significantly. A number of process partitioning transformations can be used, but the designer receives no guidance on which transformation to apply to minimize, for example, execution time. Therefore, we investigate a compile-time approach for selecting the best transformation candidate and show results on a Xilinx Virtex 2 FPGA and the Cell BE processor.
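A back-of-the-envelope view of what a compile-time evaluator must weigh: a pipeline of KPN processes is throughput-limited by its slowest process, and splitting that process divides its per-token work but adds FIFO communication overhead. The C sketch below uses assumed workloads and an assumed overhead constant; it is not the pn compiler's cost model.

/* Toy cost model: steady-state period of a KPN pipeline is the maximum
 * per-token work over its processes; splitting the bottleneck trades
 * lower per-copy work against extra communication overhead. */
#include <stdio.h>

static double pipeline_period(const double *w, int n) {
    double worst = 0.0;
    for (int i = 0; i < n; i++)
        if (w[i] > worst) worst = w[i];
    return worst;                       /* steady-state initiation interval */
}

int main(void) {
    double stages[3] = { 2.0, 10.0, 3.0 };   /* per-token work of 3 processes */
    printf("period before: %.1f\n", pipeline_period(stages, 3));

    /* Split the bottleneck process 2 ways; assume 0.5 per-token overhead
     * for the extra FIFO communication the transformation introduces. */
    double split[4] = { 2.0, 10.0 / 2 + 0.5, 10.0 / 2 + 0.5, 3.0 };
    printf("period after:  %.1f\n", pipeline_period(split, 4));
    return 0;
}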
Citations: 9
Cycle count accurate memory modeling in system level design
Pub Date: 2009-10-11 DOI: 10.1145/1629435.1629475
Y. Lo, Mao Lin Li, R. Tsay
In this paper, we propose an effective automatic generation approach for a Cycle-Count Accurate Memory Model (CCAMM) from the Clocked Finite State Machine (CFSM) of a Cycle Accurate Memory Model (CAMM). Since memory accesses increasingly dominate system activity, a correct and efficient memory timing model is essential to system-level simulation. In general, a CCAMM provides sufficient timing accuracy with low simulation overhead, and hence is preferred over the Simple Fixed Delay Model (SFDM), which has low accuracy, or the CAMM, which has low performance. Our proposed approach can systematically generate the CCAMM and guarantee its correctness. The experimental results show that the generated model is as accurate as the Register Transfer Level (RTL) model while running 100X faster.
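The contrast between a fixed-delay model and a cycle-count accurate one can be illustrated with a toy DRAM row-buffer model in C: keeping just enough state (the open row) lets the model return an exact cycle count per access without simulating each clock edge. The row size and latencies are invented numbers, and the real CCAMM is derived automatically from the CFSM rather than hand-written like this.

/* Toy cycle-count accurate memory model: returns exact per-access cycle
 * counts from minimal FSM state, instead of a single fixed delay. */
#include <stdio.h>

typedef struct { int open_row; } mem_model_t;   /* abstracted FSM state */

static int mem_access_cycles(mem_model_t *m, int addr) {
    int row = addr >> 10;                       /* assumed 1 KiB rows */
    if (row == m->open_row)
        return 3;                               /* row-buffer hit */
    m->open_row = row;
    return 3 + 7;                               /* precharge+activate penalty */
}

int main(void) {
    mem_model_t m = { .open_row = -1 };
    int addrs[4] = { 0x0000, 0x0040, 0x0800, 0x0804 };
    int total = 0;
    for (int i = 0; i < 4; i++) {
        int c = mem_access_cycles(&m, addrs[i]);
        total += c;
        printf("access 0x%04x: %d cycles\n", addrs[i], c);
    }
    printf("total: %d (a fixed-delay model would report %d)\n", total, 4 * 3);
    return 0;
}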
Citations: 8
Minimization of the reconfiguration latency for the mapping of applications on FPGA-based systems
Pub Date: 2009-10-11 DOI: 10.1145/1629435.1629480
V. Rana, S. Murali, David Atienza Alonso, M. Santambrogio, L. Benini, D. Sciuto
Field-Programmable Gate Arrays (FPGAs) have become a promising mapping fabric for the implementation of System-on-Chip (SoC) platforms, due to their large capacity and their enhanced support for dynamic and partial reconfigurability. Design automation support for partial reconfigurability poses several key challenges. In particular, reconfiguration algorithms need to be developed to effectively exploit the available area and run-time reconfiguration support, instantiating at run-time the hardware components needed to execute multiple applications concurrently. These algorithms must achieve maximum application execution performance at minimum reconfiguration overhead. In this work, we propose a novel design flow that minimizes the number of core reconfigurations needed to map multiple applications dynamically (i.e., using run-time reconfiguration) on FPGAs. This new mapping flow features a multi-stage design optimization algorithm that reduces the reconfiguration latency by up to 43% by taking into account the reconfiguration costs and SoC block reuse between the different applications that must be executed dynamically on the FPGA. Moreover, we show that the proposed multi-stage optimization algorithm explores a large set of mapping trade-offs by taking into account the traffic flows of each application, the run-time reconfiguration costs, and the number of reconfigurable regions available on the FPGA.
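The reuse effect that the flow optimizes for can be stated in a few lines of C: when switching between applications, only the cores of the next application that are not already present in the reconfigurable regions must be loaded, so a mapping that maximizes overlap between consecutive core sets cuts reconfiguration latency. The core sets and per-core bitstream load times below are hypothetical.

/* Sketch of block reuse: reconfiguration cost from core set 'cur' to
 * 'next' is the load time of only the cores missing from 'cur'. */
#include <stdio.h>

#define NCORES 8

static double reconfig_cost(const int *cur, const int *next,
                            const double *load_ms) {
    double cost = 0.0;
    for (int c = 0; c < NCORES; c++)
        if (next[c] && !cur[c])
            cost += load_ms[c];         /* pay only for missing cores */
    return cost;
}

int main(void) {
    double load_ms[NCORES] = { 4, 4, 6, 6, 8, 8, 2, 2 };
    int app_a[NCORES] = { 1, 1, 1, 0, 0, 0, 1, 0 };   /* cores app A uses */
    int app_b[NCORES] = { 1, 0, 1, 1, 0, 0, 1, 0 };   /* cores app B uses */
    int empty[NCORES] = { 0 };

    printf("A from scratch: %.0f ms\n", reconfig_cost(empty, app_a, load_ms));
    printf("A -> B (reuse): %.0f ms\n", reconfig_cost(app_a, app_b, load_ms));
    return 0;
}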
Citations: 25