Time redundancy (rollback-recovery) and hardware redundancy are commonly used in real-time systems to achieve fault tolerance. From an energy consumption point of view, time redundancy is generally more preferable than hardware redundancy. However, hard real-time systems often use hardware redundancy to meet high reliability requirements of safety-critical applications. In this paper we propose a hardware-redundancy technique with low energy-overhead for hard real-time systems. The proposed technique is based on standby-sparing, where the system is composed of a primary unit and a spare. Through analytical models, we have developed an online energy-management method which uses a slack reclamation scheme to reduce the energy consumption of both the primary and spare units. In this method, dynamic voltage scaling (DVS) is used for the primary unit and dynamic power management (DPM) is used for the spare. We conducted several experiments to compare the proposed system with a fault-tolerant real-time system which uses time redundancy for fault tolerance and DVS with slack reclamation for low energy consumption. The results show that for relaxed time constraints, the proposed system provides up to 24% energy saving as compared to the time-redundancy system. For tight deadlines when the time-redundancy system can tolerate no faults, the proposed system preserves its fault-tolerance but with about 32% more energy consumption.
{"title":"A standby-sparing technique with low energy-overhead for fault-tolerant hard real-time systems","authors":"A. Ejlali, B. Al-Hashimi, P. Eles","doi":"10.1145/1629435.1629463","DOIUrl":"https://doi.org/10.1145/1629435.1629463","url":null,"abstract":"Time redundancy (rollback-recovery) and hardware redundancy are commonly used in real-time systems to achieve fault tolerance. From an energy consumption point of view, time redundancy is generally more preferable than hardware redundancy. However, hard real-time systems often use hardware redundancy to meet high reliability requirements of safety-critical applications. In this paper we propose a hardware-redundancy technique with low energy-overhead for hard real-time systems. The proposed technique is based on standby-sparing, where the system is composed of a primary unit and a spare. Through analytical models, we have developed an online energy-management method which uses a slack reclamation scheme to reduce the energy consumption of both the primary and spare units. In this method, dynamic voltage scaling (DVS) is used for the primary unit and dynamic power management (DPM) is used for the spare. We conducted several experiments to compare the proposed system with a fault-tolerant real-time system which uses time redundancy for fault tolerance and DVS with slack reclamation for low energy consumption. The results show that for relaxed time constraints, the proposed system provides up to 24% energy saving as compared to the time-redundancy system. For tight deadlines when the time-redundancy system can tolerate no faults, the proposed system preserves its fault-tolerance but with about 32% more energy consumption.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127072885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Packet classification has become an important problem to solve in modern network processors used in networking embedded systems such as routers. Algorithms for matching incoming packets from the network to pre-defined rules, have been proposed by a number of researchers. Current software-based packet classification techniques have low performance, prompting many researchers to move their focus to new architectures encompassing both software and hardware components. Some of the newer hardware architectures exclusively utilize Ternary Content Addressable Memory (TCAM) to improve the performance of rule matching. However, this results in systems with high power consumption. TCAM consumes a high amount of power due to the fact that it reads the entire memory array during every access, much of which is unnecessary. In this paper, we propose LOP, a novel SRAM-based architecture where incoming packets are compared against parts of all rules simultaneously until a single matching rule is found for the compared bits in the packets. This method LOP significantly reduces power consumption as only a segment of the memory is compared against the incoming packet. Despite the additional time penalty to match a single packet, parallel comparison of multiple packets can improve throughput beyond that of the TCAMapproaches, while consuming significantly low power. Nine different benchmarks were tested in two classification systems, with results showing that LOP architectures provide high lookup rates and high throughput, and low power consumption. Compared with a state-of-the-art TCAM implementation (throughput of 495 Million Search per Second (Msps)) in 65nm CMOS technology, on average, LOP saves 43% of energy consumption with a throughput of 590Msps.
{"title":"LOP: a novel SRAM-based architecture for low power and high throughput packet classification","authors":"Xin He, Jorgen Peddersen, S. Parameswaran","doi":"10.1145/1629435.1629455","DOIUrl":"https://doi.org/10.1145/1629435.1629455","url":null,"abstract":"Packet classification has become an important problem to solve in modern network processors used in networking embedded systems such as routers. Algorithms for matching incoming packets from the network to pre-defined rules, have been proposed by a number of researchers. Current software-based packet classification techniques have low performance, prompting many researchers to move their focus to new architectures encompassing both software and hardware components. Some of the newer hardware architectures exclusively utilize Ternary Content Addressable Memory (TCAM) to improve the performance of rule matching. However, this results in systems with high power consumption. TCAM consumes a high amount of power due to the fact that it reads the entire memory array during every access, much of which is unnecessary. In this paper, we propose LOP, a novel SRAM-based architecture where incoming packets are compared against parts of all rules simultaneously until a single matching rule is found for the compared bits in the packets. This method LOP significantly reduces power consumption as only a segment of the memory is compared against the incoming packet. Despite the additional time penalty to match a single packet, parallel comparison of multiple packets can improve throughput beyond that of the TCAMapproaches, while consuming significantly low power. Nine different benchmarks were tested in two classification systems, with results showing that LOP architectures provide high lookup rates and high throughput, and low power consumption. Compared with a state-of-the-art TCAM implementation (throughput of 495 Million Search per Second (Msps)) in 65nm CMOS technology, on average, LOP saves 43% of energy consumption with a throughput of 590Msps.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114621289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic routing is desirable because of its substantial improvement in communication bandwidth and intelligent adaptation to faulty links and congested traffics. However, implementation of adaptive routing in a network-on-chip (NoC) system is not trivial and further complicated by the requirements of deadlock-free and real-time optimal decision making. In this paper, we present a deadlock-free routing architecture which employs a dynamic programming (DP) network to provide on-the-fly optimal path planning and network monitoring for packet switching. Also, a new routing strategy called k-step look ahead is introduced. This new strategy can substantially reduced the size of routing table and maintain a high quality of adaptation which leads to a scalable dynamic routing solution with minimal hardware overhead. Our results based on a cycle-accurate simulator demonstrate the effectiveness of the DP-network, which outperforms both the deterministic and adaptive routing algorithms in average delay on various traffic scenarios by 22.3%. Moreover, the hardware overhead for DP-network is insignificant based on the results obtained from the hardware implementations.
{"title":"A DP-network for optimal dynamic routing in network-on-chip","authors":"T. Mak, P. Cheung, W. Luk, K. Lam","doi":"10.1145/1629435.1629452","DOIUrl":"https://doi.org/10.1145/1629435.1629452","url":null,"abstract":"Dynamic routing is desirable because of its substantial improvement in communication bandwidth and intelligent adaptation to faulty links and congested traffics. However, implementation of adaptive routing in a network-on-chip (NoC) system is not trivial and further complicated by the requirements of deadlock-free and real-time optimal decision making. In this paper, we present a deadlock-free routing architecture which employs a dynamic programming (DP) network to provide on-the-fly optimal path planning and network monitoring for packet switching. Also, a new routing strategy called k-step look ahead is introduced. This new strategy can substantially reduced the size of routing table and maintain a high quality of adaptation which leads to a scalable dynamic routing solution with minimal hardware overhead. Our results based on a cycle-accurate simulator demonstrate the effectiveness of the DP-network, which outperforms both the deterministic and adaptive routing algorithms in average delay on various traffic scenarios by 22.3%. Moreover, the hardware overhead for DP-network is insignificant based on the results obtained from the hardware implementations.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123405453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In a multiprocessor system-on-chip (MPSoC) private caches introduce the cache coherence problem. Here, we target at heterogeneous MPSoCs with a network-on-chip (NoC). Existing hardware cache coherence protocols are less suitable for MPSoCs because many off-the-shelf processors used in MPSoCs do not support these protocols. Furthermore, these protocols typically rely on global visibility and serialization of writes which does not match well with the parallel point-to-point communication provided by a NoC. Therefore, we propose a software cache coherence protocol, which can be applied in a heterogeneous MPSoC with a NoC. The software cache coherence protocol relies on explicit synchronization in the software. More specifically, caches are guaranteed to be coherent according to the Release Consistency model, on top of which we have implemented the standard Pthreads communication library. Heterogeneous MPSoCs with off-the-shelf processors can easily be supported, because processors are only required to provide cache control operations, e.g., clean and invalidate. All cache coherence operations are interruptible and do not impact the execution of tasks on other processors, therefore this protocol is suitable for predictable MPSoCs. Our software cache coherence protocol is implemented on an ARM926EJ-S MPSoC which is mapped on an FPGA. From experiments we conclude that the protocol overhead is low for the applications taken from the SPLASH-2 benchmark set. For these applications we observed a speedup between 1.89 and 2.01 on the two processor MPSoC.
{"title":"A tuneable software cache coherence protocol for heterogeneous MPSoCs","authors":"Frank E. B. Ophelders, M. Bekooij, H. Corporaal","doi":"10.1145/1629435.1629488","DOIUrl":"https://doi.org/10.1145/1629435.1629488","url":null,"abstract":"In a multiprocessor system-on-chip (MPSoC) private caches introduce the cache coherence problem. Here, we target at heterogeneous MPSoCs with a network-on-chip (NoC). Existing hardware cache coherence protocols are less suitable for MPSoCs because many off-the-shelf processors used in MPSoCs do not support these protocols. Furthermore, these protocols typically rely on global visibility and serialization of writes which does not match well with the parallel point-to-point communication provided by a NoC. Therefore, we propose a software cache coherence protocol, which can be applied in a heterogeneous MPSoC with a NoC. The software cache coherence protocol relies on explicit synchronization in the software. More specifically, caches are guaranteed to be coherent according to the Release Consistency model, on top of which we have implemented the standard Pthreads communication library. Heterogeneous MPSoCs with off-the-shelf processors can easily be supported, because processors are only required to provide cache control operations, e.g., clean and invalidate. All cache coherence operations are interruptible and do not impact the execution of tasks on other processors, therefore this protocol is suitable for predictable MPSoCs. Our software cache coherence protocol is implemented on an ARM926EJ-S MPSoC which is mapped on an FPGA. From experiments we conclude that the protocol overhead is low for the applications taken from the SPLASH-2 benchmark set. For these applications we observed a speedup between 1.89 and 2.01 on the two processor MPSoC.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132543101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A growing number of applications, with diverse requirements, are integrated on the same System on Chip (SoC) in the form of hardware and software Intellectual Property (IP). The diverse requirements, coupled with the IPs being developed by unrelated design teams, lead to multiple communication paradigms, programming models, and interface protocols that the on-chip interconnect must accommodate. Traditionally, on-chip buses offer distributed shared memory communication with established memory-consistency models, but are tightly coupled to a specific interface protocol. On-chip networks, on the other hand, offer layering and interface abstraction, but are centred around point-to-point streaming communication, and do not address issues at the higher layers in the protocol stack, such as memory-consistency models and message-dependent deadlock. In this work we introduce an on-chip interconnect and protocol stack that combines streaming and distributed shared memory communication. The proposed interconnect offers an established memory-consistency model and does not restrict any higher-level protocol dependencies. We present the protocol stack and the architectural blocks and quantify the cost, both on the block level and for a complete SoC. For a multi-processor multi-application SoC with multiple communication paradigms and programming models, our proposed interconnect occupies only 4% of the chip area.
{"title":"An on-chip interconnect and protocol stack for multiple communication paradigms and programming models","authors":"Andreas Hansson, K. Goossens","doi":"10.1145/1629435.1629450","DOIUrl":"https://doi.org/10.1145/1629435.1629450","url":null,"abstract":"A growing number of applications, with diverse requirements, are integrated on the same System on Chip (SoC) in the form of hardware and software Intellectual Property (IP). The diverse requirements, coupled with the IPs being developed by unrelated design teams, lead to multiple communication paradigms, programming models, and interface protocols that the on-chip interconnect must accommodate.\u0000 Traditionally, on-chip buses offer distributed shared memory communication with established memory-consistency models, but are tightly coupled to a specific interface protocol. On-chip networks, on the other hand, offer layering and interface abstraction, but are centred around point-to-point streaming communication, and do not address issues at the higher layers in the protocol stack, such as memory-consistency models and message-dependent deadlock.\u0000 In this work we introduce an on-chip interconnect and protocol stack that combines streaming and distributed shared memory communication. The proposed interconnect offers an established memory-consistency model and does not restrict any higher-level protocol dependencies. We present the protocol stack and the architectural blocks and quantify the cost, both on the block level and for a complete SoC. For a multi-processor multi-application SoC with multiple communication paradigms and programming models, our proposed interconnect occupies only 4% of the chip area.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"19 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114028686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael A. Baker, Pravin Dalale, Karam S. Chatha, S. Vrudhula
The H.264 video codec provides exceptional video compression while imposing dramatic increases in computational complexity over previous standards. While exploiting parallelism in H.264 is notoriously difficult, successful parallel implementations promise substantial performance gains, particularly as High Definition (HD) content penetrates a widening variety of applications. We present a highly scalable parallelization scheme implemented on IBM's multicore Cell Broadband Engine (CBE) and based on FFmpeg's open source H.264 video decoder. We address resource limitations and complex data dependencies to achieve nearly ideal decoding speedup for the parallelizable portion of the encoded stream. Our decoder achieves better performance than previous implementations, and is deeply scalable for large format video. We discuss architecture and codec specific performance optimizations, code overlays, data structures, memory access scheduling, and vectorization.
{"title":"A scalable parallel H.264 decoder on the cell broadband engine architecture","authors":"Michael A. Baker, Pravin Dalale, Karam S. Chatha, S. Vrudhula","doi":"10.1145/1629435.1629484","DOIUrl":"https://doi.org/10.1145/1629435.1629484","url":null,"abstract":"The H.264 video codec provides exceptional video compression while imposing dramatic increases in computational complexity over previous standards. While exploiting parallelism in H.264 is notoriously difficult, successful parallel implementations promise substantial performance gains, particularly as High Definition (HD) content penetrates a widening variety of applications. We present a highly scalable parallelization scheme implemented on IBM's multicore Cell Broadband Engine (CBE) and based on FFmpeg's open source H.264 video decoder. We address resource limitations and complex data dependencies to achieve nearly ideal decoding speedup for the parallelizable portion of the encoded stream. Our decoder achieves better performance than previous implementations, and is deeply scalable for large format video. We discuss architecture and codec specific performance optimizations, code overlays, data structures, memory access scheduling, and vectorization.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123629725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
There has been a recent shift in design paradigms, with many turning towards yield-driven approaches to synthesize and design systems. A major cause of this shift is the continual scaling of transistors, making process variation impossible to ignore. Better than worst-case (BTW) designs also exploit these variation effects, while also addressing performance limits due to worst-case analysis. In this paper we first present the variation-tolerant stallable-FSM architecture, which provides fault detection and recovery, allowing circuits to be clocked at better than worst-case delays. Then we propose the BTW scheduler, a 0-1 integer linear programming (ILP) scheduling algorithm with the objective of minimizing the expected latency, to provide a high-level synthesis aid for the stallable-FSM architecture. We implemented the algorithm and ran it through many benchmarks, comparing the results with scheduling algorithms based on worst-case analysis. Our results were promising, showing up to 41% latency reduction for the BTW scheduler, and up to 43% latency reduction when combined with the variation-tolerant architecture.
{"title":"A variation-tolerant scheduler for better than worst-case behavioral synthesis","authors":"J. Cong, Albert Liu, B. Liu","doi":"10.1145/1629435.1629467","DOIUrl":"https://doi.org/10.1145/1629435.1629467","url":null,"abstract":"There has been a recent shift in design paradigms, with many turning towards yield-driven approaches to synthesize and design systems. A major cause of this shift is the continual scaling of transistors, making process variation impossible to ignore. Better than worst-case (BTW) designs also exploit these variation effects, while also addressing performance limits due to worst-case analysis. In this paper we first present the variation-tolerant stallable-FSM architecture, which provides fault detection and recovery, allowing circuits to be clocked at better than worst-case delays. Then we propose the BTW scheduler, a 0-1 integer linear programming (ILP) scheduling algorithm with the objective of minimizing the expected latency, to provide a high-level synthesis aid for the stallable-FSM architecture. We implemented the algorithm and ran it through many benchmarks, comparing the results with scheduling algorithms based on worst-case analysis. Our results were promising, showing up to 41% latency reduction for the BTW scheduler, and up to 43% latency reduction when combined with the variation-tolerant architecture.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125252287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kahn Process Networks is an appealing model of computation for programming and mapping applications onto multi-processor platforms. Autonomous processes communicate through unbounded FIFO channels in absence of a global scheduler. We derive Kahn process networks from sequential applications using the pn compiler, but the derived networks do not necessarily meet the performance requirements. Process partitioning transformations can achieve a more balanced network improving the performance results significantly. There are a number of process partitioning transformations that can be used, but no hints are given to the designer which transformation should be applied to minimize, for example, the execution time. Therefore, we investigate a compile-time approach for selecting the best transformation candidate and show results on a Xilinx Virtex 2 FPGA and the Cell BE processor.
{"title":"On compile-time evaluation of process partitioning transformations for Kahn process networks","authors":"Sjoerd Meijer, Hristo Nikolov, T. Stefanov","doi":"10.1145/1629435.1629441","DOIUrl":"https://doi.org/10.1145/1629435.1629441","url":null,"abstract":"Kahn Process Networks is an appealing model of computation for programming and mapping applications onto multi-processor platforms. Autonomous processes communicate through unbounded FIFO channels in absence of a global scheduler. We derive Kahn process networks from sequential applications using the pn compiler, but the derived networks do not necessarily meet the performance requirements. Process partitioning transformations can achieve a more balanced network improving the performance results significantly. There are a number of process partitioning transformations that can be used, but no hints are given to the designer which transformation should be applied to minimize, for example, the execution time. Therefore, we investigate a compile-time approach for selecting the best transformation candidate and show results on a Xilinx Virtex 2 FPGA and the Cell BE processor.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126800586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we propose an effective automatic generation approach for a Cycle-Count Accurate Memory Model (CCAMM) from the Clocked Finite State Machine (CFSM) of the Cycle Accurate Memory Model (CAMM). Since memory accesses are gradually dominating system activities, a correct and efficient memory timing model is essential to system-level simulation. In general, a CCAMM provides sufficient timing accuracy with low simulation overhead, and hence is preferred over the Simple Fixed Delay Model (SFDM), which has low accuracy, or the CAMM, which has low performance. Our proposed approach can systematically generate the CCAMM and guarantee correctness. The experimental results show that the generated model is as accurate as the Register Transfer Level (RTL) model while running 100X faster.
{"title":"Cycle count accurate memory modeling in system level design","authors":"Y. Lo, Mao Lin Li, R. Tsay","doi":"10.1145/1629435.1629475","DOIUrl":"https://doi.org/10.1145/1629435.1629475","url":null,"abstract":"In this paper, we propose an effective automatic generation approach for a Cycle-Count Accurate Memory Model (CCAMM) from the Clocked Finite State Machine (CFSM) of the Cycle Accurate Memory Model (CAMM). Since memory accesses are gradually dominating system activities, a correct and efficient memory timing model is essential to system-level simulation. In general, a CCAMM provides sufficient timing accuracy with low simulation overhead, and hence is preferred over the Simple Fixed Delay Model (SFDM), which has low accuracy, or the CAMM, which has low performance. Our proposed approach can systematically generate the CCAMM and guarantee correctness. The experimental results show that the generated model is as accurate as the Register Transfer Level (RTL) model while running 100X faster.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134570168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
V. Rana, S. Murali, David Atienza Alonso, M. Santambrogio, L. Benini, D. Sciuto
Field-Programmable Gate Arrays (FPGAs) have become promising mapping fabric for the implementation of System-on-Chip (SoC) platforms, due to their large capacity and their enhanced support for dynamic and partial reconfigurability. Design automation support for partial reconfigurability includes several key challenges. In particular, reconfiguration algorithms need to be developed to effectively exploit the available area and run-time reconfiguration support for instantiating at run-time the hardware components needed to execute multiple applications concurrently. These new algorithms must be able to achieve maximum application execution performance at a minimum reconfiguration overhead. In this work, we propose a novel design flow that minimizes the amount of core reconfigurations needed to map multiple applications dynamically (i.e., using run-time reconfiguration) on FPGAs. This new mapping flow features a multi-stage design optimization algorithm that makes it possible to reduce the reconfiguration latency up to 43%, by taking into account the reconfiguration costs and SoC block reuse between the different applications that need to be executed dynamically on the FPGA. Moreover, we show that the proposed multi-stage optimization algorithm explores a large set of mapping trade-offs, by taking into account the traffic flows for each application, the run-time reconfiguration costs and the number of reconfigurable regions available on the FPGA.
{"title":"Minimization of the reconfiguration latency for the mapping of applications on FPGA-based systems","authors":"V. Rana, S. Murali, David Atienza Alonso, M. Santambrogio, L. Benini, D. Sciuto","doi":"10.1145/1629435.1629480","DOIUrl":"https://doi.org/10.1145/1629435.1629480","url":null,"abstract":"Field-Programmable Gate Arrays (FPGAs) have become promising mapping fabric for the implementation of System-on-Chip (SoC) platforms, due to their large capacity and their enhanced support for dynamic and partial reconfigurability. Design automation support for partial reconfigurability includes several key challenges. In particular, reconfiguration algorithms need to be developed to effectively exploit the available area and run-time reconfiguration support for instantiating at run-time the hardware components needed to execute multiple applications concurrently. These new algorithms must be able to achieve maximum application execution performance at a minimum reconfiguration overhead.\u0000 In this work, we propose a novel design flow that minimizes the amount of core reconfigurations needed to map multiple applications dynamically (i.e., using run-time reconfiguration) on FPGAs. This new mapping flow features a multi-stage design optimization algorithm that makes it possible to reduce the reconfiguration latency up to 43%, by taking into account the reconfiguration costs and SoC block reuse between the different applications that need to be executed dynamically on the FPGA. Moreover, we show that the proposed multi-stage optimization algorithm explores a large set of mapping trade-offs, by taking into account the traffic flows for each application, the run-time reconfiguration costs and the number of reconfigurable regions available on the FPGA.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133245500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}