Field-programmable gate arrays (FPGAs) often achieve order-of-magnitude speedups compared to microprocessors, but have typically been unable to improve the performance of applications with irregular memory access patterns, such as traversals of pointer-based data structures. Because these data structures are so common, the applicability and widespread success of FPGAs have been limited. In this paper, we introduce the traversal cache framework, a first step towards improving the performance of FPGA applications that utilize pointer-based data structures. The traversal cache is a local FPGA memory that stores repeated traversals of pointer-based data structures, allowing these traversals to be efficiently streamed into the FPGA. Although the cache is generally limited to improving applications that exhibit repeated traversals, we show that many applications in fact have this characteristic. Furthermore, we show that few repetitions are needed to achieve performance improvements. We present experimental results showing that FPGA implementations using the traversal cache framework achieve speedups ranging from 7x to 29x compared to pointer-based software on a 3.2 GHz Xeon.
{"title":"Traversal caches: a first step towards FPGA acceleration of pointer-based data structures","authors":"G. Stitt, Gaurav Chaudhari, J. Coole","doi":"10.1145/1450135.1450150","DOIUrl":"https://doi.org/10.1145/1450135.1450150","url":null,"abstract":"Field-programmable gate arrays (FPGAs) often achieve order of magnitude speedups compared to microprocessors, but typically have been unable to improve the performance of applications with irregular memory access patterns, such as traversals of pointer-based data structures. Due to the common use of these data structures, the applicability and widespread success of FPGAs has been limited. In this paper, we introduce the traversal cache framework - a first step towards improving the performance of FPGA applications that utilize pointer-based data structures. The traversal cache is a local FPGA memory that stores repeated traversals of pointer-based data structures, allowing for these traversals to be efficiently streamed into the FPGA. Although the cache is generally limited to improving applications that exhibit repeated traversals, we show that many applications in fact have this characteristic. Furthermore, we show that few repetitions are needed to achieve performance improvements. We present experimental results showing that FPGA implementations using the traversal cache framework achieve speedups ranging from 7x to 29x compared to pointer-based software on a 3.2 GHz Xeon.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134420644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
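The traversal cache abstract above describes storing repeated traversals of a pointer-based structure in local memory so that they can be streamed instead of chased pointer by pointer. A rough software analogue of that idea, offered only as an illustrative sketch, might look like the following. The names `Node` and `TraversalCache`, and the choice of a linked list keyed by its head node, are assumptions made for this example and are not taken from the paper, which targets FPGA hardware.

```python
class Node:
    """A singly linked list node (hypothetical example structure)."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


class TraversalCache:
    """Hypothetical software analogue of a traversal cache.

    The first traversal from a given head follows the pointers and records
    the visited values in a contiguous list. Repeated traversals return the
    recorded list, which can be read sequentially without pointer chasing.
    """

    def __init__(self):
        self._streams = {}  # head node -> list of values in traversal order
        self.hits = 0
        self.misses = 0

    def traverse(self, head):
        cached = self._streams.get(head)
        if cached is not None:
            self.hits += 1
            return cached
        # Miss: walk the structure once and remember the resulting stream.
        self.misses += 1
        values = []
        node = head
        while node is not None:
            values.append(node.value)
            node = node.next
        self._streams[head] = values
        return values

    def invalidate(self, head):
        """Drop a recorded traversal, e.g. after the list is modified."""
        self._streams.pop(head, None)


head = Node(1, Node(2, Node(3)))
cache = TraversalCache()
first = cache.traverse(head)   # walks the pointers
second = cache.traverse(head)  # served from the recorded stream
```

The sketch only pays off when the same traversal repeats, which matches the limitation stated in the abstract. Any modification of the list requires an explicit `invalidate` call in this simplified version.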
H. Shojaei, T. Basten, M. Geilen, Phillip Stanley-Marbell
The compositional computation of Pareto points in multi-dimensional optimization problems is an important means to efficiently explore the optimization space. This paper presents a symbolic Pareto calculator, SPaC, for the algebraic computation of multidimensional trade-offs. SPaC uses BDDs as a representation for solution sets and operations on them. The tool can be used in multi-criteria optimization and design-space exploration of embedded systems. The paper describes the design and implementation of Pareto algebra operations, and it shows that BDDs can be used effectively in Pareto optimization.
{"title":"SPaC: a symbolic pareto calculator","authors":"H. Shojaei, T. Basten, M. Geilen, Phillip Stanley-Marbell","doi":"10.1145/1450135.1450176","DOIUrl":"https://doi.org/10.1145/1450135.1450176","url":null,"abstract":"The compositional computation of Pareto points in multi-dimensional optimization problems is an important means to efficiently explore the optimization space. This paper presents a symbolic Pareto calculator, SPaC, for the algebraic computation of multidimensional trade-offs. SPaC uses BDDs as a representation for solution sets and operations on them. The tool can be used in multi-criteria optimization and design-space exploration of embedded systems. The paper describes the design and implementation of Pareto algebra operations, and it shows that BDDs can be used effectively in Pareto optimization.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133333591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
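The SPaC abstract above refers to computing Pareto points of multi-dimensional trade-offs. As a minimal sketch of what a Pareto minimization step computes, the following uses plain explicit enumeration over tuples rather than the BDD-based symbolic representation that SPaC uses. The function names `dominates` and `pareto_minimize`, and the assumption that smaller is better in every dimension, are choices made for this example only.

```python
def dominates(a, b):
    """Return True if point a is at least as good as b in every dimension
    and differs from b (smaller values are assumed to be better)."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def pareto_minimize(points):
    """Return the sorted set of points not dominated by any other point.

    This is an explicit O(n^2) enumeration, not a symbolic (BDD) method.
    """
    unique = set(points)
    return sorted(
        p for p in unique if not any(dominates(q, p) for q in unique)
    )


# Each tuple is a hypothetical (latency, energy) configuration.
configs = [(1, 5), (2, 2), (3, 3), (5, 1), (2, 2)]
front = pareto_minimize(configs)
```

Here `(3, 3)` is dropped because `(2, 2)` is better in both dimensions, while the remaining points each represent a distinct trade-off.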
Using traditional software profiling to optimize embedded software in an MPSoC design is not reliable. With multiple processors running concurrently and programs interacting, traditional profiling on individual processors cannot capture useful execution information to assist software optimization. A new method to model parallel executions of interacting programs is needed. In this paper, we consider the software optimization problem for throughput-constrained MPSoC designs. We define the "longest delay path" as a sequence of steps leading to a throughput constraint violation and propose an algorithm to build up the path dynamically during simulation. Using an industrial-strength MPEG-2 decoder design in our case study and custom instructions for software optimization, we show that we can optimize the software efficiently in MPSoC designs using frequently executed statement information from the longest delay path.
{"title":"Software optimization for MPSoC: a mpeg-2 decoder case study","authors":"Eric Cheung, H. Hsieh, F. Balarin","doi":"10.1145/1450135.1450146","DOIUrl":"https://doi.org/10.1145/1450135.1450146","url":null,"abstract":"Using traditional software profiling to optimize embedded software in an MPSoC design is not reliable. With multiple processors running concurrently and programs interacting, traditional profiling on individual processors cannot capture useful execution information to assist software optimization. A new method to model parallel executions of interacting programs is needed. In this paper, we consider the software optimization problem for throughput-constrained MPSoC designs. We define the \"longest delay path\" as a sequence of steps leading to a throughput constraint violation and propose an algorithm to build up the path dynamically during simulation. Using an industrial-strength MPEG-2 decoder design in our case study and custom instructions for software optimization, we show that we can optimize the software efficiently in MPSoC designs using frequently executed statement information from the longest delay path.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124565984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Architectures with software-writable parameters, or configurable architectures, enable runtime reconfiguration of computing platforms to the applications they execute. Such dynamic tuning can improve application performance, as well as energy. However, reconfiguring incurs a temporary performance cost. Thus, online algorithms are needed that decide when to reconfigure and which configuration to choose such that overall performance is optimized. We introduce the adaptive weighted window (AWW) algorithm, and compare with several other algorithms, including algorithms previously developed by the online algorithm community. We describe experiments showing that AWW results are within 4% of the offline optimal on average. AWW outperforms the other algorithms, and is robust across three datasets and across three categories of application sequences too. AWW improves a non-dynamic approach on average by 6%, and by up to 30% in low-reconfiguration-time situations.
{"title":"Dynamic tuning of configurable architectures: the AWW online algorithm","authors":"Chen-Chun Huang, David Sheldon, F. Vahid","doi":"10.1145/1450135.1450158","DOIUrl":"https://doi.org/10.1145/1450135.1450158","url":null,"abstract":"Architectures with software-writable parameters, or configurable architectures, enable runtime reconfiguration of computing platforms to the applications they execute. Such dynamic tuning can improve application performance, as well as energy. However, reconfiguring incurs a temporary performance cost. Thus, online algorithms are needed that decide when to reconfigure and which configuration to choose such that overall performance is optimized. We introduce the adaptive weighted window (AWW) algorithm, and compare with several other algorithms, including algorithms previously developed by the online algorithm community. We describe experiments showing that AWW results are within 4% of the offline optimal on average. AWW outperforms the other algorithms, and is robust across three datasets and across three categories of application sequences too. AWW improves a non-dynamic approach on average by 6%, and by up to 30% in low-reconfiguration-time situations.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127928321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Functional validation is a major bottleneck in microprocessor design methodology. Simulation is the widely used method for functional validation using billions of random and biased-random test programs. Although directed tests require a smaller test set compared to random tests to achieve the same functional coverage goal, there is a lack of automated techniques for directed test generation. Furthermore, the number of directed tests can still be prohibitively large. This paper presents a methodology for specification-based coverage analysis and test generation. The primary contribution of this paper is a compaction technique that can drastically reduce the required number of directed test programs to achieve a coverage goal. Our experimental results using a MIPS processor and an industrial processor (e500) demonstrate more than 90% reduction in number of directed tests without sacrificing the functional coverage goal.
{"title":"Specification-based compaction of directed tests for functional validation of pipelined processors","authors":"Heon-Mo Koo, P. Mishra","doi":"10.1145/1450135.1450167","DOIUrl":"https://doi.org/10.1145/1450135.1450167","url":null,"abstract":"Functional validation is a major bottleneck in microprocessor design methodology. Simulation is the widely used method for functional validation using billions of random and biased-random test programs. Although directed tests require a smaller test set compared to random tests to achieve the same functional coverage goal, there is a lack of automated techniques for directed test generation. Furthermore, the number of directed tests can still be prohibitively large. This paper presents a methodology for specification-based coverage analysis and test generation. The primary contribution of this paper is a compaction technique that can drastically reduce the required number of directed test programs to achieve a coverage goal. Our experimental results using a MIPS processor and an industrial processor (e500) demonstrate more than 90% reduction in number of directed tests without sacrificing the functional coverage goal.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123530802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Abramovici, K. Goossens, B. Vermeulen, J. Greenbaum, N. Stollon, A. Donlin
In this special session we explore holistic approaches to hardware/software debug that use or integrate transaction level models (TLMs). We present several TLM-based approaches to system-level diagnostics, ranging from use of most popular transaction level modeling languages through to hybrid technologies that combine TLMs with other well known diagnostic tools like in-silicon trace logic.
{"title":"You can catch more bugs with transaction level honey","authors":"M. Abramovici, K. Goossens, B. Vermeulen, J. Greenbaum, N. Stollon, A. Donlin","doi":"10.1145/1450135.1450163","DOIUrl":"https://doi.org/10.1145/1450135.1450163","url":null,"abstract":"In this special session we explore holistic approaches to hardware/software debug that use or integrate transaction level models (TLMs). We present several TLM-based approaches to system-level diagnostics, ranging from use of most popular transaction level modeling languages through to hybrid technologies that combine TLMs with other well known diagnostic tools like in-silicon trace logic.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128869341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We study the problem of scheduling repetitive real-time tasks with the Earliest Deadline First (EDF) policy that can guarantee the given maximal temperature constraint. We show that the traditional scheduling approach, i.e., to repeat the schedule that is feasible through the range of one hyper-period, does not apply any more. Then, we present necessary and sufficient conditions for real-time schedules to guarantee the maximal temperature constraint. Based on these conditions, a novel scheduling algorithm is proposed for developing the appropriate schedule that can ensure the maximal temperature guarantee. Finally, we use experiments to evaluate the performance of our approach.
{"title":"Guaranteed scheduling for repetitive hard real-time tasks under the maximal temperature constraint","authors":"Gang Quan, Yan Zhang, William Wiles, Pei Pei","doi":"10.1145/1450135.1450196","DOIUrl":"https://doi.org/10.1145/1450135.1450196","url":null,"abstract":"We study the problem of scheduling repetitive real-time tasks with the Earliest Deadline First (EDF) policy that can guarantee the given maximal temperature constraint. We show that the traditional scheduling approach, i.e., to repeat the schedule that is feasible through the range of one hyper-period, does not apply any more. Then, we present necessary and sufficient conditions for real-time schedules to guarantee the maximal temperature constraint. Based on these conditions, a novel scheduling algorithm is proposed for developing the appropriate schedule that can ensure the maximal temperature guarantee. Finally, we use experiments to evaluate the performance of our approach.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133527383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
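The abstract above contrasts its approach with the traditional practice of repeating a schedule that is feasible over one hyper-period. For reference only, that traditional EDF construction can be sketched as follows. This is a minimal sketch that assumes integer time units, implicit deadlines (deadline equal to period), and no temperature model at all; `hyper_period` and `edf_schedule` are hypothetical names, and this is not the temperature-aware algorithm proposed in the paper.

```python
from functools import reduce
from math import gcd


def hyper_period(periods):
    """Least common multiple of all task periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def edf_schedule(tasks):
    """Simulate preemptive EDF over one hyper-period.

    tasks: list of (wcet, period) pairs in integer time units.
    Returns one entry per time slot: the index of the running task,
    or None when the processor is idle. Raises ValueError on a
    deadline miss.
    """
    horizon = hyper_period([period for _, period in tasks])
    jobs = []  # each job is [absolute_deadline, task_index, remaining_time]
    timeline = []
    for t in range(horizon):
        # Release a new job of every task whose period starts now.
        for index, (wcet, period) in enumerate(tasks):
            if t % period == 0:
                jobs.append([t + period, index, wcet])
        # Any unfinished job whose deadline has passed is a miss.
        if any(deadline <= t for deadline, _, _ in jobs):
            raise ValueError(f"deadline miss at time {t}")
        if jobs:
            job = min(jobs)  # earliest deadline first, ties by task index
            timeline.append(job[1])
            job[2] -= 1
            if job[2] == 0:
                jobs.remove(job)
        else:
            timeline.append(None)
    if jobs:
        raise ValueError("deadline miss at end of hyper-period")
    return timeline


# Two hypothetical tasks: (wcet=1, period=2) and (wcet=1, period=4).
schedule = edf_schedule([(1, 2), (1, 4)])
```

The abstract's point is that, once a maximal temperature constraint is added, simply repeating such a one-hyper-period schedule no longer guarantees feasibility, so this baseline should be read as the starting point the paper argues against.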
This paper proposes a new link for asynchronous NoC communications that is resilient to transient faults on the wires of the link without impact on the data transfer capability. Resilience to transients is achieved by exploiting the phase relationship between data symbols and a common reference symbol where the symbols are transmitted using additional wires. Detection of transient faults is performed by comparison of the data symbol and the reference symbol. We demonstrate it is possible to achieve a similar number of transitions per bit as existing delay insensitive codes, from a power consumption point of view, but achieving resilience to transient faults. The link has been synthesized and validated using 0.12 µm technology and power, area and performance are given. It has been shown that the link area cost is 409 µm² per data bit and energy per bit is 356 fJ/bit. Latency through the link is 0.8 ns and the maximum operating frequency or throughput of the link is 1.056 GHz.
{"title":"Asynchronous transient resilient links for NoC","authors":"S. Ogg, B. Al-Hashimi, A. Yakovlev","doi":"10.1145/1450135.1450182","DOIUrl":"https://doi.org/10.1145/1450135.1450182","url":null,"abstract":"This paper proposes a new link for asynchronous NoC communications that is resilient to transient faults on the wires of the link without impact on the data transfer capability. Resilience to transients is achieved by exploiting the phase relationship between data symbols and a common reference symbol where the symbols are transmitted using additional wires. Detection of transient faults is performed by comparison of the data symbol and the reference symbol. We demonstrate it is possible to achieve a similar number of transitions per bit as existing delay insensitive codes, from a power consumption point of view, but achieving resilience to transient faults. The link has been synthesized and validated using 0.12 µm technology and power, area and performance are given. It has been shown that the link area cost is 409 µm² per data bit and energy per bit is 356 fJ/bit. Latency through the link is 0.8 ns and the maximum operating frequency or throughput of the link is 1.056 GHz.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"156 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114658004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Open Core Protocol (OCP) is a standard on-chip core interface specification. The current release is flexible and configurable to support the communication needs of a wide range of Intellectual Property cores, and is now in widespread use. However, it does not support system-level coherence. This paper summarizes an effort within the OCP-IP cache coherence working group on incorporating cache coherence extensions into OCP, which is expected to have strong impact on the MPSoC industry. In this paper, we propose a backward-compatible coherent Open Core Protocol interface and discuss the design challenges and implications introduced. This interface is flexible and can support a range of coherence protocols and schemes: we show how it can specify a snoopy bus-based scheme as well as a directory-based scheme. The correctness of the specification and models was verified using NuSMV, via exploring the entire state space for the two basic coherence schemes.
{"title":"Extending open core protocol to support system-level cache coherence","authors":"K. Aisopos, Chien-Chun Chou, L. Peh","doi":"10.1145/1450135.1450173","DOIUrl":"https://doi.org/10.1145/1450135.1450173","url":null,"abstract":"Open Core Protocol (OCP) is a standard on-chip core interface specification. The current release is flexible and configurable to support the communication needs of a wide range of Intellectual Property cores, and is now in widespread use. However, it does not support system-level coherence. This paper summarizes an effort within the OCP-IP cache coherence working group on incorporating cache coherence extensions into OCP, which is expected to have strong impact on the MPSoC industry. In this paper, we propose a backward-compatible coherent Open Core Protocol interface and discuss the design challenges and implications introduced. This interface is flexible and can support a range of coherence protocols and schemes: we show how it can specify a snoopy bus-based scheme as well as a directory-based scheme. The correctness of the specification and models was verified using NuSMV, via exploring the entire state space for the two basic coherence schemes.","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"27 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129372115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Software-controlled scratchpad memory is increasingly employed in embedded systems as it offers better timing predictability compared to caches. Previous scratchpad allocation algorithms typically consider single process applications. But embedded applications are mostly multi-tasking with real-time constraints, where the scratchpad memory space has to be shared among interacting processes that may preempt each other. In this paper, we develop a novel dynamic scratchpad allocation technique that takes these process interferences into account to improve the performance and predictability of the memory system. We model the application as a Message Sequence Chart (MSC) to best capture the interprocess interactions. Our goal is to optimize the worst-case response time (WCRT) of the application through runtime reloading of the scratchpad memory content at appropriate execution points. We propose an iterative allocation algorithm that consists of two critical steps: (1) analyze the MSC along with the existing allocation to determine potential interference patterns, and (2) exploit this interference information to tune the scratchpad reloading points and content so as to best improve the WCRT. We evaluate our memory allocation scheme on a real-world embedded application controlling an Unmanned Aerial Vehicle (UAV).
{"title":"Scratchpad allocation for concurrent embedded software","authors":"Vivy Suhendra, Abhik Roychoudhury, T. Mitra","doi":"10.1145/1450135.1450145","DOIUrl":"https://doi.org/10.1145/1450135.1450145","url":null,"abstract":"Software-controlled scratchpad memory is increasingly employed in embedded systems as it offers better timing predictability compared to caches. Previous scratchpad allocation algorithms typically consider single process applications. But embedded applications are mostly multi-tasking with real-time constraints, where the scratchpad memory space has to be shared among interacting processes that may preempt each other. In this paper, we develop a novel dynamic scratchpad allocation technique that takes these process interferences into account to improve the performance and predictability of the memory system. We model the application as a Message Sequence Chart (MSC) to best capture the interprocess interactions. Our goal is to optimize the worst-case response time (WCRT) of the application through runtime reloading of the scratchpad memory content at appropriate execution points. We propose an iterative allocation algorithm that consists of two critical steps: (1) analyze the MSC along with the existing allocation to determine potential interference patterns, and (2) exploit this interference information to tune the scratchpad reloading points and content so as to best improve the WCRT. We evaluate our memory allocation scheme on a real-world embedded application controlling an Unmanned Aerial Vehicle (UAV).","PeriodicalId":300268,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122249039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}