SuSeSim: a fast simulation strategy to find optimal L1 cache configuration for embedded systems
M. S. Haque, Andhi Janapsatya, S. Parameswaran
Simulation of an application is a popular and reliable approach for finding the optimal level-one cache configuration for an application-specific embedded processor. However, long simulation time is one of the main disadvantages of simulation-based approaches. In this paper, we propose a new and fast simulation method, the Super Set Simulator (SuSeSim). While previous methods use a top-down search strategy, SuSeSim uses a bottom-up search strategy together with a new, elaborate data structure that reduces the search space needed to determine a cache hit or miss. SuSeSim can simulate hundreds of cache configurations simultaneously while reading an application's memory request trace only once, and the total numbers of cache hits and misses are recorded accurately. Depending on the cache block size and benchmark application, SuSeSim reduces the number of tags to be checked by up to 43% compared to the fastest existing simulation approach (the CRCB algorithm). With a faster search and an easy-to-maintain data structure, SuSeSim can be up to 94% faster in simulating memory requests than the CRCB algorithm.
DOI: https://doi.org/10.1145/1629435.1629476
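The paper's bottom-up search and its forest-like data structure are not spelled out in the abstract, so the following is only a minimal single-pass sketch of the general idea: evaluating several set-associative configurations with naive LRU bookkeeping while reading the trace once. The configurations and toy trace are illustrative, not from the paper.

```python
# Minimal sketch: evaluate several L1 cache configurations in one pass over a
# memory-address trace. This is a naive per-configuration LRU model, not the
# bottom-up search or forest data structure proposed in the paper.
from collections import OrderedDict

class LRUCache:
    def __init__(self, num_sets, assoc, block_size):
        self.num_sets, self.assoc, self.block = num_sets, assoc, block_size
        self.sets = [OrderedDict() for _ in range(num_sets)]  # tag -> True, in LRU order
        self.hits = self.misses = 0

    def access(self, addr):
        block_addr = addr // self.block
        idx = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        s = self.sets[idx]
        if tag in s:
            s.move_to_end(tag)        # refresh LRU position
            self.hits += 1
        else:
            self.misses += 1
            if len(s) >= self.assoc:  # evict the least recently used way
                s.popitem(last=False)
            s[tag] = True

# Candidate configurations: (number of sets, associativity, block size in bytes).
configs = [(64, 1, 32), (64, 2, 32), (128, 2, 32), (256, 4, 16)]
caches = [LRUCache(*c) for c in configs]

trace = [0x1000, 0x1004, 0x2000, 0x1008, 0x3000, 0x1000, 0x2004]  # toy trace
for addr in trace:                    # single pass over the trace
    for cache in caches:              # every configuration sees every access
        cache.access(addr)

for cfg, cache in zip(configs, caches):
    print(cfg, "hits:", cache.hits, "misses:", cache.misses)
```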
A high-level virtual platform for early MPSoC software development
J. Ceng, Weihua Sheng, J. Castrillón, Anastasia Stulova, R. Leupers, G. Ascheid, H. Meyr
Multiprocessor Systems-on-Chip (MPSoCs) are now widely used, but software development for them remains one of the biggest challenges for developers. Virtual Platforms (VPs) have been introduced in industry to allow MPSoC software development without a hardware prototype. Nevertheless, for developers in the early design stages, when no VP is yet available, programming support is unsatisfactory. This paper introduces a High-level Virtual Platform (HVP) aimed at early MPSoC software development. The framework provides a set of tools for abstract MPSoC simulation together with the corresponding application programming support, enabling the development of reusable C code at a high level. A case study on several MPSoCs shows that code developed on the HVP can easily be reused on different target platforms. Moreover, the high simulation speed achieved by the HVP also improves the design efficiency of software developers.
DOI: https://doi.org/10.1145/1629435.1629438
ILP optimal scheduling for multi-module memory
Meikang Qiu, Lei Zhang, E. Sha
In high-end digital signal processing (DSP) systems, multi-module memory provides high memory bandwidth and low-power operating modes for energy savings. However, making full use of these architectural features is a challenging code-optimization problem. In this paper, we propose an integer linear programming (ILP) model that optimizes the performance and energy consumption of multi-module memories by solving the variable assignment, instruction scheduling, and operating-mode setting problems simultaneously. The combined effect of performance and energy-saving requirements is also considered. We develop two optimization techniques to improve the computational efficiency of our ILP model. The experimental results show that the optimal performance and energy solution can be obtained within a reasonable amount of time.
DOI: https://doi.org/10.1145/1629435.1629473
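The ILP formulation itself is not reproduced in the abstract. As a hedged stand-in, the sketch below brute-forces variable-to-module assignments on a toy instance and minimizes a weighted sum of access latency and energy, which illustrates the combined performance/energy objective; the module parameters, access counts, capacity limit, and weights are all invented.

```python
# Toy stand-in for the paper's ILP: exhaustively assign variables to memory
# modules and minimize a weighted combination of latency and energy.
# The module parameters and access counts below are illustrative only.
from itertools import product

modules = {            # module name -> (latency per access, energy per access)
    "fast": (1, 4.0),
    "slow": (3, 1.5),
}
accesses = {"a": 100, "b": 40, "c": 10}    # variable -> number of accesses
ALPHA, BETA = 1.0, 0.5                      # weights: performance vs. energy
FAST_CAPACITY = 2                           # at most 2 variables fit in "fast"

best_cost, best_assign = float("inf"), None
for choice in product(modules, repeat=len(accesses)):
    assign = dict(zip(accesses, choice))
    if sum(1 for m in assign.values() if m == "fast") > FAST_CAPACITY:
        continue                            # capacity constraint
    latency = sum(accesses[v] * modules[m][0] for v, m in assign.items())
    energy  = sum(accesses[v] * modules[m][1] for v, m in assign.items())
    cost = ALPHA * latency + BETA * energy
    if cost < best_cost:
        best_cost, best_assign = cost, assign

print("best assignment:", best_assign, "cost:", best_cost)
```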
Synthesis of topology configurations and deadlock free routing algorithms for ReNoC-based systems-on-chip
M.B. Stuart, M. B. Stensgaard, J. Sparsø
In the near future, generic System-on-Chip (SoC) platforms will be replacing custom-designed SoCs. Such generic platforms require a highly flexible interconnect in order to support a wide variety of applications. The ReNoC architecture provides this by allowing power-efficient, application-specific topologies to be configured on top of a fixed but reconfigurable physical architecture through a mixture of packet switching and physical circuit switching. The first contribution of this paper is three novel algorithms that, given an abstract description of the application and the physical architecture, 1) synthesize the application-specific topologies, 2) map them onto the physical architecture, and 3) create deadlock-free, application-specific routing algorithms. The second contribution is a novel physical architecture based on an extended mesh of ReNoC nodes. We apply our algorithms to a mixture of real and synthetic applications and three different physical architectures. Our results show that the algorithms' performance is highly dependent on the physical architecture. On average, our novel physical architecture reduces power consumption by 58% compared to a conventional Network-on-Chip.
DOI: https://doi.org/10.1145/1629435.1629500
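The synthesized routing algorithms are not detailed in the abstract; what any such algorithm must guarantee is deadlock freedom, and the standard way to verify it is that the channel dependency graph induced by the routing function is acyclic. The sketch below implements only that generic check on hypothetical dependencies, not the paper's synthesis flow.

```python
# Standard deadlock-freedom check (Dally/Seitz style): a routing function is
# deadlock-free if its channel dependency graph (CDG) contains no cycle.
# The example dependencies below are hypothetical.

def has_cycle(graph):
    """Detect a cycle in a directed graph given as {node: [successors]}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}

    def dfs(n):
        color[n] = GRAY
        for m in graph.get(n, []):
            if color.get(m, WHITE) == GRAY:
                return True                 # back edge -> cycle
            if color.get(m, WHITE) == WHITE and dfs(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in list(graph))

# Channel dependency graph: an edge c1 -> c2 means some route can hold c1 while
# requesting c2. A cyclic set of dependencies can deadlock.
cdg_cyclic  = {"c0": ["c1"], "c1": ["c2"], "c2": ["c0"]}
cdg_acyclic = {"c0": ["c1"], "c1": ["c2"], "c2": []}

print("cyclic CDG deadlock-free? ", not has_cycle(cdg_cyclic))    # False
print("acyclic CDG deadlock-free?", not has_cycle(cdg_acyclic))   # True
```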
Fast model-based test case classification for performance analysis of multimedia MPSoC platforms
Deepak Gangadharan, S. Chakraborty, Roger Zimmermann
Currently, performance analysis of multimedia MPSoC platforms relies largely on simulation. The execution of one or more applications on such a platform is simulated for a library of test video clips; if all specified performance constraints are satisfied for this library, the architecture is assumed to be well designed. This is similar to testing software for functional correctness. However, in contrast to functional testing, simulating a set of video clips for a complex application/architecture is extremely time consuming. In this paper we propose a technique for clustering a library of video clips such that it is sufficient to simulate only one clip from each cluster rather than the entire library. Our clustering is scalable, i.e., the number of clusters may be chosen based on how many clips the system designer wishes to simulate, independent of the input library size. For each video clip in the library, we perform a fast bitstream analysis from which the workload generated while processing the clip on the given architecture can be estimated. This workload information, in conjunction with a workload model and a performance model of the architecture, is used for the clustering. The entire process involves no simulation and is hence extremely fast. We illustrate its utility through a detailed case study of an MPEG-2 decoder application running on an MPSoC platform. In validating our methodology, we observed that video clips falling into the same cluster exhibit similar worst-case buffer backlogs and worst-case delays per macroblock. Overall, the results demonstrate that the proposed method provides a very fast and accurate analysis and can therefore be of significant benefit to the system designer.
DOI: https://doi.org/10.1145/1629435.1629492
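The bitstream analysis and the workload and performance models are specific to the paper; the sketch below only illustrates the downstream clustering step, grouping clips by an assumed per-clip workload vector with plain k-means so that one representative per cluster can be simulated. The clip names and feature values are invented.

```python
# Sketch: cluster video clips by estimated workload vectors (e.g., mean and
# peak macroblock decoding demand), then simulate one representative per
# cluster instead of the whole library. Plain k-means; features are invented.
import random

def kmeans(points, k, iters=50, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # recompute each center as the mean of its cluster (keep old if empty)
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# (mean workload, peak workload) per clip, in arbitrary units.
clips = {"news": (1.0, 1.2), "talk": (1.1, 1.3), "sports": (3.0, 4.5),
         "action": (3.2, 4.8), "cartoon": (2.0, 2.4)}
centers, clusters = kmeans(list(clips.values()), k=2)

for i, cl in enumerate(clusters):
    members = [name for name, feat in clips.items() if feat in cl]
    print("cluster", i, "->", members,
          "| simulate only:", members[0] if members else None)
```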
An MDP-based application oriented optimal policy for wireless sensor networks
Arslan Munir, A. Gordon-Ross
Technological advancements due to Moore's law have led to the proliferation of complex wireless sensor network (WSN) domains. One commonality across all WSN domains is the need to meet application requirements (e.g., lifetime and responsiveness) through domain-specific sensor node design. Techniques such as sensor node parameter tuning enable WSN designers to specialize tunable parameters (e.g., processor voltage and frequency, sensing frequency) to meet these application requirements. However, given WSN domain diversity, varying environmental situations (stimuli), and sensor node complexity, sensor node parameter tuning is a very challenging task. In this paper, we propose an automated Markov Decision Process (MDP)-based methodology that prescribes optimal sensor node operation (the selection of values for tunable parameters such as processor voltage, processor frequency, and sensing frequency) to meet application requirements and adapt to changing environmental stimuli. Numerical results confirm the optimality of our proposed methodology and reveal that it meets application requirements more closely than other feasible policies.
DOI: https://doi.org/10.1145/1629435.1629461
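The paper's state space, reward structure, and transition model are not given in the abstract; the sketch below shows only the generic value-iteration machinery such an MDP-based tuning policy rests on, using a hypothetical two-state stimulus model and three operating points as actions.

```python
# Generic value iteration for a toy sensor-node tuning MDP. States model the
# environmental stimulus level; actions are tunable operating points. The
# transition probabilities and rewards below are purely hypothetical.
states = ["low_activity", "high_activity"]
actions = ["low_power", "balanced", "high_perf"]

# P[s][a] = list of (next_state, probability); R[s][a] = immediate reward that
# trades off lifetime (energy) against responsiveness.
P = {
    "low_activity":  {a: [("low_activity", 0.9), ("high_activity", 0.1)] for a in actions},
    "high_activity": {a: [("low_activity", 0.3), ("high_activity", 0.7)] for a in actions},
}
R = {
    "low_activity":  {"low_power": 1.0, "balanced": 0.6, "high_perf": 0.2},
    "high_activity": {"low_power": 0.1, "balanced": 0.7, "high_perf": 1.0},
}
GAMMA, EPS = 0.95, 1e-6

V = {s: 0.0 for s in states}
while True:                                  # value iteration until convergence
    V_new = {s: max(R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a])
                    for a in actions) for s in states}
    if max(abs(V_new[s] - V[s]) for s in states) < EPS:
        break
    V = V_new

policy = {s: max(actions, key=lambda a: R[s][a] +
                 GAMMA * sum(p * V[s2] for s2, p in P[s][a])) for s in states}
print("optimal operating point per state:", policy)
```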
Squashing microcode stores to size in embedded systems while delivering rapid microcode accesses
Chengmo Yang, Mingjing Chen, A. Orailoglu
Microcoded customized IPs offer superior performance and direct programmability of micro-architectural structures compared to instruction-based processors, yet at the cost of drastically enlarged code size. Code compression can deliver size reductions, but it requires attention to performance so that the benefits of microcoded IPs are not squandered in the process. To attain this goal, we propose a fast code compression technique that exploits the fact that microcode contains a sizable number of unspecified bits. Although the values and positions of the specified bits are highly irregular, the proposed technique can still flexibly and precisely fill in these specified bits using a linear network. The linear property inherent in the compression strategy in turn enables the development of an extremely low-overhead decompression engine. At runtime, the decompressed code is generated such that all specified bits are filled as required by a fixed-bandwidth XOR network. The combination of the proposed flexible XOR-based network with a minimal two-level storage for highly specified fields, such as immediate values, offers utmost code compression at a negligible cost in performance and hardware overhead.
DOI: https://doi.org/10.1145/1629435.1629471
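The concrete XOR network and two-level storage are not described in the abstract. The sketch below demonstrates only the underlying principle that, with many don't-care bits, a short compressed word can be chosen so that a fixed XOR (GF(2)-linear) network reproduces every specified bit; it solves the resulting linear system by Gaussian elimination over GF(2). The network matrix and microword are made up and are not the paper's design.

```python
# Idea sketch: choose compressed bits x so that a fixed XOR network A
# reproduces every *specified* microcode bit (don't-cares impose no equation).
# Solve A_spec * x = b_spec over GF(2). The matrix and microword are made up.

def solve_gf2(rows, rhs, n):
    """Gauss-Jordan elimination over GF(2); rows[i] is a length-n 0/1 list."""
    rows = [r[:] + [b] for r, b in zip(rows, rhs)]   # augmented matrix
    pivots = []
    for col in range(n):
        piv = next((i for i in range(len(pivots), len(rows)) if rows[i][col]), None)
        if piv is None:
            continue
        rows[len(pivots)], rows[piv] = rows[piv], rows[len(pivots)]
        piv = len(pivots)
        for i in range(len(rows)):
            if i != piv and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[piv])]
        pivots.append(col)
    for i in range(len(pivots), len(rows)):
        if rows[i][n]:                                # 0 = 1 -> inconsistent
            return None
    x = [0] * n
    for i, col in enumerate(pivots):                  # free variables stay 0
        x[col] = rows[i][n]
    return x

# Fixed "XOR network": output bit j is the XOR of compressed bits where A[j]==1.
A = [[1, 0, 1, 0],   # 6 microcode output bits, 4 compressed bits
     [0, 1, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 1],
     [1, 0, 0, 1],
     [0, 1, 0, 1]]
micro = ["1", "x", "0", "x", "1", "x"]                # 'x' = don't-care bit

spec_rows = [A[j] for j, b in enumerate(micro) if b != "x"]
spec_rhs  = [int(b) for b in micro if b != "x"]
x = solve_gf2(spec_rows, spec_rhs, 4)
print("compressed word:", x)
if x is not None:
    decoded = [sum(a * v for a, v in zip(row, x)) % 2 for row in A]
    print("decoded bits  :", decoded, "(matches all specified positions)")
```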
Mapping pipelined applications onto heterogeneous embedded systems: a bayesian optimization algorithm based approach
Antonino Tumeo, Marco Branca, L. Camerini, C. Pilato, P. Lanzi, Fabrizio Ferrandi, D. Sciuto
In this paper we propose a flow based on the Bayesian Optimization Algorithm (BOA) for mapping pipelined applications onto a heterogeneous multiprocessor platform implemented on a Field Programmable Gate Array (FPGA) with customizable processors. BOA is a Probabilistic Model Building Genetic Algorithm (PMBGA) that replaces the classical mutation and crossover operators with the construction and sampling of a Bayesian network, which allows it to identify correlated sub-structures of the problem that should be preserved while generating new solutions. The paper introduces the model adopted for pipelined applications and then shows why BOA fits the problem better than other search algorithms, such as Genetic Algorithms (GA), Simulated Annealing (SA), and Tabu Search (TS). We also show that our algorithm can cope with data-parallel pipelined algorithms. Finally, we validate our flow on realistic applications such as JPEG and ADPCM coding by executing the resulting mappings on our platform.
DOI: https://doi.org/10.1145/1629435.1629495
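A full BOA learns a Bayesian network over the decision variables, which is too long to sketch here. The code below instead uses the simplest member of the PMBGA family (a univariate, UMDA-style marginal model) to map pipeline stages onto processors, purely to show the sample-evaluate-rebuild-model loop that BOA also follows; the stage workloads, processor speeds, and cost model are invented, and this is not the paper's algorithm.

```python
# UMDA-style model-building EA (a univariate simplification of BOA) for mapping
# pipeline stages onto heterogeneous processors. The cost model is invented:
# fitness is the pipeline bottleneck, i.e., the load of the busiest processor.
import random

rng = random.Random(1)
stage_work = [4.0, 2.0, 6.0, 3.0]           # work per pipeline stage
proc_speed = [1.0, 2.0]                      # relative speed of each processor
N_STAGES, N_PROCS = len(stage_work), len(proc_speed)

def bottleneck(mapping):                     # lower is better
    load = [0.0] * N_PROCS
    for s, p in enumerate(mapping):
        load[p] += stage_work[s] / proc_speed[p]
    return max(load)

# probs[s][p] = probability that stage s is mapped to processor p
probs = [[1.0 / N_PROCS] * N_PROCS for _ in range(N_STAGES)]
POP, ELITE, GENS = 40, 10, 30

best = None
for _ in range(GENS):
    pop = [[rng.choices(range(N_PROCS), weights=probs[s])[0]
            for s in range(N_STAGES)] for _ in range(POP)]
    pop.sort(key=bottleneck)
    if best is None or bottleneck(pop[0]) < bottleneck(best):
        best = pop[0][:]
    elite = pop[:ELITE]                      # rebuild the (univariate) model
    probs = [[(sum(1 for ind in elite if ind[s] == p) + 1) / (ELITE + N_PROCS)
              for p in range(N_PROCS)] for s in range(N_STAGES)]

print("best mapping (stage -> processor):", best, "bottleneck:", bottleneck(best))
```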
Statistical physics approaches for network-on-chip traffic characterization
P. Bogdan, R. Marculescu
In order to face the growing complexity of embedded applications, we aim to build highly efficient Network-on-Chip (NoC) architectures that can connect the various computational modules of a platform in a scalable manner. For such networked platforms, it is increasingly important to model the traffic characteristics accurately, as this is intimately related to our ability to determine the optimal buffer size at the various routers in the network and thus provide analytical metrics for power-performance trade-offs. In this paper, we show that the main limitations of queueing-theory and Markov-chain approaches to the buffer sizing problem can be overcome by adopting a statistical physics approach to probability density characterization, one that incorporates the power-law distributions, correlations, and scaling properties exhibited by NoC traffic as a result of various network transactions. As experimental results show, this new approach enables accurate traffic modeling under non-equilibrium conditions. As such, our results can be used directly to solve the buffer sizing problem for multiprocessor systems that communicate via a NoC.
DOI: https://doi.org/10.1145/1629435.1629498
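The analytical model is not reproduced here; the toy experiment below only illustrates why heavy-tailed (power-law) transaction sizes matter for buffer sizing, by comparing the peak backlog at a single router port under light-tailed versus Pareto-distributed batch sizes with roughly the same mean. All parameters are arbitrary and this is not the paper's methodology.

```python
# Toy contrast: peak buffer backlog at a router output when the number of
# packets per network transaction is light-tailed vs. power-law distributed
# (roughly the same mean). One transaction arrives per slot; the port drains a
# fixed number of packets per slot.
import random

rng = random.Random(7)

def peak_backlog(batch_sizes, drain_per_slot=3):
    """Peak number of packets waiting when each slot brings one batch."""
    backlog, peak = 0, 0
    for batch in batch_sizes:
        backlog = max(0, backlog - drain_per_slot) + batch   # drain, then enqueue
        peak = max(peak, backlog)
    return peak

N, MEAN, ALPHA = 100_000, 2.0, 1.5           # ALPHA < 2 -> infinite variance
light = [round(rng.expovariate(1.0 / MEAN)) for _ in range(N)]
heavy = [round(rng.paretovariate(ALPHA) * MEAN * (ALPHA - 1) / ALPHA)
         for _ in range(N)]                   # Pareto scaled to the same mean

print("peak backlog, light-tailed transactions:", peak_backlog(light))
print("peak backlog, power-law transactions   :", peak_backlog(heavy))
```

With the heavy-tailed batch sizes, the peak backlog is dominated by the rare huge transactions, which is exactly the regime where buffer sizes derived from equilibrium, light-tailed assumptions become optimistic.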
Using continuous statistical machine learning to enable high-speed performance prediction in hybrid instruction-/cycle-accurate instruction set simulators
D. Powell, Björn Franke
Functional instruction set simulators perform instruction-accurate simulation of benchmarks at high instruction rates. Unlike their slower, cycle-accurate counterparts, however, they cannot provide cycle counts because of their higher level of hardware abstraction. In this paper we present a novel approach to performance prediction based on statistical machine learning, using a hybrid instruction- and cycle-accurate simulator. We introduce the concept of continuous machine learning in simulation, whereby new training data points are acquired on demand and used for on-the-fly updates of the performance model. Furthermore, we show how statistical regression can be adapted to reduce the cost of these updates during a performance-critical simulation. For a state-of-the-art simulator modeling the ARC 750D embedded processor, we demonstrate that our approach is highly accurate, with an average error below 2.5%, while achieving a speed-up of approximately 50% over the baseline cycle-accurate simulation.
DOI: https://doi.org/10.1145/1629435.1629478
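The regression model and feature set used in the paper are not given in the abstract; the sketch below only illustrates the continuous-learning idea: predict cycle counts from per-block instruction-mix counts with a linear model, and periodically acquire a ground-truth count (standing in for the cycle-accurate mode) to update the model on the fly with one SGD step. The features, the hidden cost model, and the sampling period are all invented.

```python
# Sketch of continuous, on-the-fly performance-model updates: a linear model
# predicts cycles per basic block from instruction-mix counts; every K blocks
# a (hypothetical) cycle-accurate measurement supplies a true count and the
# model takes one SGD step. The feature set and "true" cost model are invented.
import random

rng = random.Random(3)
FEATURES = ["alu", "load", "store", "branch"]
TRUE_COST = {"alu": 1.0, "load": 3.0, "store": 2.0, "branch": 1.5}  # hidden truth

def random_block():
    counts = {f: rng.randint(0, 20) for f in FEATURES}
    true_cycles = sum(TRUE_COST[f] * c for f, c in counts.items())
    return counts, true_cycles

weights = {f: 1.0 for f in FEATURES}         # initial guess: every instr = 1 cycle
LR, K = 1e-4, 10                             # learning rate, sampling period

for i in range(20_000):
    counts, true_cycles = random_block()
    pred = sum(weights[f] * counts[f] for f in FEATURES)   # fast prediction
    if i % K == 0:                           # acquire a training point on demand
        err = pred - true_cycles
        for f in FEATURES:                   # one SGD step on squared error
            weights[f] -= LR * err * counts[f]

print("learned per-instruction costs:", {f: round(w, 2) for f, w in weights.items()})
```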