Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404162
M. Dehyadegari, A. Marongiu, M. R. Kakoee, L. Benini, S. Mohammadi, N. Yazdani
Tightly coupling hardware accelerators with processors is a well-known approach for boosting the efficiency of MPSoC platforms. The key design challenges in this area are: (i) streamlining accelerator definition and instantiation, and (ii) developing architectural templates and run-time techniques that minimize the cost of communication and synchronization between processors and accelerators. In this paper we present an architecture featuring tightly-coupled processors and hardware processing units (HWPUs) with zero-copy communication. We also provide a simple programming API that simplifies offloading jobs to HWPUs.
Title: A tightly-coupled multi-core cluster with shared-memory HW accelerators
Venue: 2012 International Conference on Embedded Computer Systems (SAMOS)
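The zero-copy idea in the abstract above (data stays in memory shared by processor and accelerator; only a reference crosses the offload interface) can be sketched as follows. This is an illustrative Python model, not the paper's actual C API; all names (`SharedBuffer`, `HWPU`, `offload`) are hypothetical.

```python
class SharedBuffer:
    """Buffer placed in memory visible to both processor and accelerator:
    the HWPU reads and writes it in place, so no copy is needed."""
    def __init__(self, data):
        self.data = data  # in a real system: a pointer into shared memory


class HWPU:
    """Toy model of a hardware processing unit operating on shared buffers."""
    def __init__(self, kernel):
        self.kernel = kernel

    def offload(self, buf):
        # Zero-copy: the accelerator works directly on buf.data;
        # only a descriptor (the buffer reference) crosses the interface.
        buf.data = self.kernel(buf.data)
        return buf


# Usage: offload a doubling kernel to the (modeled) accelerator.
hwpu = HWPU(lambda xs: [2 * x for x in xs])
buf = SharedBuffer([1, 2, 3])
hwpu.offload(buf)
print(buf.data)  # -> [2, 4, 6]
```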
Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404182
Ioannis Koutras, A. Bartzas, D. Soudris
Modern applications are becoming more complex and dynamic, and they try to make efficient use of the resources available on their computing platforms. Efficient memory utilization is a key challenge for application developers, especially since memory is a scarce resource that often becomes the system's bottleneck. Developers can therefore resort to dynamic memory management, i.e., dynamic memory allocation and de-allocation, to use memory resources efficiently. This paper presents a high-performance adaptive memory allocator. A memory allocator helps applications manage the memory space that the operating system grants them. In our approach, we tune the memory allocator at runtime by predicting the amount of memory to be requested. Experimental results are obtained using applications from the PARSEC benchmark suite and dmmlib, a memory allocator framework written in C. The results show that adaptive memory allocators can reduce fragmentation, leading to more efficient memory usage.
Title: Adaptive dynamic memory allocators by estimating application workloads
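The runtime-tuning idea (observe recent allocation requests, predict the next one, and adapt the allocator accordingly) might be sketched as below. The windowed moving-average predictor and the rounding to power-of-two size classes are assumptions for illustration, not dmmlib's actual mechanism.

```python
from collections import deque


class AdaptivePredictor:
    """Illustrative sketch: predict the next allocation size from a sliding
    window of recent requests, so an allocator could pre-carve free blocks
    of the predicted size class and reduce fragmentation."""

    def __init__(self, window=8):
        self.history = deque(maxlen=window)  # recent request sizes

    def observe(self, size):
        self.history.append(size)

    def predict(self):
        if not self.history:
            return 0
        # Simple moving average, rounded up to a power-of-two size class.
        avg = sum(self.history) / len(self.history)
        size_class = 1
        while size_class < avg:
            size_class *= 2
        return size_class


p = AdaptivePredictor()
for s in [24, 30, 28, 26]:   # observed malloc() request sizes (bytes)
    p.observe(s)
print(p.predict())  # -> 32  (average 27, next power-of-two class)
```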
Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404170
K. Desnos, M. Pelcat, J. Nezan, Slaheddine Aridhi
This paper presents an application analysis technique to bound the shared memory requirements of a Multiprocessor System-on-Chip (MPSoC) in the early stages of development. The technique is part of a rapid prototyping process and is based on the analysis of a hierarchical Synchronous Data-Flow (SDF) graph description of the system application. The analysis does not require any knowledge of the system architecture or of the mapping and scheduling of the application tasks. The initial step of the method applies a set of transformations to the SDF graph to reveal its memory characteristics. These transformations produce a weighted graph that represents the application's memory objects as well as the memory allocation constraints arising from their relationships. The memory boundaries are then derived from this weighted graph using analogous graph theory problems, in particular the Maximum-Weight Clique (MWC) problem. State-of-the-art algorithms for these problems are presented, and a heuristic approach is proposed to provide a near-optimal solution to the MWC problem. A performance evaluation of the heuristic, based on hierarchical SDF graphs of realistic applications, shows its efficiency in finding near-optimal solutions.
Title: Memory bounds for the distributed execution of a hierarchical Synchronous Data-Flow graph
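The abstract does not specify the near-optimal MWC heuristic; a common greedy heuristic of the kind such approaches build on can be sketched as follows (vertex names and weights are made up, and edges mark memory objects that may be live simultaneously and thus cannot share storage).

```python
def greedy_max_weight_clique(weights, edges):
    """Greedy heuristic for Maximum-Weight Clique: repeatedly add the
    heaviest vertex still adjacent to every vertex chosen so far.
    weights: {vertex: weight}; edges: set of frozenset vertex pairs.
    Returns a clique and its total weight (a lower estimate of the
    true maximum-weight clique)."""
    adj = {v: set() for v in weights}
    for e in edges:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)

    clique = set()
    candidates = set(weights)
    while candidates:
        v = max(candidates, key=lambda u: weights[u])  # heaviest candidate
        clique.add(v)
        candidates &= adj[v]  # keep only vertices adjacent to the clique
    return clique, sum(weights[v] for v in clique)


# Memory objects as vertices, weights = object sizes (illustrative).
w = {"a": 4, "b": 3, "c": 2, "d": 5}
e = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]}
clique, bound = greedy_max_weight_clique(w, e)
print(clique, bound)  # greedy picks d first, then c: {'c', 'd'} 7
```

Note that greedy choices can miss the optimum (here the clique {a, b, c} has weight 9), which is why the paper evaluates how close its heuristic gets to optimal solutions.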
Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404161
Alessandro Strano, D. Bertozzi, F. Triviño, J. L. Sánchez, F. J. Alfaro, J. Flich
Current and future on-chip networks will feature an enhanced degree of reconfigurability. Power management and virtualization strategies, as well as the need to survive the progressive onset of wear-out faults, are the root causes. In all these cases, a non-intrusive and efficient reconfiguration method is needed to allow the network to function uninterruptedly over the course of the reconfiguration process while remaining deadlock-free. This paper is inspired by the overlapped static reconfiguration (OSR) protocol developed for off-chip networks; in its native form, however, its implementation in NoCs is out of reach. Therefore, we carefully engineer the NoC switch architecture and the system-level infrastructure to support a cost-effective, complete and transparent reconfiguration process. Performance during the reconfiguration process is not affected, and the implementation costs (critical path and area overhead) prove fully affordable for a constrained system. Less than 250 cycles are needed to reconfigure an 8×8 2D mesh, with marginal impact on system performance.
Title: OSR-Lite: Fast and deadlock-free NoC reconfiguration framework
Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404201
S. Bhattacharyya
Dataflow methods are widely used for the design and implementation of signal processing functionality in cyber-physical systems. Systematically integrating instrumentation methods into dataflow-based design processes is important to facilitate trade-off assessment and tuning of alternative scheduling strategies. Such instrumentation-driven scheduler development is particularly important for dynamically structured signal processing computations. In this talk, we will present methods developed in the targeted dataflow interchange format (TDIF) environment for rigorously supporting instrumentation throughout the scheduling process. TDIF, a software tool for design and implementation of signal processing systems, emphasizes processes for retargetable design, analysis, and optimization of hardware and software. We will present an internal representation used within TDIF called the instrumented generalized schedule tree (IGST), and demonstrate the utility of IGSTs for constructing, representing, and manipulating dataflow graph schedules in connection with diverse forms of instrumentation functionality, including monitoring associated with memory usage, performance and energy consumption. This talk is based on joint work with Chung-Ching Shen, Hsiang-Huang Wu, Nimish Sane, and William Plishker.
Title: Instrumentation techniques for cyber-physical systems using the targeted dataflow interchange format
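A generalized schedule tree of the kind the IGST extends can be sketched as a tree whose internal nodes carry loop counts and whose leaves name dataflow actors, with monitor hooks standing in for instrumentation points. The toy Python model below is an assumption-laden illustration, not TDIF's actual representation.

```python
class ScheduleNode:
    """Sketch of a generalized schedule tree node: internal nodes carry an
    iteration count over their children, leaves name actors. The 'monitors'
    hooks illustrate the IGST idea of attaching instrumentation (memory,
    performance, energy probes) to schedule-tree nodes."""

    def __init__(self, count=1, actor=None, children=None):
        self.count = count
        self.actor = actor
        self.children = children or []

    def execute(self, fire, monitors=()):
        for _ in range(self.count):
            if self.actor is not None:
                for m in monitors:
                    m(self.actor)  # instrumentation hook before each firing
                fire(self.actor)
            for child in self.children:
                child.execute(fire, monitors)


# The looped schedule (2 A)(3 B) written as a tree; record actor firings.
tree = ScheduleNode(children=[
    ScheduleNode(count=2, actor="A"),
    ScheduleNode(count=3, actor="B"),
])
fired = []
tree.execute(fired.append, monitors=[lambda actor: None])
print(fired)  # -> ['A', 'A', 'B', 'B', 'B']
```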
Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404167
Matthew Milford, J. McAllister
Realising high performance image and signal processing applications on modern FPGA presents a challenging implementation problem due to the large data frames streaming through these systems. Specifically, to meet the high bandwidth and data storage demands of these applications, complex hierarchical memory architectures must be manually specified at the Register Transfer Level (RTL). Automated approaches which convert high-level operation descriptions, for instance in the form of C programs, to an FPGA architecture, are unable to automatically realise such architectures. This paper presents a solution to this problem. It presents a compiler to automatically derive such memory architectures from a C program. By transforming the input C program to a unique dataflow modelling dialect, known as Valved Dataflow (VDF), a mapping and synthesis approach developed for this dialect can be exploited to automatically create high performance image and video processing architectures. Memory intensive C kernels for Motion Estimation (CIF Frames at 30 fps), Matrix Multiplication (128×128 @ 500 iter/sec) and Sobel Edge Detection (720p @ 30 fps), which are unrealisable by current state-of-the-art C-based synthesis tools, are automatically derived from a C description of the algorithm.
Title: Automatic FPGA synthesis of memory intensive C-based kernels
Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404165
Aanjhan Ranganathan, Ali Galip Bayrak, Theo Kluter, P. Brisk, E. Charbon, P. Ienne
We introduce a counting stream register snoop filter, which improves the performance of existing snoop filters based on stream registers. Over time, this class of snoop filters loses the ability to filter memory addresses that have been loaded, and then evicted, from the caches that are filtered; they include cache wrap detection logic, which resets the filter whenever the contents of the cache have been completely replaced. The counting stream register snoop filter introduced here replaces the cache wrap detection logic with a direct-mapped update unit and augments each stream register with a counter, which acts as a validity checker; loading new data into the cache increments the counter, while replacements, snoopy invalidations, and evictions decrement it. A cache wrap is detected whenever the counter reaches zero. Our experimental evaluation shows that the counting stream register snoop filter architecture improves the accuracy compared to traditional stream register snoop filters for representative embedded workloads.
Title: Counting stream registers: An efficient and effective snoop filter architecture
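The counting scheme described above (increment on load, decrement on replacement, snoopy invalidation, or eviction; cache wrap detected when the counter reaches zero) can be sketched per stream register. The Python model below abstracts away addresses and the direct-mapped update unit, so it is a simplification of the architecture, not a faithful model.

```python
class CountingStreamRegister:
    """Sketch of one stream register with the validity counter: the counter
    tracks how many cache lines covered by this register are still resident.
    Loads increment it; replacements, snoopy invalidations, and evictions
    decrement it. Reaching zero means every covered line is gone (a cache
    wrap for this register), so the register can be reset."""

    def __init__(self):
        self.count = 0    # covered lines currently in the cache
        self.valid = False

    def on_load(self):
        self.count += 1
        self.valid = True

    def on_evict(self):
        # also models replacements and snoopy invalidations
        if self.count > 0:
            self.count -= 1
        if self.count == 0:
            self.valid = False  # wrap detected: register may be reset


r = CountingStreamRegister()
for _ in range(3):
    r.on_load()          # three covered lines loaded
r.on_evict()
r.on_evict()
print(r.valid)  # -> True  (one covered line still cached)
r.on_evict()
print(r.valid)  # -> False (wrap detected, register resettable)
```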
Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404175
A. Bonetto, Andrea Cazzaniga, Gianluca Durelli, C. Pilato, D. Sciuto, M. Santambrogio
Nowadays, the usual embedded design flow makes use of different tools to perform the several steps required to obtain a running application on a reconfigurable platform. The integration among these tools is usually not fully automated, forcing the developer to take care of these intermediate steps. This process slows down the application development and delays its time to market. In this work we present the TaBit framework, intended for FPGA designers, that is able to guide the developer from the original partitioned application, described as a task graph, down to its deployment onto the target device. Moreover, this framework defines a set of interfaces that allows the developer to integrate custom scheduling and floor placing techniques. The framework takes care of the integration between the different steps and, based on the designer inputs, it is able to automatically generate a software Scheduling Engine and the set of bitstreams ready to be deployed onto the target device.
Title: TaBit: A framework for task graph to bitstream generation
Pub Date: 2012-07-01 | DOI: 10.1109/SAMOS.2012.6404188
P. Sakellariou, I. Tsatsaragkos, N. Kanistras, A. Mahdi, Vassilis Paliouras
This paper introduces a methodology for prototyping forward error correction (FEC) architectures, oriented to system verification and characterization. A complete design flow is described that satisfies the requirements for error-free hardware design and for accelerating FEC simulations. FPGA devices give the designer the ability to observe rare events, thanks to the tremendous speed-up of FEC operations. A Matlab-based system assists in investigating the impact of very rare decoding-failure events on FEC system performance and in finding solutions aimed at parameter optimization and at improving the BER performance of LDPC codes in the error-floor region. Furthermore, the development of an embedded system that offers remote access to the system under test and automates the verification process is explored. The prototyping approach presented here exploits the high processing speed of FPGA-based emulators and the observability and usability of software-based models.
Title: An FPGA-based prototyping method for verification, characterization and optimization of LDPC error correction systems
DOI: 10.1109/SAMOS.2012.6404158
Ricardo A. Velásquez, P. Michaud, André Seznec
Microarchitecture research and development rely heavily on simulators. The ideal simulator should be simple and easy to develop, it should be precise, accurate and very fast. But the ideal simulator does not exist, and microarchitects use different sorts of simulators at different stages of the development of a processor, depending on which is most important, accuracy or simulation speed. Approximate microarchitecture models, which trade accuracy for simulation speed, are very useful for research and design space exploration, provided the loss of accuracy remains acceptable. Behavioral superscalar core modeling is a possible way to trade accuracy for simulation speed in situations where the focus of the study is not the core itself. In this approach, a superscalar core is viewed as a black box emitting requests to the uncore at certain times. A behavioral core model can be connected to a cycle-accurate uncore model. Behavioral core models are built from cycle-accurate simulations. Once the time to build the model is amortized, important simulation speedups can be obtained. We describe and study a new method for defining behavioral models for modern superscalar cores. The proposed Behavioral Application-Dependent Superscalar Core model, BADCO, predicts the execution time of a thread running on a superscalar core with an error less than 10% in most cases. We show that BADCO is qualitatively accurate, being able to predict how performance changes when we change the uncore. The simulation speedups we obtained are typically between one and two orders of magnitude.
Title: BADCO: Behavioral Application-Dependent Superscalar Core model
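The black-box view of a behavioral core (a trace of timed uncore requests, built offline from a cycle-accurate run, replayed against an uncore model) can be sketched as follows. The single fixed uncore latency and in-order request service are simplifying assumptions for illustration, not BADCO's actual model.

```python
class BehavioralCore:
    """Sketch of a behavioral core model: the core is a black box that
    emits uncore requests at recorded times; connecting it to a different
    uncore model shows how execution time changes with uncore latency."""

    def __init__(self, trace):
        # (emit_cycle, address) pairs captured from a detailed simulation
        self.trace = sorted(trace)

    def run(self, uncore_latency):
        finish = 0
        for cycle, _addr in self.trace:
            # Each request starts no earlier than its recorded emit cycle
            # and no earlier than the previous request's completion.
            finish = max(finish, cycle) + uncore_latency
        return finish


core = BehavioralCore([(0, 0x100), (10, 0x200), (12, 0x300)])
print(core.run(uncore_latency=5))   # -> 20  (fast uncore)
print(core.run(uncore_latency=50))  # -> 150 (slow uncore dominates)
```

The point of such a model, as in the abstract, is that replaying the trace is far cheaper than cycle-accurate core simulation while still reacting qualitatively to uncore changes.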