Message from the Conference Chairs - ASAP 2020
Dirk Koch, Frank Hannig, J. Navaridas
Pub Date: 2020-07-01 | DOI: 10.1109/asap49362.2020.00005 | Pages: i-ii
Message from the ASAP 2016 chairs
David B. Thomas, Suhaib A. Fahmy
Pub Date: 2016-07-01 | DOI: 10.1109/ASAP.2016.7760764 | Pages: iii-iv
We welcome you to the 27th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP 2016). This year's event takes place in London, United Kingdom, on the campus of Imperial College London. Prior to this year's visit to London, the conference has been held in many places around the globe, including Oxford (1986), San Diego (1988), Killarney (1989), Princeton (1990), Barcelona (1991), Berkeley (1992), Venice (1993), San Francisco (1994), Strasbourg (1995), Chicago (1996), Zurich (1997), Boston (2000), San Jose (2002), The Hague (2003), Galveston (2004), Samos (2005), Steamboat Springs (2006), Montreal (2007), Leuven (2008), Boston (2009), Rennes (2010), Santa Monica (2011), Delft (2012), Washington, D.C. (2013), Zurich (2014), and Toronto (2015). Though this is the 27th iteration of ASAP, it is actually the 30-year anniversary of the first conference in Oxford.
An interpolation-based approach to multi-parameter performance modeling for heterogeneous systems
D. Rudolph, G. Stitt
Pub Date: 2015-07-27 | DOI: 10.1109/ASAP.2015.7245731 | Pages: 174-180
To effectively optimize applications for emerging heterogeneous architectures, compilers and synthesis tools must perform the challenging task of estimating the performance of different implementations and optimizations for different numbers and types of computational resources. Many performance-prediction techniques exist, but those approaches are specific to particular resources or applications, and are often not capable of prediction for all combinations of inputs. In this paper, we introduce an approach to multi-parameter performance modeling based on sampling and interpolation. This approach can be used in conjunction with execution time data, simulated or observed, to quickly perform performance estimation for any function, on any resource, with any combination of inputs. By evaluating a Kriging-based interpolator on a variety of functions and computational resources, we determine bounds on the accuracy of this approach, and show that an interpolation-based approach utilizing Kriging can effectively model execution time for most applications. We also show that Kriging is a highly effective interpolation technique for execution time, and can be up to four orders of magnitude more accurate than nearest-neighbor interpolation or radial basis function interpolation.
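The Kriging interpolator at the heart of this approach can be illustrated with a minimal sketch: simple Kriging is equivalent to zero-mean Gaussian-process regression. The sketch below (NumPy only; the `runtime` cost curve, sample counts, and kernel length scale are hypothetical choices, not the paper's setup) compares Kriging against nearest-neighbor interpolation on a smooth execution-time curve.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def kriging_predict(x_train, y_train, x_query, length_scale=1.0, nugget=1e-6):
    """Simple Kriging = zero-mean Gaussian-process regression."""
    K = rbf_kernel(x_train, x_train, length_scale) + nugget * np.eye(len(x_train))
    k_star = rbf_kernel(x_query, x_train, length_scale)
    return k_star @ np.linalg.solve(K, y_train)

def nearest_neighbor_predict(x_train, y_train, x_query):
    """Predict each query point from its nearest training sample."""
    idx = np.abs(x_query[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

# Hypothetical smooth "execution time vs. problem size" curve.
def runtime(n):
    return 0.5 * n ** 1.5 + 2.0 * n

x_train = np.linspace(1.0, 10.0, 12)
y_train = runtime(x_train)
x_query = np.linspace(1.5, 9.5, 50)
truth = runtime(x_query)

err_krig = np.abs(kriging_predict(x_train, y_train, x_query, length_scale=2.0) - truth).mean()
err_nn = np.abs(nearest_neighbor_predict(x_train, y_train, x_query) - truth).mean()
print(f"mean abs error: kriging={err_krig:.4f}, nearest-neighbor={err_nn:.4f}")
```

On a smooth cost surface, the Kriging predictions track the curve between samples while nearest-neighbor produces a staircase, which is the accuracy gap the paper quantifies.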
Reconfigurable acceleration of fitness evaluation in trading strategies
Andreea-Ingrid Funie, Paul Grigoras, P. Burovskiy, W. Luk, Mark Salmon
Pub Date: 2015-07-27 | DOI: 10.1109/ASAP.2015.7245736 | Pages: 210-217
Over the past few years, examining financial markets has become a crucial part of both the trading and regulatory processes. Recently, genetic programs have been used to identify patterns in financial markets which may lead to more advanced trading strategies. We investigate the use of Field Programmable Gate Arrays to accelerate the evaluation of the fitness function, an important kernel in genetic programming. Our pipelined design makes use of the massive amounts of parallelism available on chip to evaluate the fitness of multiple genetic programs simultaneously. An evaluation of our designs on both synthetic and historical market data shows that our implementation evaluates the fitness function up to 21.56 times faster than a multi-threaded C++11 implementation running on two six-core Intel Xeon E5-2640 processors using OpenMP.
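As an illustration of the kind of fitness kernel being accelerated, the sketch below scores a population of candidate trading rules against a synthetic price series. For simplicity, each candidate is a moving-average crossover rule rather than the expression trees evolved by genetic programming; the data, rule parameters, and fitness metric are all illustrative assumptions, not the paper's design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic price series (a hypothetical stand-in for market data).
prices = 100.0 + np.cumsum(rng.normal(0.0, 1.0, size=500))

# Each candidate "program" is reduced here to a moving-average crossover
# rule parameterised by (short_window, long_window).
candidates = [(3, 10), (5, 20), (8, 40), (2, 50)]

def moving_average(x, w):
    # 'valid' mode yields len(x) - w + 1 fully-windowed averages.
    return np.convolve(x, np.ones(w) / w, mode="valid")

def fitness(prices, short_w, long_w):
    """Fraction of days where the rule's long/short signal matches the
    next day's actual price direction."""
    n = len(prices)
    short = moving_average(prices, short_w)
    long_ = moving_average(prices, long_w)
    m = min(len(short), len(long_))          # align both to end on day n-1
    signal = np.where(short[-m:] > long_[-m:], 1.0, -1.0)
    days = np.arange(n - m, n)               # day index of each signal entry
    usable = days < n - 1                    # last day has no "next day"
    move = np.sign(np.diff(prices))          # move[t]: direction day t -> t+1
    hits = signal[usable] * move[days[usable]] > 0
    return hits.mean()

scores = {c: fitness(prices, *c) for c in candidates}
best = max(scores, key=scores.get)
print(f"best rule {best} with fitness {scores[best]:.3f}")
```

Each candidate's score is independent of the others, which is exactly the parallelism the pipelined FPGA design exploits by evaluating many programs simultaneously.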
Speeding up graph-based SLAM algorithm: A GPU-based heterogeneous architecture study
Abdelhamid Dine, A. Elouardi, B. Vincke, S. Bouaziz
Pub Date: 2015-07-27 | DOI: 10.1109/ASAP.2015.7245711 | Pages: 72-73
In this paper, we present a study of using a heterogeneous architecture to implement the graph-based SLAM algorithm. The study investigates the performance of an ARM-GPU based architecture by offloading some critical compute-intensive tasks of the algorithm to the integrated GPU.
Does arithmetic logic dominate data movement? A systematic comparison of energy-efficiency for FFT accelerators
T. Hoang, Amirali Shambayati, H. Hoffmann, A. Chien
Pub Date: 2015-07-27 | DOI: 10.1109/ASAP.2015.7245708 | Pages: 66-67
In this paper, we perform a systematic comparison of the energy cost of varying data formats and data types, with respect to arithmetic logic and data movement, for accelerator-based heterogeneous systems in which both a compute-intensive (FFT) accelerator and a data-intensive (DLT) accelerator are added. We evaluate a wide range of design processes (e.g., 32nm bulk CMOS and projected 7nm FinFET) and memory systems (e.g., DDR3 and HMC). First, our results show that when varying data formats, the energy cost of using floating point over fixed point in the 32nm process is 5.3% (DDR3) and 6.2% (HMC) for the core, and 0.8% (DDR3) and 1.5% (HMC) for the system. In the 7nm FinFET process with DDR3 memory, these costs become negligible (0.2% for the core and 0.01% for the system), and increase only slightly with HMC. Second, we find that the core and system energy of a system using a fixed-point 16-bit FFT accelerator is nearly half that of a 32-bit one when data movement is also accelerated. This implies that system energy is highly proportional to the amount of data moved when varying data types.
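The proportionality between system energy and data movement can be illustrated with a back-of-envelope model: total energy as arithmetic-op count times per-op energy, plus bytes moved times per-byte energy. All numbers below are hypothetical placeholders, not the paper's measurements; the sketch only shows why halving the data width (32-bit to 16-bit samples) halves the movement term.

```python
def system_energy(n_ops, bytes_moved, e_op_pj, e_byte_pj):
    """Total energy in picojoules: arithmetic term plus data-movement term."""
    return n_ops * e_op_pj + bytes_moved * e_byte_pj

n = 1 << 20                # FFT size (placeholder)
n_ops = 5 * n * 20         # ~5*N*log2(N) butterfly operations
e_op_pj = 1.0              # pJ per arithmetic op (placeholder value)
e_byte_pj = 20.0           # pJ per byte of DRAM traffic (placeholder value)

e32 = system_energy(n_ops, n * 4, e_op_pj, e_byte_pj)  # 32-bit samples
e16 = system_energy(n_ops, n * 2, e_op_pj, e_byte_pj)  # 16-bit samples
print(f"32-bit: {e32 / 1e6:.1f} uJ, 16-bit: {e16 / 1e6:.1f} uJ")
```

With movement energy per byte far above per-op energy, the data-width choice dominates the total whenever the movement term is comparable to or larger than the arithmetic term, which is the regime the paper's measurements point to.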
A scheduling and binding heuristic for high-level synthesis of fault-tolerant FPGA applications
Aniruddha Shastri, G. Stitt, Eduardo Riccio
Pub Date: 2015-07-27 | DOI: 10.1109/ASAP.2015.7245735 | Pages: 202-209
Space computing systems commonly use field-programmable gate arrays to provide fault tolerance by applying triple modular redundancy (TMR) to existing register-transfer-level (RTL) code. Although effective, this approach has a 3× area overhead that can be prohibitive for the many designs that allocate resources before considering the effects of redundancy. Although a designer could modify existing RTL code to reduce resource usage, such a process is time-consuming and error-prone. Integrating redundancy into high-level synthesis is a more attractive approach that enables synthesis to rapidly explore different tradeoffs at no cost to the designer. In this paper, we introduce a scheduling and binding heuristic for high-level synthesis that explores tradeoffs among resource usage, latency, and the amount of redundancy. In many cases, an application will not require 100% error correction, which gives scheduling and binding significant flexibility to reduce resources. Even for applications that require 100% error correction, our heuristic can explore solutions that sacrifice latency for reduced resources, typically saving up to 47% when relaxing the latency up to 2×. When the error constraint is reduced to 70%, our heuristic achieves typical resource savings of 18% to 49% when relaxing the latency up to 2×, with a maximum of 77%. Even compared with optimized RTL designs, our heuristic uses up to 61% fewer resources than TMR.
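The voting step that underlies TMR can be sketched in a few lines; the paper's contribution is the HLS scheduling/binding heuristic built around such redundancy, which is not reproduced here. This minimal majority voter simply shows how triplicating an operation masks any single faulty result, at the 3× cost the heuristic works to reduce.

```python
from collections import Counter

def tmr_vote(a, b, c):
    """Majority vote over three redundant results; masks any single fault."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count == 1:
        # All three disagree: more than one module faulted, no majority exists.
        raise ValueError("no majority: more than one module faulted")
    return value

# A single faulty replica is outvoted by the two correct ones.
correct, faulty = 42, 41
print(tmr_vote(correct, correct, faulty))  # -> 42
```

Relaxing the error-correction requirement, as the paper does, amounts to leaving some operations unreplicated or only duplicated, which is where the resource savings come from.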
MultiExplorer: A tool set for multicore system-on-chip design exploration
Rodrigo Devigo, Liana Duenha, R. Azevedo, R. Santos
Pub Date: 2015-07-27 | DOI: 10.1109/ASAP.2015.7245727 | Pages: 160-161
This paper proposes MultiExplorer, a new toolset for MPSoC modelling, experimentation, and design-space exploration, combining fast high-abstraction simulation with low-level physical estimates (power, area, and timing). The MultiExplorer infrastructure takes a range of high- and low-level parameters to improve accuracy in the design of a multiprocessor system-on-chip. Our results show that the toolset is a viable alternative for exploring multiprocessor scalability (1-64 cores) within affordable simulation times.
A metamorphotic Network-on-Chip for various types of parallel applications
S. Tade, Hiroki Matsutani, H. Amano, M. Koibuchi
Pub Date: 2015-07-27 | DOI: 10.1109/ASAP.2015.7245715 | Pages: 98-105
A metamorphotic Network-on-Chip (NoC) architecture is proposed that can be customized for performance or energy consumption on a per-application basis. Adding reconfigurability to conventional topologies has been studied so far, especially for application workloads that can be statically analyzed. In this context, we propose a platform that handles both the static case and the dynamic case, where application workloads cannot be statically analyzed but performance or energy constraints are given. Our metamorphotic NoC reconfigures its topology, routing, operating frequency, and supply voltage across the following three modes. 1) Regular mode uses a traditional mesh topology for neighboring communications; as the links are short and uniform, it can operate at a higher frequency and voltage, though long-range communication increases the path length. 2) Random mode uses a random topology for unknown workloads to reduce the path length by exploiting the small-world effect; since the path length shrinks but the wire delay grows, it is intended for a lower operating frequency and voltage. 3) Custom mode uses a topology optimized for given workloads. To support Random and Custom modes, assembled multiplexers are embedded into the metamorphotic NoC; Random and Regular/Custom modes are generated by reconfiguring these multiplexers randomly or selectively, respectively, based on the performance or energy constraints. This paper explores the design space of the assembled multiplexers and provides a reasonable design recommendation through graph analysis, demonstrated with experimental results on area overhead, operating frequency, network performance, and energy consumption. The results show that Regular mode can operate at 1.27GHz, and that Random mode reduces the average network latency by 19.6% and the energy consumption by 44.2% compared with a traditional mesh-topology NoC, with little overhead. Custom mode achieves reductions comparable to Random mode.
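The small-world effect that Random mode exploits can be demonstrated with a toy experiment: adding a handful of random long-range links to a mesh sharply reduces the average hop count. The mesh size, shortcut count, and seed below are arbitrary choices for illustration, not the paper's configuration.

```python
import random
from collections import deque

def mesh_edges(k):
    """Edges of a k x k mesh; nodes numbered row-major, stored as (u, v), u < v."""
    edges = set()
    for r in range(k):
        for c in range(k):
            u = r * k + c
            if c + 1 < k:
                edges.add((u, u + 1))   # horizontal link
            if r + 1 < k:
                edges.add((u, u + k))   # vertical link
    return edges

def avg_shortest_path(n, edges):
    """Mean hop count over all node pairs, via BFS from every source."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    total = pairs = 0
    for s in range(n):
        dist = [-1] * n
        dist[s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(d for d in dist if d > 0)
        pairs += sum(1 for d in dist if d > 0)
    return total / pairs

k = 8
n = k * k
regular = mesh_edges(k)

# "Random mode" sketch: keep the mesh and add a few long-range shortcuts.
rng = random.Random(0)
shortcuts = set()
while len(shortcuts) < 16:
    u, v = rng.sample(range(n), 2)
    shortcuts.add((min(u, v), max(u, v)))
random_mode = regular | shortcuts

print(f"mesh: {avg_shortest_path(n, regular):.2f} hops, "
      f"mesh+shortcuts: {avg_shortest_path(n, random_mode):.2f} hops")
```

The shorter average path is what lets Random mode cut latency and energy even though the long random wires force a lower clock frequency.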
Timing speculation-aware instruction set extension for resource-constrained embedded systems
Tanvir Ahmed, Yuko Hara-Azumi
Pub Date: 2015-07-27 | DOI: 10.1109/ASAP.2015.7245701 | Pages: 30-34
Performance, area, and power are important concerns for many embedded systems. One area- and power-efficient way to improve performance is instruction set architecture (ISA) extension. Although existing works have introduced application-specific accelerators co-operating with a basic processor, most are still unsuitable for embedded systems with stringent resource and/or power constraints because of excess, power-hungry resources in the basic processor. In this paper, we propose an ISA extension for such stringently constrained embedded systems. Contrary to previous works, ours simplifies the basic processor by replacing its power-hungry resources with power-efficient alternatives. Then, considering application features (not only input patterns but also instruction sequences), we extend the software binary with new instructions executable on the simplified processor. These hardware and software extensions jointly work well for timing speculation (TS). To the best of our knowledge, this is the first TS-aware ISA extension applicable to embedded systems with stringent area and/or power constraints. In our evaluation, we achieved a 29.9% speedup in execution time and 1.5× more aggressive clock scaling, along with 8.7% and 48.3% reductions in circuit area and power-delay product, respectively, compared with the traditional worst-case design.