Giovanni Mariani, G. Palermo, V. Zaccaria, C. Silvano
The design space exploration (DSE) phase is used to tune configurable system parameters and it generally consists of a multiobjective optimization (MOO) problem. It is usually done at pre-design phase and consists of the evaluation of large design spaces where each configuration requires long simulation. Several heuristic techniques have been proposed in the past and the recent trend is reducing the exploration time by using analytic prediction models to approximate the system metrics, effectively pruning sub-optimal configurations from the exploration scope. However, there is still a missing path towards the effective usage of the underlying computing resources used by the DSE process. In this work, we will show that an alternative and almost orthogonal approach - focused on exploiting the available parallelism in terms of computing resources - can be used to better schedule the simulations and to obtain a high speedup with respect to state of the art approaches, without compromising the accuracy of exploration results. Experimental results will be presented by dealing with the DSE problem of a shared memory multi-core system considering a variable number of available parallel resources to support the DSE phase1.
{"title":"DeSpErate: Speeding-up design space exploration by using predictive simulation scheduling","authors":"Giovanni Mariani, G. Palermo, V. Zaccaria, C. Silvano","doi":"10.7873/DATE.2014.231","DOIUrl":"https://doi.org/10.7873/DATE.2014.231","url":null,"abstract":"The design space exploration (DSE) phase is used to tune configurable system parameters and it generally consists of a multiobjective optimization (MOO) problem. It is usually done at pre-design phase and consists of the evaluation of large design spaces where each configuration requires long simulation. Several heuristic techniques have been proposed in the past and the recent trend is reducing the exploration time by using analytic prediction models to approximate the system metrics, effectively pruning sub-optimal configurations from the exploration scope. However, there is still a missing path towards the effective usage of the underlying computing resources used by the DSE process. In this work, we will show that an alternative and almost orthogonal approach - focused on exploiting the available parallelism in terms of computing resources - can be used to better schedule the simulations and to obtain a high speedup with respect to state of the art approaches, without compromising the accuracy of exploration results. Experimental results will be presented by dealing with the DSE problem of a shared memory multi-core system considering a variable number of available parallel resources to support the DSE phase1.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"16 1","pages":"1-4"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89517126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jan Weinstock, Christoph Schumacher, R. Leupers, G. Ascheid, L. Tosoratto
With increasing system size and complexity, designers of embedded systems face the challenge of efficiently simulating these systems in order to enable target specific software development and design space exploration as early as possible. Today's multicore workstations offer enormous computational power, but traditional simulation engines like the OSCI SystemC kernel only operate on a single thread, thereby leaving a lot of computational potential unused. Most modern embedded system designs include multiple processors. This work proposes SCope, a SystemC kernel that aims at exploiting the inherent parallelism of such systems by simulating the processors on different threads. A lookahead mechanism is employed to reduce the required synchronization between the simulation threads, thereby further increasing simulation speed. The virtual prototype of the European FP7 project EURETILE system simulator is used as demonstrator for the proposed work, showing a speedup of 4.01× on a four core host system compared to sequential simulation.
{"title":"Time-decoupled parallel SystemC simulation","authors":"Jan Weinstock, Christoph Schumacher, R. Leupers, G. Ascheid, L. Tosoratto","doi":"10.7873/DATE.2014.204","DOIUrl":"https://doi.org/10.7873/DATE.2014.204","url":null,"abstract":"With increasing system size and complexity, designers of embedded systems face the challenge of efficiently simulating these systems in order to enable target specific software development and design space exploration as early as possible. Today's multicore workstations offer enormous computational power, but traditional simulation engines like the OSCI SystemC kernel only operate on a single thread, thereby leaving a lot of computational potential unused. Most modern embedded system designs include multiple processors. This work proposes SCope, a SystemC kernel that aims at exploiting the inherent parallelism of such systems by simulating the processors on different threads. A lookahead mechanism is employed to reduce the required synchronization between the simulation threads, thereby further increasing simulation speed. The virtual prototype of the European FP7 project EURETILE system simulator is used as demonstrator for the proposed work, showing a speedup of 4.01× on a four core host system compared to sequential simulation.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"38 16","pages":"1-4"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91438963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Burgio, Giuseppe Tagliavini, Francesco Conti, A. Marongiu, L. Benini
Modern designs for embedded systems are increasingly embracing cluster-based architectures, where small sets of cores communicate through tightly-coupled shared memory banks and high-performance interconnections. At the same time, the complexity of modern applications requires new programming abstractions to exploit dynamic and/or irregular parallelism on such platforms. Supporting dynamic parallelism in systems which i) are resource-constrained and ii) run applications with small units of work calls for a runtime environment which has minimal overhead for the scheduling of parallel tasks. In this work, we study the major sources of overhead in the implementation of OpenMP dynamic loops, sections and tasks, and propose a hardware implementation of a generic Scheduling Engine (HWSE) which fits the semantics of the three constructs. The HWSE is designed as a tightly-coupled block to the PEs within a multi-core cluster, communicating through a shared-memory interface. This allows very fast programming and synchronization with the controlling PEs, fundamental to achieving fast dynamic scheduling, and ultimately to enable fine-grained parallelism. We prove the effectiveness of our solutions with real applications and synthetic benchmarks, using a cycle-accurate virtual platform.
{"title":"Tightly-coupled hardware support to dynamic parallelism acceleration in embedded shared memory clusters","authors":"P. Burgio, Giuseppe Tagliavini, Francesco Conti, A. Marongiu, L. Benini","doi":"10.7873/DATE.2014.169","DOIUrl":"https://doi.org/10.7873/DATE.2014.169","url":null,"abstract":"Modern designs for embedded systems are increasingly embracing cluster-based architectures, where small sets of cores communicate through tightly-coupled shared memory banks and high-performance interconnections. At the same time, the complexity of modern applications requires new programming abstractions to exploit dynamic and/or irregular parallelism on such platforms. Supporting dynamic parallelism in systems which i) are resource-constrained and ii) run applications with small units of work calls for a runtime environment which has minimal overhead for the scheduling of parallel tasks. In this work, we study the major sources of overhead in the implementation of OpenMP dynamic loops, sections and tasks, and propose a hardware implementation of a generic Scheduling Engine (HWSE) which fits the semantics of the three constructs. The HWSE is designed as a tightly-coupled block to the PEs within a multi-core cluster, communicating through a shared-memory interface. This allows very fast programming and synchronization with the controlling PEs, fundamental to achieving fast dynamic scheduling, and ultimately to enable fine-grained parallelism. We prove the effectiveness of our solutions with real applications and synthetic benchmarks, using a cycle-accurate virtual platform.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"55 11 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82345148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As the resolution of digital images increases, accessing raw image data from memory has become a major consideration during the design of image/video processing systems. This is due to the fact that the bandwidth requirement and energy consumption of such image accessing process has increased. Inspired by the successful application of progressive image sampling techniques in many image processing tasks, this work proposes to apply similar concept within hardware systems to efficiently trade image quality for reduced memory bandwidth requirement and lower energy consumption. Based on this idea, a hardware system is proposed that is placed between the memory subsystem and the processing core of the design. The proposed system alters the conventional memory access pattern to progressively and adaptively access pixels from a target memory external to the system. The sampled pixels are used to reconstruct an approximation to the ground truth, which is stored in an internal image buffer for further processing. The system is prototyped on FPGA and its performance evaluation shows that a saving of up to 85% of memory accessing time and 33%/45% of image acquisition time/energy is achieved on the benchmark image “lena” while maintaining a PSNR of about 30 dB.
{"title":"Image progressive acquisition for hardware systems","authors":"Jianxiong Liu, C. Bouganis, P. Cheung","doi":"10.5555/2616606.2617107","DOIUrl":"https://doi.org/10.5555/2616606.2617107","url":null,"abstract":"As the resolution of digital images increases, accessing raw image data from memory has become a major consideration during the design of image/video processing systems. This is due to the fact that the bandwidth requirement and energy consumption of such image accessing process has increased. Inspired by the successful application of progressive image sampling techniques in many image processing tasks, this work proposes to apply similar concept within hardware systems to efficiently trade image quality for reduced memory bandwidth requirement and lower energy consumption. Based on this idea, a hardware system is proposed that is placed between the memory subsystem and the processing core of the design. The proposed system alters the conventional memory access pattern to progressively and adaptively access pixels from a target memory external to the system. The sampled pixels are used to reconstruct an approximation to the ground truth, which is stored in an internal image buffer for further processing. The system is prototyped on FPGA and its performance evaluation shows that a saving of up to 85% of memory accessing time and 33%/45% of image acquisition time/energy is achieved on the benchmark image “lena” while maintaining a PSNR of about 30 dB.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"28 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80895940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Antonio Iannopollo, P. Nuzzo, S. Tripakis, A. Sangiovanni-Vincentelli
Given a global specification contract and a system described by a composition of contracts, system verification reduces to checking that the composite contract refines the specification contract, i.e. that any implementation of the composite contract implements the specification contract and is able to operate in any environment admitted by it. Contracts are captured using high-level declarative languages, for example, linear temporal logic (LTL). In this case, refinement checking reduces to an LTL satisfiability checking problem, which can be very expensive to solve for large composite contracts. This paper proposes a scalable refinement checking approach that relies on a library of contracts and local refinement assertions. We propose an algorithm that, given such a library, breaks down the refinement checking problem into multiple successive refinement checks, each of smaller scale. We illustrate the benefits of the approach on an industrial case study of an aircraft electric power system, with up to two orders of magnitude improvement in terms of execution time.
{"title":"Library-based scalable refinement checking for contract-based design","authors":"Antonio Iannopollo, P. Nuzzo, S. Tripakis, A. Sangiovanni-Vincentelli","doi":"10.7873/DATE2014.167","DOIUrl":"https://doi.org/10.7873/DATE2014.167","url":null,"abstract":"Given a global specification contract and a system described by a composition of contracts, system verification reduces to checking that the composite contract refines the specification contract, i.e. that any implementation of the composite contract implements the specification contract and is able to operate in any environment admitted by it. Contracts are captured using high-level declarative languages, for example, linear temporal logic (LTL). In this case, refinement checking reduces to an LTL satisfiability checking problem, which can be very expensive to solve for large composite contracts. This paper proposes a scalable refinement checking approach that relies on a library of contracts and local refinement assertions. We propose an algorithm that, given such a library, breaks down the refinement checking problem into multiple successive refinement checks, each of smaller scale. We illustrate the benefits of the approach on an industrial case study of an aircraft electric power system, with up to two orders of magnitude improvement in terms of execution time.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"10 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78323837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPU architecture has traditionally been used in graphics application because of its enormous computing capability. Moreover, GPU architecture has also been used for general purpose computing in these days. Most of the current scheduling frameworks that are developed to handle GPGPU workload operate sequentially. This is problematic since this sequential approach may not be scalable for real-time systems, which is a consequence of the approach's inability to support preemption. We propose a novel scheduling framework that provides real-time support for the GPGPU platform. In contrast to existing frameworks, our proposed framework considers both concurrent execution of applications on the GPU and mapping between streaming multiprocessors and thread blocks. By considering both concurrent execution and mapping, our framework is able to guarantee timing up to 6.4 times as many applications compared to TimeGraph [9] and Global EDF [5]. In addition, our experimental applications use up to 20% less power under our scheduling framework compared to [5], [9].
{"title":"GPU-EvR: Run-time event based real-time scheduling framework on GPGPU platform","authors":"Haeseung Lee, M. A. Faruque","doi":"10.7873/DATE.2014.233","DOIUrl":"https://doi.org/10.7873/DATE.2014.233","url":null,"abstract":"GPU architecture has traditionally been used in graphics application because of its enormous computing capability. Moreover, GPU architecture has also been used for general purpose computing in these days. Most of the current scheduling frameworks that are developed to handle GPGPU workload operate sequentially. This is problematic since this sequential approach may not be scalable for real-time systems, which is a consequence of the approach's inability to support preemption. We propose a novel scheduling framework that provides real-time support for the GPGPU platform. In contrast to existing frameworks, our proposed framework considers both concurrent execution of applications on the GPU and mapping between streaming multiprocessors and thread blocks. By considering both concurrent execution and mapping, our framework is able to guarantee timing up to 6.4 times as many applications compared to TimeGraph [9] and Global EDF [5]. In addition, our experimental applications use up to 20% less power under our scheduling framework compared to [5], [9].","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"13 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79239805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Juan Fernando Eusse Giraldo, R. Leupers, G. Ascheid, Patrick Sudowe, B. Leibe, Tamon Sadasue
Real-time identification of connected regions of pixels in large (e.g. FullHD) frames is a mandatory and expensive step in many computer vision applications that are becoming increasingly popular in embedded mobile devices such as smart-phones, tablets and head mounted devices. Standard off-the-shelf embedded processors are not yet able to cope with the performance/flexibility trade-offs required by such applications. Therefore, in this work we present an Application Specific Instruction Set Processor (ASIP) tailored to concurrently execute thresholding, connected components labeling and basic feature extraction of image frames. The proposed architecture is capable to cope with frame complexities ranging from QCIF to FullHD frames with 1 to 4 bytes-per-pixel formats, while achieving an average frame rate of 30 frames-per-second (fps). Synthesis was performed for a standard 65nm CMOS library, obtaining an operating frequency of 350MHz and 2.1mm2 area. Moreover, evaluations were conducted both on typical and synthetic data sets, in order to thoroughly assess the achievable performance. Finally, an entire planar-marker based augmented reality application was developed and simulated for the ASIP.
{"title":"A flexible ASIP architecture for connected components labeling in embedded vision applications","authors":"Juan Fernando Eusse Giraldo, R. Leupers, G. Ascheid, Patrick Sudowe, B. Leibe, Tamon Sadasue","doi":"10.7873/DATE.2014.367","DOIUrl":"https://doi.org/10.7873/DATE.2014.367","url":null,"abstract":"Real-time identification of connected regions of pixels in large (e.g. FullHD) frames is a mandatory and expensive step in many computer vision applications that are becoming increasingly popular in embedded mobile devices such as smart-phones, tablets and head mounted devices. Standard off-the-shelf embedded processors are not yet able to cope with the performance/flexibility trade-offs required by such applications. Therefore, in this work we present an Application Specific Instruction Set Processor (ASIP) tailored to concurrently execute thresholding, connected components labeling and basic feature extraction of image frames. The proposed architecture is capable to cope with frame complexities ranging from QCIF to FullHD frames with 1 to 4 bytes-per-pixel formats, while achieving an average frame rate of 30 frames-per-second (fps). Synthesis was performed for a standard 65nm CMOS library, obtaining an operating frequency of 350MHz and 2.1mm2 area. Moreover, evaluations were conducted both on typical and synthetic data sets, in order to thoroughly assess the achievable performance. Finally, an entire planar-marker based augmented reality application was developed and simulated for the ASIP.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"9 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79547022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Swaminathan Narayanaswamy, S. Steinhorst, M. Lukasiewycz, M. Kauer, S. Chakraborty
This paper presents an approach to optimal dimensioning of active cell balancing architectures, which are of increasing relevance in Electrical Energy Storages (EESs) for Electric Vehicles (EVs) or stationary applications such as smart grids. Active cell balancing equalizes the state of charge of cells within a battery pack via charge transfers, increasing the effective capacity and lifetime. While optimization approaches have been introduced into the design process of several aspects of EESs, active cell balancing architectures have, until now, not been systematically optimized in terms of their components. Therefore, this paper analyzes existing architectures to develop design metrics for energy dissipation, installation volume, and balancing current. Based on these design metrics, a methodology to efficiently obtain Pareto-optimal configurations for a wide range of inductors and transistors at different balancing currents is developed. Our methodology is then applied to a case study, optimizing two state-of-the-art architectures using realistic balancing algorithms. The results give evidence of the applicability of systematic optimization in the domain of cell balancing, leading to higher energy efficiencies with minimized installation space.
{"title":"Optimal dimensioning of active cell balancing architectures","authors":"Swaminathan Narayanaswamy, S. Steinhorst, M. Lukasiewycz, M. Kauer, S. Chakraborty","doi":"10.7873/DATE.2014.153","DOIUrl":"https://doi.org/10.7873/DATE.2014.153","url":null,"abstract":"This paper presents an approach to optimal dimensioning of active cell balancing architectures, which are of increasing relevance in Electrical Energy Storages (EESs) for Electric Vehicles (EVs) or stationary applications such as smart grids. Active cell balancing equalizes the state of charge of cells within a battery pack via charge transfers, increasing the effective capacity and lifetime. While optimization approaches have been introduced into the design process of several aspects of EESs, active cell balancing architectures have, until now, not been systematically optimized in terms of their components. Therefore, this paper analyzes existing architectures to develop design metrics for energy dissipation, installation volume, and balancing current. Based on these design metrics, a methodology to efficiently obtain Pareto-optimal configurations for a wide range of inductors and transistors at different balancing currents is developed. Our methodology is then applied to a case study, optimizing two state-of-the-art architectures using realistic balancing algorithms. The results give evidence of the applicability of systematic optimization in the domain of cell balancing, leading to higher energy efficiencies with minimized installation space.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"150 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77426582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
B. Singh, Arunprasath Shankar, F. Wolff, C. Papachristou, D. Weyer, Steve Clay
Semiconductor companies often use 3rd party IPs in order to improve their design productivity. In practice, there are risks involved in using a 3rd party IP as bugs may creep in due to versioning issues, poor documentation, and mismatches between specification and RTL. As a result of this, 3rd party IP specification and RTL must be carefully evaluated. Our methodology addresses this issue, which cross-correlates specification and RTL to discover these discrepancies. The key innovative ideas in our approach are to use prior and trusted experience about designs, which include their specs and RTL code. Also, we have captured this trusted experience into two knowledge bases (KB), Spec-KB and RTL-KB. Finally, knowledge base rules are used to cross-correlate the RTL blocks to the specs. We have tested our approach by analyzing several 3rd party IPs. We have defined metrics for specification coverage and RTL identification coverage to quantify our results.
{"title":"Cross-correlation of specification and RTL for soft IP analysis","authors":"B. Singh, Arunprasath Shankar, F. Wolff, C. Papachristou, D. Weyer, Steve Clay","doi":"10.7873/DATE.2014.303","DOIUrl":"https://doi.org/10.7873/DATE.2014.303","url":null,"abstract":"Semiconductor companies often use 3rd party IPs in order to improve their design productivity. In practice, there are risks involved in using a 3rd party IP as bugs may creep in due to versioning issues, poor documentation, and mismatches between specification and RTL. As a result of this, 3rd party IP specification and RTL must be carefully evaluated. Our methodology addresses this issue, which cross-correlates specification and RTL to discover these discrepancies. The key innovative ideas in our approach are to use prior and trusted experience about designs, which include their specs and RTL code. Also, we have captured this trusted experience into two knowledge bases (KB), Spec-KB and RTL-KB. Finally, knowledge base rules are used to cross-correlate the RTL blocks to the specs. We have tested our approach by analyzing several 3rd party IPs. We have defined metrics for specification coverage and RTL identification coverage to quantify our results.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"43 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79485273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Novo, Nazanin Farahpour, P. Ienne, U. Ahmad, F. Catthoor
Worst-case design is one of the keys to practical engineering: create solutions that can withstand the most adverse possible conditions. Yet, the ever-growing need for higher energy efficiency suggest a grim outlook for worst-case design in the future. In this paper, we propose opportunistic runtime approximations to enable a continuous adaptation of the processing precision (operator type and bitwidth) to the actual execution context without modifying the algorithm functionality. We show that by relaxing the processing precision whenever possible, a VLSI implementation of an advanced wireless receiver algorithm based on opportunistic run-time approximations can save about 40% of the energy consumed by an optimized static implementation. These energy savings are achieved at the expense of a slight increase in overall chip area.
{"title":"Energy efficient MIMO processing: A case study of opportunistic run-time approximations","authors":"D. Novo, Nazanin Farahpour, P. Ienne, U. Ahmad, F. Catthoor","doi":"10.7873/DATE.2014.220","DOIUrl":"https://doi.org/10.7873/DATE.2014.220","url":null,"abstract":"Worst-case design is one of the keys to practical engineering: create solutions that can withstand the most adverse possible conditions. Yet, the ever-growing need for higher energy efficiency suggest a grim outlook for worst-case design in the future. In this paper, we propose opportunistic runtime approximations to enable a continuous adaptation of the processing precision (operator type and bitwidth) to the actual execution context without modifying the algorithm functionality. We show that by relaxing the processing precision whenever possible, a VLSI implementation of an advanced wireless receiver algorithm based on opportunistic run-time approximations can save about 40% of the energy consumed by an optimized static implementation. These energy savings are achieved at the expense of a slight increase in overall chip area.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"162 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78011960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}