Pub Date: 2017-08-09. DOI: 10.1109/TMSCS.2017.2737625
Cross-Layer Design Exploration for Energy-Quality Tradeoffs in Spiking and Non-Spiking Deep Artificial Neural Networks
Bing Han;Aayush Ankit;Abhronil Sengupta;Kaushik Roy
Deep learning convolutional artificial neural networks have achieved success in a large number of visual processing tasks and are currently used in many real-world applications, such as image search and speech recognition. However, despite achieving high accuracy in such classification problems, they require significant computational resources. Over the past few years, non-spiking deep convolutional artificial neural network models have evolved into more biologically realistic, event-driven spiking deep convolutional artificial neural networks. Recent research efforts have been directed at developing mechanisms to convert traditional non-spiking deep convolutional artificial neural networks to spiking ones, in which neurons communicate by means of spikes. However, there have been limited studies providing insight into the specific power, area, and energy benefits offered by spiking deep convolutional artificial neural networks in comparison to their non-spiking counterparts. We perform a comprehensive study of hardware implementations of spiking/non-spiking deep convolutional artificial neural networks on the MNIST, CIFAR10, and SVHN datasets. To this end, we design AccelNN, a Neural Network Accelerator, to execute neural network benchmarks and analyze the effects of circuit-architecture level techniques to harness event-drivenness. A comparative analysis between spiking and non-spiking versions of deep convolutional artificial neural networks is presented by examining trade-offs between recognition accuracy and the corresponding power, latency, and energy requirements.
{"title":"Cross-Layer Design Exploration for Energy-Quality Tradeoffs in Spiking and Non-Spiking Deep Artificial Neural Networks","authors":"Bing Han;Aayush Ankit;Abhronil Sengupta;Kaushik Roy","doi":"10.1109/TMSCS.2017.2737625","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2737625","url":null,"abstract":"Deep learning convolutional artificial neural networks have achieved success in a large number of visual processing tasks and are currently utilized for many real-world applications like image search and speech recognition among others. However, despite achieving high accuracy in such classification problems, they involve significant computational resources. Over the past few years, non-spiking deep convolutional artificial neural network models have evolved into more biologically realistic and event-driven spiking deep convolutional artificial neural networks. Recent research efforts have been directed at developing mechanisms to convert traditional non-spiking deep convolutional artificial neural networks to the spiking ones where neurons communicate by means of spikes. However, there have been limited studies providing insights on the specific power, area, and energy benefits offered by the spiking deep convolutional artificial neural networks in comparison to their non-spiking counterparts. We perform a comprehensive study for hardware implementation of spiking/non-spiking deep convolutional artificial neural networks on MNIST, CIFAR10, and SVHN datasets. To this effect, we design AccelNN - a Neural Network Accelerator to execute neural network benchmarks and analyze the effects of circuit-architecture level techniques to harness event-drivenness. A comparative analysis between spiking and non-spiking versions of deep convolutional artificial neural networks is presented by performing trade-offs between recognition accuracy and corresponding power, latency and energy requirements.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"613-623"},"PeriodicalIF":0.0,"publicationDate":"2017-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2737625","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68024190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-06-28. DOI: 10.1109/TMSCS.2017.2721160
Field-Programmable Crossbar Array (FPCA) for Reconfigurable Computing
Mohammed A. Zidan;YeonJoo Jeong;Jong Hoon Shin;Chao Du;Zhengya Zhang;Wei D. Lu
For decades, advances in electronics were driven directly by the scaling of CMOS transistors according to Moore's law. However, both CMOS scaling and the classical computer architecture are approaching fundamental and practical limits, and new computing architectures based on emerging devices, such as resistive random-access memory (RRAM), are expected to sustain the exponential growth of computing capability. Here, we propose a novel memory-centric, reconfigurable, general-purpose computing platform that is capable of handling explosive amounts of data in a fast and energy-efficient manner. The proposed architecture is based on a uniform, physical, resistive, memory-centric fabric that can be optimally reconfigured and utilized to perform different computing and data-storage tasks in a massively parallel fashion. The system can be tailored to achieve maximal energy efficiency based on the data flow, by dynamically allocating the basic computing fabric for storage, arithmetic, and analog computing, including neuromorphic computing tasks.
{"title":"Field-Programmable Crossbar Array (FPCA) for Reconfigurable Computing","authors":"Mohammed A. Zidan;YeonJoo Jeong;Jong Hoon Shin;Chao Du;Zhengya Zhang;Wei D. Lu","doi":"10.1109/TMSCS.2017.2721160","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2721160","url":null,"abstract":"For decades, advances in electronics were directly driven by the scaling of CMOS transistors according to Moore's law. However, both the CMOS scaling and the classical computer architecture are approaching fundamental and practical limits, and new computing architectures based on emerging devices, such as resistive random-access memory (RRAM) devices, are expected to sustain the exponential growth of computing capability. Here, we propose a novel memory-centric, reconfigurable, general purpose computing platform that is capable of handling the explosive amount of data in a fast and energy-efficient manner. The proposed computing architecture is based on a uniform, physical, resistive, memory-centric fabric that can be optimally reconfigured and utilized to perform different computing and data storage tasks in a massively parallel approach. The system can be tailored to achieve maximal energy efficiency based on the data flow by dynamically allocating the basic computing fabric for storage, arithmetic, and analog computing including neuromorphic computing tasks.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"698-710"},"PeriodicalIF":0.0,"publicationDate":"2017-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2721160","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68024065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-06-27. DOI: 10.1109/TMSCS.2017.2720660
DISASTER: Dedicated Intelligent Security Attacks on Sensor-Triggered Emergency Responses
Arsalan Mosenia;Susmita Sur-Kolay;Anand Raghunathan;Niraj K. Jha
Rapid technological advances in microelectronics, networking, and computer science have resulted in an exponential increase in the number of cyber-physical systems (CPSs) that enable numerous services in various application domains, e.g., smart homes and smart grids. Moreover, the emergence of the Internet-of-Things (IoT) paradigm has led to the pervasive use of IoT-enabled CPSs in our everyday lives. Unfortunately, as a side effect, the number of potential threats and feasible security attacks against CPSs has grown significantly. In this paper, we introduce a new class of attacks against CPSs, called dedicated intelligent security attacks against sensor-triggered emergency responses (DISASTER). DISASTER targets safety mechanisms deployed in automation/monitoring CPSs and exploits design flaws and security weaknesses of such mechanisms to trigger emergency responses even in the absence of a real emergency. Launching DISASTER can lead to serious consequences for three main reasons. First, almost all CPSs offer specific emergency responses and, as a result, are potentially susceptible to such attacks. Second, DISASTER can be easily designed to target a large number of CPSs, e.g., the anti-theft systems of all buildings in a residential community. Third, the widespread deployment of insecure sensors in already-in-use safety mechanisms, along with the endless variety of CPS-based applications, magnifies the impact of launching DISASTER. In addition to introducing DISASTER, we describe the serious consequences of such attacks. We demonstrate the feasibility of launching DISASTER against the two most widely-used CPSs: residential and industrial automation/monitoring systems. Moreover, we suggest several countermeasures that can potentially prevent DISASTER and discuss their advantages and drawbacks.
{"title":"DISASTER: Dedicated Intelligent Security Attacks on Sensor-Triggered Emergency Responses","authors":"Arsalan Mosenia;Susmita Sur-Kolay;Anand Raghunathan;Niraj K. Jha","doi":"10.1109/TMSCS.2017.2720660","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2720660","url":null,"abstract":"Rapid technological advances in microelectronics, networking, and computer science have resulted in an exponential increase in the number of cyber-physical systems (CPSs) that enable numerous services in various application domains, e.g., smart homes and smart grids. Moreover, the emergence of the Internet-of-Things (IoT) paradigm has led to the pervasive use of IoT-enabled CPSs in our everyday lives. Unfortunately, as a side effect, the numberof potential threats and feasible security attacks against CPSs has grown significantly. In this paper, we introduce a new class of attacks against CPSs, called dedicated intelligent security attacks against sensor-triggered emergency responses (DISASTER). DISASTER targets safety mechanisms deployed in automation/monitoring CPSs and exploits design flaws and security weaknesses of such mechanisms to trigger emergency responses even in the absence of a real emergency. Launching DISASTER can lead to serious consequences forthree main reasons. First, almost all CPSs offer specific emergency responses and, as a result, are potentially susceptible to such attacks. Second, DISASTER can be easily designed to target a large number of CPSs, e.g., the anti-theft systems of all buildings in a residential community. Third, the widespread deployment of insecure sensors in already-in-use safety mechanisms along with the endless variety of CPS-based applications magnifies the impact of launching DISASTER. In addition to introducing DISASTER, we describe the serious consequences of such attacks. We demonstrate the feasibility of launching DISASTER against the two most widely-used CPSs: residential and industrial automation/monitoring systems. Moreover, we suggest several countermeasures that can potentially prevent DISASTER and discuss their advantages and drawbacks.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"3 4","pages":"255-268"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2720660","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68021198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-04-28. DOI: 10.1109/TMSCS.2017.2699647
Data Transfers Analysis in Computer Assisted Design Flow of FPGA Accelerators for Aerospace Systems
Marco Lattuada;Fabrizio Ferrandi;Maxime Perrotin
The integration of Field Programmable Gate Arrays (FPGAs) in an aerospace system improves its efficiency and flexibility thanks to their programmability, but increases the design complexity. Design flows indeed have to be composed of several steps to fill the gap between the starting solution, usually a reference sequential implementation, and the final heterogeneous solution, which includes custom hardware accelerators. Among these steps are the analysis of the application to identify the functionalities that benefit from execution in hardware, and the generation of their implementations by means of Hardware Description Languages. Generating these descriptions can be a very difficult task for a software developer because of the different programming paradigms of software programs and hardware descriptions. To assist the developer in this activity, High Level Synthesis techniques have been developed, aiming at (semi-)automatically generating hardware implementations of specifications written in high-level languages (e.g., C). With respect to other embedded-systems scenarios, aerospace systems introduce further constraints that have to be taken into account during the design of these heterogeneous systems. In this type of system, explicit data transfers to and from FPGAs are preferred to the adoption of a shared-memory architecture: explicit transfers potentially improve the predictability of the produced solutions, but the sizes of all data transferred to and from any device must be known at design time. Identifying these sizes in the presence of complex C applications that use pointers is not an easy task. In this paper, a semi-automatic design flow based on the integration of an aerospace design flow, an application analysis technique, and High Level Synthesis methodologies is presented. The initial reference application is analyzed to identify the sizes of the data exchanged among the different components of the application. Next, starting from the high-level specification and from the results of this analysis, High Level Synthesis techniques are applied to automatically produce the hardware accelerators.
{"title":"Data Transfers Analysis in Computer Assisted Design Flow of FPGA Accelerators for Aerospace Systems","authors":"Marco Lattuada;Fabrizio Ferrandi;Maxime Perrotin","doi":"10.1109/TMSCS.2017.2699647","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2699647","url":null,"abstract":"The integration of Field Programmable Gate Arrays (FPGAs) in an aerospace system improves its efficiency and its flexibility thanks to their programmability, but increases the design complexity. The design flows indeed have to be composed of several steps to fill the gap between the starting solution, which is usually a reference sequential implementation, and the final heterogeneous solution which includes custom hardware accelerators. Among these steps, there are the analysis of the application to identify the functionalities that gain advantages in execution on hardware and the generation of their implementations by means of Hardware Description Languages. Generating these descriptions for a software developer can be a very difficult task because of the different programming paradigms of software programs and hardware descriptions. To facilitate the developer in this activity, High Level Synthesis techniques have been developed aiming at (semi-)automatically generating hardware implementations of specifications written in high level languages (e.g., C). With respect to other embedded systems scenarios, the aerospace systems introduce further constraints that have to be taken into account during the design of these heterogeneous systems. In this type of systems explicit data transfers to and from FPGAs are preferred to the adoption of a shared memory architecture. The first approach indeed potentially improves the predictability of the produced solutions, but the sizes of all the data transferred to and from any devices must be known at design time. Identifying the sizes in presence of complex C applications which use pointers can be a not so easy task. In this paper, a semi-automatic design flow based on the integration of an aerospace design flow, an application analysis technique, and High Level Synthesis methodologies is presented. The initial reference application is analyzed to identify which are the sizes of the data exchanged among the different components of the application. Next, starting from the high level specification and from the results of this analysis, High Level Synthesis techniques are applied to automatically produce the hardware accelerators.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 1","pages":"3-16"},"PeriodicalIF":0.0,"publicationDate":"2017-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2699647","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68003401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-04-24. DOI: 10.1109/TMSCS.2017.2696941
Analytical Modeling and Performance Benchmarking of On-Chip Interconnects with Rough Surfaces
Somesh Kumar;Rohit Sharma
In planar on-chip copper interconnects, conductor losses due to surface roughness demand explicit consideration for accurate modeling of their performance metrics. This is quite pertinent for high-performance manycore processors/servers, where on-chip interconnects are increasingly emerging as one of the key performance bottlenecks. This paper presents a novel analytical model for parameter extraction in current and future on-chip interconnects. Our proposed model aids in analyzing the impact of spatial and vertical surface roughness on their electrical performance. Our analysis clearly shows that, as technology nodes scale down, the effect of surface roughness becomes dominant and cannot be ignored. Based on AFM images of fabricated ultra-thin copper sheets, we have extracted roughness parameters to define realistic surface profiles using the well-known Mandelbrot-Weierstrass (MW) fractal function. For our analysis, we have considered four current and future interconnect technology nodes (i.e., 45, 22, 13, and 7 nm) and evaluated the impact of surface roughness on typical performance metrics such as delay, energy, and bandwidth. Results obtained using our model are verified against the industry-standard field solver Ansys HFSS as well as available experimental data, and exhibit accuracy within 9 percent. We present signal-integrity analysis using the eye diagram at 1, 5, 10, and 18 Gbps bit rates to quantify the increase in frequency-dependent losses due to surface roughness. Finally, simulating a standard three-line on-chip interconnect structure, we also report the computational overhead incurred for different values of roughness and technology nodes.
{"title":"Analytical Modeling and Performance Benchmarking of On-Chip Interconnects with Rough Surfaces","authors":"Somesh Kumar;Rohit Sharma","doi":"10.1109/TMSCS.2017.2696941","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2696941","url":null,"abstract":"In planar on-chip copper interconnects, conductor losses due to surface roughness demands explicit consideration for accurate modeling of their performance metrics. This is quite pertinent for high-performance manycore processors/servers, where on-chip interconnects are increasingly emerging as one of the key performance bottlenecks. This paper presents a novel analytical model for parameter extraction in current and future on-chip interconnects. Our proposed model aids in analyzing the impact of spatial and vertical surface roughness on their electrical performance. Our analysis clearly depicts that as the technology nodes scale down; the effect of the surface roughness becomes dominant and cannot be ignored. Based on AFM images of fabricated ultra-thin copper sheets, we have extracted roughness parameters to define realistic surface profiles using the well-known Mandelbrot-Weierstrass (MW) fractal function. For our analysis, we have considered four current and future interconnect technology nodes (i.e., 45, 22, 13, 7 nm) and evaluated the impact of surface roughness on typical performance metrics, such as delay, energy, and bandwidth. Results obtained using our model are verified by comparing with industry standard field solver Ansys HFSS as well as available experimental data that exhibits accuracy within 9 percent. We present signal integrity analysis using the eye diagram at 1, 5, 10, and 18 Gbps bit rates to find the increase in frequency dependent losses due to surface roughness. Finally, simulating a standard three line on-chip interconnect structure, we also report the computational overhead incurred for different values of roughness and technology nodes.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"272-284"},"PeriodicalIF":0.0,"publicationDate":"2017-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2696941","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68023871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-04-19. DOI: 10.1109/TMSCS.2017.2696003
Algorithm and Design of a Fully Parallel Approximate Coordinate Rotation Digital Computer (CORDIC)
Linbin Chen;Jie Han;Weiqiang Liu;Fabrizio Lombardi
This paper proposes a new approximate scheme for coordinate rotation digital computer (CORDIC) design. This scheme is based on modifying the existing Para-CORDIC architecture with an approximation that is inserted in multiple parts, made possible by relaxing the CORDIC algorithm itself. A fully parallel approximate CORDIC (FPAX-CORDIC) scheme is proposed; it avoids the memory register of Para-CORDIC and makes the generation of the rotation directions fully parallel. A comprehensive analysis and evaluation of the error introduced by the approximation, together with different circuit-related metrics, are pursued using HSPICE as the simulation tool. This error analysis also combines existing figures of merit for approximate computing (such as the Mean Error Distance (MED) and the MED Power Product (MPP)) with CORDIC-specific parameters. A good agreement between expected and simulated error values is found. The Discrete Cosine Transform (DCT) and the Inverse DCT (IDCT), as a case study of applying approximate computing to image processing, are investigated by utilizing the proposed approximate FPAX-CORDIC architecture under different accuracy requirements. The results confirm the viability of the proposed scheme.
IEEE Transactions on Multi-Scale Computing Systems, vol. 3, no. 3, pp. 139-151.
Pub Date: 2017-04-19. DOI: 10.1109/CoolChips.2017.7946386
Body Bias Control for Renewable Energy Source with a High Inner Resistance
Keita Azegami, Hayate Okuhara, H. Amano
Sensor nodes used in the Internet of Things (IoT) are required to operate for an extremely long time without battery replacement. Natural renewable energy, such as a solar battery, is a promising candidate for powering such nodes. Here, a power model for operating a Silicon-on-Insulator (SOI) device from a solar battery with a large inner resistance is proposed and applied to a micro-controller, V850E-star, and an accelerator, CMA-SOTB2. Unlike the ideal case, the maximum operational frequency was achieved with reverse biasing, by suppressing the leakage current that would otherwise decrease the supply voltage. Under room light, where the inner resistance matters most, a strong reverse bias is effective, while a relatively weak reverse bias is advantageous under bright light. The proposed model appears to be useful for estimating the appropriate body bias voltage for both the V850E-star and CMA-SOTB2. For the V850E-star, the estimated operational frequencies differed from those of the real chip, while they matched relatively well when CMA-SOTB2 was used under low illuminance.
{"title":"Body Bias Control for Renewable Energy Source with a High Inner Resistance","authors":"Keita Azegami, Hayate Okuhara, H. Amano","doi":"10.1109/CoolChips.2017.7946386","DOIUrl":"https://doi.org/10.1109/CoolChips.2017.7946386","url":null,"abstract":"Sensor nodes used in Internet of Things (IoT) are required to work an extremely long time without replacing the battery. Natural renewable energy such as a solar battery is a hopeful candidate for such nodes. Here, a power model for operating an Silicon on Insulator (SOI) device with a solar battery including a large inner resistance is proposed, and applied to a micro-controller V850E-star and an accelerator CMA-SOTB2. Unlike the ideal case, the maximum operational frequency was achieved with reverse biasing by suppressing the leakage current which decreases the supply voltage. Under the room light with a large inner resistance, the strong reverse bias is effective, while a relatively weak reverse bias is advantageous under the bright light. The proposed model is appeared to be useful to estimate the appropriate body bias voltage both for V850E-star and CMA-SOTB2. In the V850E-star, the estimated operational frequencies were different from the real chip, while they were relatively matched when CMA-SOTB2 was used under the low illuminance.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"487 1","pages":"605-612"},"PeriodicalIF":0.0,"publicationDate":"2017-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88943346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-04-18. DOI: 10.1109/TMSCS.2017.2695338
Evaluation of a BVH Construction Accelerator Architecture for High-Quality Visualization
Michael J. Doyle;Ciarán Tuohy;Michael Manzke
The ever-increasing demands of computer graphics applications have motivated the evolution of computer graphics hardware over the last 20 years. Early commodity graphics hardware was largely based on fixed-function components offering little flexibility. The gradual replacement of fixed-function hardware with more general-purpose instruction processors has enabled GPUs to deliver visual experiences more tailored to specific applications. This trend has culminated in modern GPUs essentially being programmable stream processors, capable of supporting a wide variety of applications in and outside of computer graphics. However, the growing concern of power efficiency in modern processors, coupled with an increasing demand for supporting next-generation graphics pipelines, has re-invigorated the debate on the use of fixed-function accelerators in these platforms. In this paper, we conduct a study of a heterogeneous, system-on-chip solution for the construction of a highly important data structure for computer graphics: the bounding volume hierarchy (BVH). This design incorporates conventional CPU cores alongside a fixed-function accelerator prototyped on a reconfigurable logic fabric. Our study supports earlier, simulation-only studies that argue for the introduction of this class of accelerator in future graphics processors.
{"title":"Evaluation of a BVH Construction Accelerator Architecture for High-Quality Visualization","authors":"Michael J. Doyle;Ciarán Tuohy;Michael Manzke","doi":"10.1109/TMSCS.2017.2695338","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2695338","url":null,"abstract":"The ever-increasing demands of computer graphics applications have motivated the evolution of computer graphics hardware over the last 20 years. Early commodity graphics hardware was largely based on fixed-function components offering little flexibility. The gradual replacement of fixed-function hardware with more general-purpose instruction processors, has enabled GPUs to deliver visual experiences more tailored to specific applications. This trend has culminated in modern GPUs essentially being programmable stream processors, capable of supporting a wide variety of applications in and outside of computer graphics. However, the growing concern of power efficiency in modern processors, coupled with an increasing demand for supporting next-generation graphics pipelines, has re-invigorated the debate on the use of fixed-function accelerators in these platforms. In this paper, we conduct a study of a heterogeneous, system-on-chip solution for the construction of a highly important data structure for computer graphics: the bounding volume hierarchy. This design incorporates conventional CPU cores alongside a fixed-function accelerator prototyped on a reconfigurable logic fabric. Our study supports earlier, simulation-only studies which argue for the introduction of this class of accelerator in future graphics processors.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 1","pages":"83-94"},"PeriodicalIF":0.0,"publicationDate":"2017-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2695338","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68003398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-04-18. DOI: 10.1109/TMSCS.2017.2695588
Memory-Efficient Probabilistic 2-D Finite Impulse Response (FIR) Filter
Mohammed Alawad;Mingjie Lin
High memory/storage complexity poses severe challenges to achieving high throughput and high energy efficiency in discrete 2-D FIR filtering. This performance bottleneck is especially acute for embedded image and video applications, which use 2-D FIR processing extensively, because real-time processing and low power consumption are their paramount design objectives. Fortunately, most such perception-based embedded applications possess so-called "inherent fault tolerance", meaning slight degradation in computing accuracy has little negative effect on their quality of results, but has significant implications for their throughput, hardware implementation cost, and energy efficiency. This paper develops a novel stochastic 2-D FIR filtering architecture that exploits the well-known probabilistic convolution theorem to achieve both low hardware cost and high energy efficiency, while achieving very high throughput and computing robustness. Our ASIC synthesis results show that the stochastic architecture achieves L outputs per cycle with 97 and 81 percent less area-delay product (ADP), and 77 and 67 percent less power consumption, compared with the conventional structure and a recently published state-of-the-art architecture, respectively, when the 2-D FIR filter size is 4 × 4, the input block size is L = 4, and the image size is 512 × 512.
{"title":"Memory-Efficient Probabilistic 2-D Finite Impulse Response (FIR) Filter","authors":"Mohammed Alawad;Mingjie Lin","doi":"10.1109/TMSCS.2017.2695588","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2695588","url":null,"abstract":"High memory/storage complexity poses severe challenges to achieving high throughput and high energy efficiency in discrete 2-D FIR filtering. This performance bottleneck is especially acute for embedded image or video applications, that use 2-D FIR processing extensively, because real-time processing and low power consumption are their paramount design objectives. Fortunately, most of such perception-based embedded applications possess so-called “inherent fault tolerance”, meaning slight computing accuracy degradation has a little negative effect on their quality of results, but has significant implication to their throughput, hardware implementation cost, and energy efficiency. This paper develops a novel stochastic-based 2-D FIR filtering architecture that exploits the well-known probabilistic convolution theorem to achieve both low hardware cost and high energy efficiency while achieving very high throughput and computing robustness. Our ASIC synthesis results show that stochastic-based architecture achieves L outputs per cycle with 97 and 81 percent less area-delay-product (ADP), and 77 and 67 percent less power consumption compared with the conventional structure and recently published state-of-the-art architecture, respectively, when the 2-D FIR filter size is 4 × 4, the input block size is L 1/4 4, and the image size is 512 × 512.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 1","pages":"69-82"},"PeriodicalIF":0.0,"publicationDate":"2017-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2695588","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68003400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-04-06. DOI: 10.1109/TMSCS.2017.2691701
Co-Scheduling Persistent Periodic and Dynamic Aperiodic Real-Time Tasks on Reconfigurable Platforms
Sangeet Saha;Arnab Sarkar;Amlan Chakrabarti;Ranjan Ghosh
As task preemption/relocation with acceptably low overheads becomes a reality in today's reconfigurable FPGAs, they are starting to show bright prospects as platforms for executing performance-critical task sets while allowing high resource utilization. Many performance-sensitive real-time systems, including those in automotive and avionics systems, chemical reactors, etc., often execute a set of persistent, periodic, safety-critical control tasks along with dynamic, event-driven aperiodic tasks. This work presents a co-scheduling framework for the combined execution of such periodic and aperiodic real-time tasks on fully and run-time partially reconfigurable platforms. Specifically, we present an admission control strategy and a preemptive scheduling methodology for dynamic aperiodic tasks in the presence of a set of persistent periodic tasks, such that aperiodic task rejections are minimized, resulting in high resource utilization. We use the 2D slotted area model, in which the floor of the FPGA is assumed to be statically equipartitioned into a set of tiles, into any of which an arbitrary task may be feasibly mapped. The experimental results reveal that the proposed scheduling strategies achieve high resource utilization with low task rejection rates over various simulation scenarios.
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 1, pp. 41-54.