Pub Date: 2017-08-09. DOI: 10.1109/TMSCS.2017.2737625
Cross-Layer Design Exploration for Energy-Quality Tradeoffs in Spiking and Non-Spiking Deep Artificial Neural Networks
Bing Han;Aayush Ankit;Abhronil Sengupta;Kaushik Roy
Deep learning convolutional artificial neural networks have achieved success in a large number of visual processing tasks and are currently used in many real-world applications, such as image search and speech recognition. However, despite achieving high accuracy in such classification problems, they require significant computational resources. Over the past few years, non-spiking deep convolutional artificial neural network models have evolved into more biologically realistic, event-driven spiking deep convolutional artificial neural networks. Recent research efforts have been directed at developing mechanisms to convert traditional non-spiking deep convolutional artificial neural networks to spiking ones, in which neurons communicate by means of spikes. However, there have been limited studies providing insight into the specific power, area, and energy benefits offered by spiking deep convolutional artificial neural networks in comparison to their non-spiking counterparts. We perform a comprehensive study of hardware implementations of spiking/non-spiking deep convolutional artificial neural networks on the MNIST, CIFAR10, and SVHN datasets. To this end, we design AccelNN, a Neural Network Accelerator, to execute neural network benchmarks and analyze the effects of circuit-architecture level techniques to harness event-drivenness. A comparative analysis between spiking and non-spiking versions of deep convolutional artificial neural networks is presented by examining trade-offs between recognition accuracy and the corresponding power, latency, and energy requirements.
{"title":"Cross-Layer Design Exploration for Energy-Quality Tradeoffs in Spiking and Non-Spiking Deep Artificial Neural Networks","authors":"Bing Han;Aayush Ankit;Abhronil Sengupta;Kaushik Roy","doi":"10.1109/TMSCS.2017.2737625","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2737625","url":null,"abstract":"Deep learning convolutional artificial neural networks have achieved success in a large number of visual processing tasks and are currently utilized for many real-world applications like image search and speech recognition among others. However, despite achieving high accuracy in such classification problems, they involve significant computational resources. Over the past few years, non-spiking deep convolutional artificial neural network models have evolved into more biologically realistic and event-driven spiking deep convolutional artificial neural networks. Recent research efforts have been directed at developing mechanisms to convert traditional non-spiking deep convolutional artificial neural networks to the spiking ones where neurons communicate by means of spikes. However, there have been limited studies providing insights on the specific power, area, and energy benefits offered by the spiking deep convolutional artificial neural networks in comparison to their non-spiking counterparts. We perform a comprehensive study for hardware implementation of spiking/non-spiking deep convolutional artificial neural networks on MNIST, CIFAR10, and SVHN datasets. To this effect, we design AccelNN - a Neural Network Accelerator to execute neural network benchmarks and analyze the effects of circuit-architecture level techniques to harness event-drivenness. A comparative analysis between spiking and non-spiking versions of deep convolutional artificial neural networks is presented by performing trade-offs between recognition accuracy and corresponding power, latency and energy requirements.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"613-623"},"PeriodicalIF":0.0,"publicationDate":"2017-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2737625","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68024190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-06-28. DOI: 10.1109/TMSCS.2017.2721160
Field-Programmable Crossbar Array (FPCA) for Reconfigurable Computing
Mohammed A. Zidan;YeonJoo Jeong;Jong Hoon Shin;Chao Du;Zhengya Zhang;Wei D. Lu
For decades, advances in electronics were driven directly by the scaling of CMOS transistors according to Moore's law. However, both CMOS scaling and the classical computer architecture are approaching fundamental and practical limits, and new computing architectures based on emerging devices, such as resistive random-access memory (RRAM), are expected to sustain the exponential growth of computing capability. Here, we propose a novel memory-centric, reconfigurable, general-purpose computing platform that is capable of handling explosive amounts of data in a fast and energy-efficient manner. The proposed architecture is based on a uniform, physical, resistive, memory-centric fabric that can be optimally reconfigured and utilized to perform different computing and data-storage tasks in a massively parallel fashion. The system can be tailored to achieve maximal energy efficiency based on the data flow, by dynamically allocating the basic computing fabric for storage, arithmetic, and analog computing, including neuromorphic computing tasks.
{"title":"Field-Programmable Crossbar Array (FPCA) for Reconfigurable Computing","authors":"Mohammed A. Zidan;YeonJoo Jeong;Jong Hoon Shin;Chao Du;Zhengya Zhang;Wei D. Lu","doi":"10.1109/TMSCS.2017.2721160","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2721160","url":null,"abstract":"For decades, advances in electronics were directly driven by the scaling of CMOS transistors according to Moore's law. However, both the CMOS scaling and the classical computer architecture are approaching fundamental and practical limits, and new computing architectures based on emerging devices, such as resistive random-access memory (RRAM) devices, are expected to sustain the exponential growth of computing capability. Here, we propose a novel memory-centric, reconfigurable, general purpose computing platform that is capable of handling the explosive amount of data in a fast and energy-efficient manner. The proposed computing architecture is based on a uniform, physical, resistive, memory-centric fabric that can be optimally reconfigured and utilized to perform different computing and data storage tasks in a massively parallel approach. The system can be tailored to achieve maximal energy efficiency based on the data flow by dynamically allocating the basic computing fabric for storage, arithmetic, and analog computing including neuromorphic computing tasks.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"698-710"},"PeriodicalIF":0.0,"publicationDate":"2017-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2721160","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68024065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-06-27. DOI: 10.1109/TMSCS.2017.2720660
DISASTER: Dedicated Intelligent Security Attacks on Sensor-Triggered Emergency Responses
Arsalan Mosenia;Susmita Sur-Kolay;Anand Raghunathan;Niraj K. Jha
Rapid technological advances in microelectronics, networking, and computer science have resulted in an exponential increase in the number of cyber-physical systems (CPSs) that enable numerous services in various application domains, e.g., smart homes and smart grids. Moreover, the emergence of the Internet-of-Things (IoT) paradigm has led to the pervasive use of IoT-enabled CPSs in our everyday lives. Unfortunately, as a side effect, the number of potential threats and feasible security attacks against CPSs has grown significantly. In this paper, we introduce a new class of attacks against CPSs, called dedicated intelligent security attacks against sensor-triggered emergency responses (DISASTER). DISASTER targets safety mechanisms deployed in automation/monitoring CPSs and exploits design flaws and security weaknesses of such mechanisms to trigger emergency responses even in the absence of a real emergency. Launching DISASTER can lead to serious consequences for three main reasons. First, almost all CPSs offer specific emergency responses and, as a result, are potentially susceptible to such attacks. Second, DISASTER can be easily designed to target a large number of CPSs, e.g., the anti-theft systems of all buildings in a residential community. Third, the widespread deployment of insecure sensors in already-in-use safety mechanisms, along with the endless variety of CPS-based applications, magnifies the impact of launching DISASTER. In addition to introducing DISASTER, we describe the serious consequences of such attacks. We demonstrate the feasibility of launching DISASTER against the two most widely-used CPSs: residential and industrial automation/monitoring systems. Moreover, we suggest several countermeasures that can potentially prevent DISASTER and discuss their advantages and drawbacks.
{"title":"DISASTER: Dedicated Intelligent Security Attacks on Sensor-Triggered Emergency Responses","authors":"Arsalan Mosenia;Susmita Sur-Kolay;Anand Raghunathan;Niraj K. Jha","doi":"10.1109/TMSCS.2017.2720660","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2720660","url":null,"abstract":"Rapid technological advances in microelectronics, networking, and computer science have resulted in an exponential increase in the number of cyber-physical systems (CPSs) that enable numerous services in various application domains, e.g., smart homes and smart grids. Moreover, the emergence of the Internet-of-Things (IoT) paradigm has led to the pervasive use of IoT-enabled CPSs in our everyday lives. Unfortunately, as a side effect, the numberof potential threats and feasible security attacks against CPSs has grown significantly. In this paper, we introduce a new class of attacks against CPSs, called dedicated intelligent security attacks against sensor-triggered emergency responses (DISASTER). DISASTER targets safety mechanisms deployed in automation/monitoring CPSs and exploits design flaws and security weaknesses of such mechanisms to trigger emergency responses even in the absence of a real emergency. Launching DISASTER can lead to serious consequences forthree main reasons. First, almost all CPSs offer specific emergency responses and, as a result, are potentially susceptible to such attacks. Second, DISASTER can be easily designed to target a large number of CPSs, e.g., the anti-theft systems of all buildings in a residential community. Third, the widespread deployment of insecure sensors in already-in-use safety mechanisms along with the endless variety of CPS-based applications magnifies the impact of launching DISASTER. In addition to introducing DISASTER, we describe the serious consequences of such attacks. We demonstrate the feasibility of launching DISASTER against the two most widely-used CPSs: residential and industrial automation/monitoring systems. Moreover, we suggest several countermeasures that can potentially prevent DISASTER and discuss their advantages and drawbacks.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"3 4","pages":"255-268"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2720660","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68021198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-04-28. DOI: 10.1109/TMSCS.2017.2699647
Data Transfers Analysis in Computer Assisted Design Flow of FPGA Accelerators for Aerospace Systems
Marco Lattuada;Fabrizio Ferrandi;Maxime Perrotin
The integration of Field Programmable Gate Arrays (FPGAs) in an aerospace system improves its efficiency and flexibility thanks to their programmability, but increases the design complexity. Design flows indeed have to be composed of several steps to fill the gap between the starting solution, usually a reference sequential implementation, and the final heterogeneous solution, which includes custom hardware accelerators. Among these steps are the analysis of the application to identify the functionalities that benefit from execution in hardware, and the generation of their implementations by means of Hardware Description Languages. Generating these descriptions can be a very difficult task for a software developer because of the different programming paradigms of software programs and hardware descriptions. To assist the developer in this activity, High Level Synthesis techniques have been developed, aiming at (semi-)automatically generating hardware implementations of specifications written in high-level languages (e.g., C). With respect to other embedded-systems scenarios, aerospace systems introduce further constraints that have to be taken into account during the design of these heterogeneous systems. In this type of system, explicit data transfers to and from FPGAs are preferred to the adoption of a shared-memory architecture: explicit transfers potentially improve the predictability of the produced solutions, but the sizes of all data transferred to and from any device must be known at design time. Identifying these sizes in the presence of complex C applications that use pointers is not an easy task. In this paper, a semi-automatic design flow based on the integration of an aerospace design flow, an application analysis technique, and High Level Synthesis methodologies is presented. The initial reference application is analyzed to identify the sizes of the data exchanged among the different components of the application. Next, starting from the high-level specification and from the results of this analysis, High Level Synthesis techniques are applied to automatically produce the hardware accelerators.
{"title":"Data Transfers Analysis in Computer Assisted Design Flow of FPGA Accelerators for Aerospace Systems","authors":"Marco Lattuada;Fabrizio Ferrandi;Maxime Perrotin","doi":"10.1109/TMSCS.2017.2699647","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2699647","url":null,"abstract":"The integration of Field Programmable Gate Arrays (FPGAs) in an aerospace system improves its efficiency and its flexibility thanks to their programmability, but increases the design complexity. The design flows indeed have to be composed of several steps to fill the gap between the starting solution, which is usually a reference sequential implementation, and the final heterogeneous solution which includes custom hardware accelerators. Among these steps, there are the analysis of the application to identify the functionalities that gain advantages in execution on hardware and the generation of their implementations by means of Hardware Description Languages. Generating these descriptions for a software developer can be a very difficult task because of the different programming paradigms of software programs and hardware descriptions. To facilitate the developer in this activity, High Level Synthesis techniques have been developed aiming at (semi-)automatically generating hardware implementations of specifications written in high level languages (e.g., C). With respect to other embedded systems scenarios, the aerospace systems introduce further constraints that have to be taken into account during the design of these heterogeneous systems. In this type of systems explicit data transfers to and from FPGAs are preferred to the adoption of a shared memory architecture. The first approach indeed potentially improves the predictability of the produced solutions, but the sizes of all the data transferred to and from any devices must be known at design time. Identifying the sizes in presence of complex C applications which use pointers can be a not so easy task. In this paper, a semi-automatic design flow based on the integration of an aerospace design flow, an application analysis technique, and High Level Synthesis methodologies is presented. The initial reference application is analyzed to identify which are the sizes of the data exchanged among the different components of the application. Next, starting from the high level specification and from the results of this analysis, High Level Synthesis techniques are applied to automatically produce the hardware accelerators.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 1","pages":"3-16"},"PeriodicalIF":0.0,"publicationDate":"2017-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2699647","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68003401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-04-24. DOI: 10.1109/TMSCS.2017.2696941
Analytical Modeling and Performance Benchmarking of On-Chip Interconnects with Rough Surfaces
Somesh Kumar;Rohit Sharma
In planar on-chip copper interconnects, conductor losses due to surface roughness demand explicit consideration for accurate modeling of their performance metrics. This is quite pertinent for high-performance manycore processors/servers, where on-chip interconnects are increasingly emerging as one of the key performance bottlenecks. This paper presents a novel analytical model for parameter extraction in current and future on-chip interconnects. Our proposed model aids in analyzing the impact of spatial and vertical surface roughness on their electrical performance. Our analysis clearly shows that, as technology nodes scale down, the effect of surface roughness becomes dominant and cannot be ignored. Based on AFM images of fabricated ultra-thin copper sheets, we have extracted roughness parameters to define realistic surface profiles using the well-known Mandelbrot-Weierstrass (MW) fractal function. For our analysis, we have considered four current and future interconnect technology nodes (i.e., 45, 22, 13, and 7 nm) and evaluated the impact of surface roughness on typical performance metrics such as delay, energy, and bandwidth. Results obtained using our model are verified against the industry-standard field solver Ansys HFSS as well as available experimental data, and exhibit accuracy within 9 percent. We present signal-integrity analysis using the eye diagram at 1, 5, 10, and 18 Gbps bit rates to quantify the increase in frequency-dependent losses due to surface roughness. Finally, simulating a standard three-line on-chip interconnect structure, we also report the computational overhead incurred for different values of roughness and technology nodes.
{"title":"Analytical Modeling and Performance Benchmarking of On-Chip Interconnects with Rough Surfaces","authors":"Somesh Kumar;Rohit Sharma","doi":"10.1109/TMSCS.2017.2696941","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2696941","url":null,"abstract":"In planar on-chip copper interconnects, conductor losses due to surface roughness demands explicit consideration for accurate modeling of their performance metrics. This is quite pertinent for high-performance manycore processors/servers, where on-chip interconnects are increasingly emerging as one of the key performance bottlenecks. This paper presents a novel analytical model for parameter extraction in current and future on-chip interconnects. Our proposed model aids in analyzing the impact of spatial and vertical surface roughness on their electrical performance. Our analysis clearly depicts that as the technology nodes scale down; the effect of the surface roughness becomes dominant and cannot be ignored. Based on AFM images of fabricated ultra-thin copper sheets, we have extracted roughness parameters to define realistic surface profiles using the well-known Mandelbrot-Weierstrass (MW) fractal function. For our analysis, we have considered four current and future interconnect technology nodes (i.e., 45, 22, 13, 7 nm) and evaluated the impact of surface roughness on typical performance metrics, such as delay, energy, and bandwidth. Results obtained using our model are verified by comparing with industry standard field solver Ansys HFSS as well as available experimental data that exhibits accuracy within 9 percent. We present signal integrity analysis using the eye diagram at 1, 5, 10, and 18 Gbps bit rates to find the increase in frequency dependent losses due to surface roughness. Finally, simulating a standard three line on-chip interconnect structure, we also report the computational overhead incurred for different values of roughness and technology nodes.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"272-284"},"PeriodicalIF":0.0,"publicationDate":"2017-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2696941","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68023871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-04-19. DOI: 10.1109/TMSCS.2017.2696003
Algorithm and Design of a Fully Parallel Approximate Coordinate Rotation Digital Computer (CORDIC)
Linbin Chen;Jie Han;Weiqiang Liu;Fabrizio Lombardi
This paper proposes a new approximate scheme for coordinate rotation digital computer (CORDIC) design. This scheme is based on modifying the existing Para-CORDIC architecture with an approximation that is inserted in multiple parts, made possible by relaxing the CORDIC algorithm itself. A fully parallel approximate CORDIC (FPAX-CORDIC) scheme is proposed; it avoids the memory register of Para-CORDIC and makes the generation of the rotation directions fully parallel. A comprehensive analysis and evaluation of the error introduced by the approximation, together with different circuit-related metrics, are pursued using HSPICE as the simulation tool. This error analysis also combines existing figures of merit for approximate computing (such as the Mean Error Distance (MED) and the MED Power Product (MPP)) with CORDIC-specific parameters. A good agreement between expected and simulated error values is found. The Discrete Cosine Transform (DCT) and the Inverse DCT (IDCT), as a case study of applying approximate computing to image processing, are investigated by utilizing the proposed approximate FPAX-CORDIC architecture under different accuracy requirements. The results confirm the viability of the proposed scheme.
IEEE Transactions on Multi-Scale Computing Systems, vol. 3, no. 3, pp. 139-151.
Pub Date: 2017-04-19. DOI: 10.1109/CoolChips.2017.7946386
Body Bias Control for Renewable Energy Source with a High Inner Resistance
Keita Azegami, Hayate Okuhara, H. Amano
Sensor nodes used in the Internet of Things (IoT) are required to operate for an extremely long time without battery replacement. Natural renewable energy, such as a solar battery, is a promising candidate for powering such nodes. Here, a power model for operating a Silicon-on-Insulator (SOI) device from a solar battery with a large inner resistance is proposed and applied to a micro-controller, V850E-star, and an accelerator, CMA-SOTB2. Unlike the ideal case, the maximum operational frequency was achieved with reverse biasing, by suppressing the leakage current that would otherwise decrease the supply voltage. Under room light, where the inner resistance matters most, a strong reverse bias is effective, while a relatively weak reverse bias is advantageous under bright light. The proposed model appears to be useful for estimating the appropriate body bias voltage for both the V850E-star and CMA-SOTB2. For the V850E-star, the estimated operational frequencies differed from those of the real chip, while they matched relatively well when CMA-SOTB2 was used under low illuminance.
{"title":"Body Bias Control for Renewable Energy Source with a High Inner Resistance","authors":"Keita Azegami, Hayate Okuhara, H. Amano","doi":"10.1109/CoolChips.2017.7946386","DOIUrl":"https://doi.org/10.1109/CoolChips.2017.7946386","url":null,"abstract":"Sensor nodes used in Internet of Things (IoT) are required to work an extremely long time without replacing the battery. Natural renewable energy such as a solar battery is a hopeful candidate for such nodes. Here, a power model for operating an Silicon on Insulator (SOI) device with a solar battery including a large inner resistance is proposed, and applied to a micro-controller V850E-star and an accelerator CMA-SOTB2. Unlike the ideal case, the maximum operational frequency was achieved with reverse biasing by suppressing the leakage current which decreases the supply voltage. Under the room light with a large inner resistance, the strong reverse bias is effective, while a relatively weak reverse bias is advantageous under the bright light. The proposed model is appeared to be useful to estimate the appropriate body bias voltage both for V850E-star and CMA-SOTB2. In the V850E-star, the estimated operational frequencies were different from the real chip, while they were relatively matched when CMA-SOTB2 was used under the low illuminance.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"487 1","pages":"605-612"},"PeriodicalIF":0.0,"publicationDate":"2017-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88943346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-04-18. DOI: 10.1109/TMSCS.2017.2695338
Evaluation of a BVH Construction Accelerator Architecture for High-Quality Visualization
Michael J. Doyle;Ciarán Tuohy;Michael Manzke
The ever-increasing demands of computer graphics applications have motivated the evolution of computer graphics hardware over the last 20 years. Early commodity graphics hardware was largely based on fixed-function components offering little flexibility. The gradual replacement of fixed-function hardware with more general-purpose instruction processors has enabled GPUs to deliver visual experiences more tailored to specific applications. This trend has culminated in modern GPUs essentially being programmable stream processors, capable of supporting a wide variety of applications in and outside of computer graphics. However, the growing concern of power efficiency in modern processors, coupled with an increasing demand for supporting next-generation graphics pipelines, has re-invigorated the debate on the use of fixed-function accelerators in these platforms. In this paper, we conduct a study of a heterogeneous, system-on-chip solution for the construction of a highly important data structure for computer graphics: the bounding volume hierarchy (BVH). This design incorporates conventional CPU cores alongside a fixed-function accelerator prototyped on a reconfigurable logic fabric. Our study supports earlier, simulation-only studies that argue for the introduction of this class of accelerator in future graphics processors.
{"title":"Evaluation of a BVH Construction Accelerator Architecture for High-Quality Visualization","authors":"Michael J. Doyle;Ciarán Tuohy;Michael Manzke","doi":"10.1109/TMSCS.2017.2695338","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2695338","url":null,"abstract":"The ever-increasing demands of computer graphics applications have motivated the evolution of computer graphics hardware over the last 20 years. Early commodity graphics hardware was largely based on fixed-function components offering little flexibility. The gradual replacement of fixed-function hardware with more general-purpose instruction processors, has enabled GPUs to deliver visual experiences more tailored to specific applications. This trend has culminated in modern GPUs essentially being programmable stream processors, capable of supporting a wide variety of applications in and outside of computer graphics. However, the growing concern of power efficiency in modern processors, coupled with an increasing demand for supporting next-generation graphics pipelines, has re-invigorated the debate on the use of fixed-function accelerators in these platforms. In this paper, we conduct a study of a heterogeneous, system-on-chip solution for the construction of a highly important data structure for computer graphics: the bounding volume hierarchy. This design incorporates conventional CPU cores alongside a fixed-function accelerator prototyped on a reconfigurable logic fabric. Our study supports earlier, simulation-only studies which argue for the introduction of this class of accelerator in future graphics processors.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 1","pages":"83-94"},"PeriodicalIF":0.0,"publicationDate":"2017-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2695338","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68003398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-04-18. DOI: 10.1109/TMSCS.2017.2695588
Memory-Efficient Probabilistic 2-D Finite Impulse Response (FIR) Filter
Mohammed Alawad;Mingjie Lin
High memory/storage complexity poses severe challenges to achieving high throughput and high energy efficiency in discrete 2-D FIR filtering. This performance bottleneck is especially acute for embedded image and video applications, which use 2-D FIR processing extensively, because real-time processing and low power consumption are their paramount design objectives. Fortunately, most such perception-based embedded applications possess so-called "inherent fault tolerance", meaning slight degradation in computing accuracy has little negative effect on their quality of results, but has significant implications for their throughput, hardware implementation cost, and energy efficiency. This paper develops a novel stochastic 2-D FIR filtering architecture that exploits the well-known probabilistic convolution theorem to achieve both low hardware cost and high energy efficiency, while achieving very high throughput and computing robustness. Our ASIC synthesis results show that the stochastic architecture achieves L outputs per cycle with 97 and 81 percent less area-delay product (ADP), and 77 and 67 percent less power consumption, compared with the conventional structure and a recently published state-of-the-art architecture, respectively, when the 2-D FIR filter size is 4 × 4, the input block size is L = 4, and the image size is 512 × 512.
{"title":"Memory-Efficient Probabilistic 2-D Finite Impulse Response (FIR) Filter","authors":"Mohammed Alawad;Mingjie Lin","doi":"10.1109/TMSCS.2017.2695588","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2695588","url":null,"abstract":"High memory/storage complexity poses severe challenges to achieving high throughput and high energy efficiency in discrete 2-D FIR filtering. This performance bottleneck is especially acute for embedded image or video applications, that use 2-D FIR processing extensively, because real-time processing and low power consumption are their paramount design objectives. Fortunately, most of such perception-based embedded applications possess so-called “inherent fault tolerance”, meaning slight computing accuracy degradation has a little negative effect on their quality of results, but has significant implication to their throughput, hardware implementation cost, and energy efficiency. This paper develops a novel stochastic-based 2-D FIR filtering architecture that exploits the well-known probabilistic convolution theorem to achieve both low hardware cost and high energy efficiency while achieving very high throughput and computing robustness. Our ASIC synthesis results show that stochastic-based architecture achieves L outputs per cycle with 97 and 81 percent less area-delay-product (ADP), and 77 and 67 percent less power consumption compared with the conventional structure and recently published state-of-the-art architecture, respectively, when the 2-D FIR filter size is 4 × 4, the input block size is L 1/4 4, and the image size is 512 × 512.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 1","pages":"69-82"},"PeriodicalIF":0.0,"publicationDate":"2017-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2695588","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68003400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-04-06. DOI: 10.1109/TMSCS.2017.2691701
Co-Scheduling Persistent Periodic and Dynamic Aperiodic Real-Time Tasks on Reconfigurable Platforms
Sangeet Saha;Arnab Sarkar;Amlan Chakrabarti;Ranjan Ghosh
As task preemption/relocation with acceptably low overheads becomes a reality in today's reconfigurable FPGAs, they are starting to show bright prospects as platforms for executing performance-critical task sets while allowing high resource utilization. Many performance-sensitive real-time systems, including those in automotive and avionics systems, chemical reactors, etc., often execute a set of persistent, periodic, safety-critical control tasks along with dynamic, event-driven aperiodic tasks. This work presents a co-scheduling framework for the combined execution of such periodic and aperiodic real-time tasks on fully and run-time partially reconfigurable platforms. Specifically, we present an admission control strategy and a preemptive scheduling methodology for dynamic aperiodic tasks in the presence of a set of persistent periodic tasks, such that aperiodic task rejections are minimized, resulting in high resource utilization. We use the 2D slotted area model, in which the floor of the FPGA is assumed to be statically equipartitioned into a set of tiles, into any of which an arbitrary task may be feasibly mapped. The experimental results reveal that the proposed scheduling strategies achieve high resource utilization with low task rejection rates over various simulation scenarios.
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 1, pp. 41-54.