2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP): Latest Publications

OpenMP device offloading to FPGA accelerators

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2017-07-01 DOI: 10.1109/ASAP.2017.7995280

Lukas Sommer, Jens Korinth, A. Koch

Future high-performance computing systems will need to include multiple specialized accelerators in a single heterogeneous system to overcome power-density limitations of CPU performance.

Citations: 45

DeepPump: Multi-pumping deep Neural Networks

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2017-07-01 DOI: 10.1109/ASAP.2017.7995281

Ruizhe Zhao, T. Todman, W. Luk, Xinyu Niu

This paper presents DeepPump, an approach that generates CNN hardware designs with multi-pumping, which have competitive performance when compared with previous designs. Future work includes integrating DeepPump with other optimisations, and providing further evaluations on various FPGA platforms.

Citations: 4

A Staged Memory Resource Management Method for CMP systems

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2017-07-01 DOI: 10.1109/ASAP.2017.7995264

Yangguo Liu, Junlin Lu, Dong Tong, Xu Cheng

Memory interference is a critical impediment to system performance in CMP systems. To address this problem, we first propose a Dynamically Proportional Bandwidth Throttling policy (DPBT), which dynamically throttles back memory-intensive applications based on their memory access behavior. DPBT achieves a more balanced memory bandwidth partitioning. Moreover, we improve the previous memory channel partitioning scheme by integrating it with bank partitioning. We further integrate DPBT with the improved memory channel partitioning scheme and a memory scheduling policy to leverage the architectural advantages, and present a Staged Memory Resource Management Method (SRM). Experimental results show that DPBT improves system throughput and fairness by 13.5% and 31.1%, respectively. SRM provides 27.1% better system throughput and 34.8% better system fairness.

Citations: 1

CFStore: Boosting Hybrid storage performance by device crossfire

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2017-07-01 DOI: 10.1109/ASAP.2017.7995265

Wei Zhou, D. Feng, Zhipeng Tan

Hybrid storage is widely deployed because it satisfies capacity and performance requirements in an economically viable fashion. As the technology improves rapidly, hybrid storage systems consisting of several types of SSDs will gradually be adopted. Existing work mostly concentrates on fully utilizing the high-performance device but neglects the capability of the low-performance device. This paper proposes a device crossfire method that boosts hybrid storage performance by efficiently leveraging both high-performance and low-performance devices. Performance-critical data are appropriately off-loaded to the low-performance device to exploit access parallelism. The implemented storage system, CFStore, exhibits good performance in experiments. Compared to the well-known hybrid storage system Hystor, CFStore improves throughput by 17.9%–42.6% and reduces latency by 15.9%–35.0%.

Citations: 1

An embedded scalable linear model predictive hardware-based controller using ADMM

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2017-07-01 DOI: 10.1109/ASAP.2017.7995276

Pei Zhang, Joseph Zambreno, Phillip H. Jones

Model predictive control (MPC) is a popular advanced model-based control algorithm for controlling systems that must respect a set of system constraints (e.g. actuator force limitations). However, the computing requirements of MPC limit the suitability of deploying its software implementation into embedded controllers requiring high update rates. This paper presents a scalable embedded MPC controller implemented on a field-programmable gate array (FPGA) coupled with an on-chip ARM processor. Our architecture implements an Alternating Direction Method of Multipliers (ADMM) approach for computing MPC controller commands. All computations are performed using floating-point arithmetic. We introduce a software/hardware (SW/HW) co-design methodology in which the ARM software can configure on-chip Block RAM to allow users to 1) configure the MPC controller for a wide range of plants, and 2) update at runtime the desired trajectory to track. Our hardware architecture has the flexibility to trade off the amount of hardware resources used (Block RAMs and DSPs) against the controller computing speed. For example, this flexibility makes it possible to control plants modeled by a large number of decision variables (i.e. a plant model using many Block RAMs) with a small number of computing resources (i.e. DSPs) at the cost of increased computing time. The hardware controller is verified using a Plant-on-Chip (PoC), which is configured to emulate a mass-spring system in real time. A major driving goal of this work is to architect an SW/HW platform that brings FPGAs a step closer to being widely adopted by advanced control algorithm designers for deploying their algorithms into embedded systems.
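
For readers unfamiliar with ADMM, the following is a minimal sketch of how it handles a constrained quadratic problem of the kind that arises in linear MPC. It is not the authors' architecture: the function name `admm_box_qp`, the diagonal cost (chosen so that no matrix inversion is needed), and the parameters `rho` and `iters` are all assumptions made for illustration.

```python
def admm_box_qp(q, c, lo, hi, rho=1.0, iters=200):
    """Minimise sum(0.5*q[i]*x[i]**2 + c[i]*x[i]) subject to lo[i] <= x[i] <= hi[i].

    Illustrative only: a diagonal cost is assumed so every update is closed-form.
    """
    n = len(q)
    z = [0.0] * n  # copy of the decision variables that is kept inside the box
    u = [0.0] * n  # scaled dual variables
    for _ in range(iters):
        # x-update: closed-form minimiser of the quadratic cost plus penalty term
        x = [(rho * (z[i] - u[i]) - c[i]) / (q[i] + rho) for i in range(n)]
        # z-update: project x + u onto the box constraints (e.g. actuator limits)
        z = [min(max(x[i] + u[i], lo[i]), hi[i]) for i in range(n)]
        # dual update: accumulate the remaining disagreement between x and z
        u = [u[i] + x[i] - z[i] for i in range(n)]
    return z


# First variable wants to be 4.0 but is limited to 3.0; second is unconstrained at -1.0.
solution = admm_box_qp(q=[2.0, 1.0], c=[-8.0, 1.0], lo=[0.0, -5.0], hi=[3.0, 5.0])
```

Each iteration uses only multiply-accumulate operations and a clamp, which is one reason this family of methods is often considered a good fit for DSP-based hardware.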

Citations: 8

Modeling and evaluation for gather/scatter operations in Vector-SIMD architectures

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2017-07-01 DOI: 10.1109/ASAP.2017.7995271

Hongbing Tan, Haiyan Chen, Sheng Liu, Jianguo Wu

Gather and scatter are state-of-the-art vector memory access modes in Vector-SIMD architectures. However, because these operations are stochastic and complicated, their hardware design lacks theoretical analysis and modeling. This paper proposes, for the first time, a model for gather/scatter operations on local vector memory. The model not only gives all possible distributions of access locations, calculates the probability of access conflicts, and predicts the number of access conflicts, but also provides theoretical guidance for performance optimization. The model is validated through experiments and can guide users in designing and optimizing implementations of gather/scatter operations more specifically.
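
To give a feel for what an access-conflict probability looks like, here is a minimal sketch under a simplifying assumption that is not taken from the paper: every SIMD lane picks a memory bank independently and uniformly at random. The function names and the `lanes`/`banks` parameters are hypothetical.

```python
from itertools import product


def conflict_probability(lanes, banks):
    """Probability that at least two lanes address the same bank (closed form)."""
    if lanes > banks:
        return 1.0  # pigeonhole: some bank must be hit twice
    p_free = 1.0
    for i in range(lanes):
        # lane i must avoid the i banks already taken by earlier lanes
        p_free *= (banks - i) / banks
    return 1.0 - p_free


def conflict_probability_enumerated(lanes, banks):
    """Same quantity, obtained by enumerating every possible bank assignment."""
    total = 0
    conflicts = 0
    for assignment in product(range(banks), repeat=lanes):
        total += 1
        if len(set(assignment)) < lanes:  # a repeated bank means a conflict
            conflicts += 1
    return conflicts / total
```

Under this assumption the two functions agree; for instance, both give 0.58984375 for 4 lanes and 8 banks. Real access patterns are rarely uniform, so the numbers are only indicative.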

Citations: 2

Real-time object detection in software with custom vector instructions and algorithm changes

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2017-07-01 DOI: 10.1109/ASAP.2017.7995262

Joe Edwards, G. Lemieux

Real-time vision applications place stringent performance requirements on embedded systems. To meet performance requirements, embedded systems often require hardware implementations. This approach is unfavorable because hardware development can be time-consuming and difficult to debug, and can require extensive skill. This paper presents a case study of accelerating face detection, often part of a complex image processing pipeline, using a software/hardware hybrid approach. As a baseline, the algorithm is initially run on a scalar ARM Cortex-A9 application processor found on a Xilinx Zynq device. Next, using a previously designed vector engine implemented in the FPGA fabric, the algorithm is vectorized, using only standard vector instructions, to achieve a 25× speedup. Then, we accelerate the critical inner loops by adding two hardware-assisted custom vector instructions for an additional 10× speedup, yielding 248× speedup over the initial Cortex-A9 baseline. Collectively, the custom instructions require fewer than 800 lines of VHDL code, including comments and blank lines. Compared to previous hardware-only face detection systems, our work is 1.5 to 6.8 times faster. This approach demonstrates that good performance can be obtained from software-only vectorization, and a small amount of custom hardware can provide a significant acceleration boost.

Citations: 2

Hardware-accelerated CCD readout smear correction for Fast Solar Polarimeter

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2017-07-01 DOI: 10.1109/ASAP.2017.7995261

Stefan Tabel, Korbinian Weikl, W. Stechele

Shutterless frame store charge-coupled devices (CCDs) are commonly used in ground-based solar observations, but the characteristic readout smear error of such devices hinders the application of frame store CCDs to autonomous missions. The combination of polarimetric modulation and image accumulation prevents correction of this error via software-based post-facto processing if, in addition, microvibrations occur during flight. This paper presents the first FPGA-based architecture for online smear correction of images from frame store CCDs, which allows a particular frame store CCD camera to be used on a balloon-borne solar observatory. First, we explore fast convolution-based algorithms with respect to their suitability for implementation. Afterwards, a hardware architecture is derived and implemented. Our results show that 400 megapixel-sized frames can be corrected per second while maintaining an acceptable power consumption of less than 12 W. Finally, we discuss the circuit and show the degrees of freedom for further designs.
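
As a rough illustration of what readout smear correction involves, the sketch below uses a deliberately simple model that is an assumption, not the algorithm from the paper: the scene is static, and while a frame is shifted out each pixel collects an extra `k` times the total signal of its column, where `k` stands for the ratio of per-row shift time to exposure time. All function names are hypothetical.

```python
def column_sums(image):
    """Sum every column of a row-major image (a list of equally long rows)."""
    return [sum(column) for column in zip(*image)]


def add_smear(scene, k):
    """Assumed forward model: each pixel gains k times its column's total signal."""
    sums = column_sums(scene)
    return [[pixel + k * sums[c] for c, pixel in enumerate(row)] for row in scene]


def remove_smear(observed, k):
    """Invert the assumed model using only the observed (smeared) frame."""
    rows = len(observed)
    # Summing the forward model over one column gives
    # observed_sum = true_sum * (1 + k * rows), so the true sum is recoverable.
    offsets = [k * s / (1.0 + k * rows) for s in column_sums(observed)]
    return [[pixel - offsets[c] for c, pixel in enumerate(row)] for row in observed]
```

Under this model the correction is a per-column offset, so applying `remove_smear` to the output of `add_smear` recovers the original scene up to rounding error. It does not cover the microvibration case discussed in the abstract.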

Citations: 1

High-throughput area-efficient processor for 3GPP LTE cryptographic core algorithms

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2017-07-01 DOI: 10.1109/ASAP.2017.7995285

Yuanhong Huo, Dake Liu

LTE technology uses three sets of cryptographic algorithms, each based on one core algorithm. In high-end embedded systems, it is necessary to implement the three core algorithms (the block cipher AES-128 and the stream ciphers SNOW 3G and ZUC) with high performance and low silicon cost. This paper proposes a high-throughput ASIP (application-specific instruction-set processor) design (CP-LTE) for the three core algorithms.

Citations: 2

RVNet: A fast and high energy efficiency network packet processing system on RISC-V

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2017-07-01 DOI: 10.1109/ASAP.2017.7995266

Yanpeng Wang, M. Wen, Chunyuan Zhang, Jie Lin

RISC-V is a new open-source general-purpose instruction set architecture (ISA) developed at the University of California, Berkeley. It allows anyone to design hardware tailored to application characteristics and can be used in embedded devices, desktop computers, and high-performance servers. In this paper, we use a RISC-V processor to design a fast network packet processing system. It aims to provide faster network data processing for upper-layer applications in SDN and NFV at lower power and lower cost. According to results from our prototype on a Field Programmable Gate Array (FPGA), our system achieves performance comparable to DPDK, one of the fastest packet processing frameworks on the x86 platform. Notably, our system's packet-processing energy efficiency is about 7.75 times higher than that of DPDK.

Citations: 3
