
Latest publications from the 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

KV-FTL: A novel key-value based FTL scheme for large scale SSDs
Juan Li, Zhengguo Chen, Zhiguang Chen, Nong Xiao, Fang Liu, Wei Chen
Both traditional coarse-grained and fine-grained Flash Translation Layer (FTL) schemes are unsuitable for ultra-large SSDs: they produce too many mapping entries to be kept entirely in the embedded DRAM and suffer severely under workloads with low spatial and temporal locality. In this paper, we propose a novel KV-FTL for ultra-large SSDs that maps most logical addresses to physical addresses via a simple hash function, while handling hash collisions and out-of-place data updates in the traditional manner, i.e., through a mapping table. KV-FTL accelerates address translation by avoiding loading the mapping table from flash memory into DRAM, which improves performance, and reduces the write traffic incurred by the mapping table, which extends SSD lifespan. Experimental results show that KV-FTL extends SSD lifespan by up to 18.7% (13.6% on average), improves read performance by 18.4% to 50.7% (39% on average) with optimization, and, under extremely intensive request loads, improves access performance by 47% on average.
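As an illustration of the hashed-mapping idea described in this abstract, the following Python sketch models an FTL whose common-case translation is a pure hash computation, with a small table used only for hash collisions. The hash function, page counts, class and method names are hypothetical, and garbage collection and out-of-place updates are omitted; this is not the authors' actual KV-FTL implementation.

```python
NUM_PPN = 1 << 20                      # hypothetical number of physical pages

def hash_map(lpn: int) -> int:
    """Primary logical-to-physical mapping via a simple multiplicative hash."""
    return (lpn * 2654435761) % NUM_PPN

class ToyKVFTL:
    def __init__(self):
        self.table = {}                # fallback mapping entries (collisions, remaps)
        self.owner = {}                # ppn -> lpn, used to detect hash collisions

    def translate(self, lpn: int) -> int:
        # The small exception table is consulted first; in the common case the
        # address is computed directly, with no mapping-table load from flash.
        return self.table.get(lpn, hash_map(lpn))

    def write(self, lpn: int) -> int:
        ppn = hash_map(lpn)
        if self.owner.get(ppn, lpn) != lpn:
            # Another logical page already hashed to this PPN: fall back to an
            # explicit mapping entry pointing at a free physical page.
            ppn = self._free_page()
            self.table[lpn] = ppn
        self.owner[ppn] = lpn
        return ppn

    def _free_page(self) -> int:
        ppn = 0
        while ppn in self.owner:
            ppn += 1
        return ppn

ftl = ToyKVFTL()
ftl.write(42)
assert ftl.translate(42) == hash_map(42)   # common case: hash path, no table entry
```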
{"title":"KV-FTL: A novel key-value based FTL scheme for large scale SSDs","authors":"Juan Li, Zhengguo Chen, Zhiguang Chen, Nong Xiao, Fang Liu, Wei Chen","doi":"10.1109/HPCC-SmartCity-DSS.2017.14","DOIUrl":"https://doi.org/10.1109/HPCC-SmartCity-DSS.2017.14","url":null,"abstract":"Both traditional coarse-grained and fine-grained Flash Translation Layer schemes are unsuitable for ultra-large SSDs. They produce overmuch mapping entries which fail to be kept in embedded DRAM completely and can suffer severely from low spatial and temporal localities. In this paper, we propose a novel KV-FTL for ultra-large SSDs, which mostly maps logical addresses to physical addresses via a simple hash function, while handles hash collisions and out-of-place data updates by the traditional manner, i.e., the mapping table. Our KV-FTL can accelerate address translation by avoiding loading mapping table from flash memory to DRAM, thus improve performance; as well as reduce the write-traffic incurred by the mapping table, thus extend the lifespan of SSDs. Experimental results show that our KV-FTL facilitates SSDs to survive longer lifespan by a factor of up to 18.7% with an average of 13.6%; improves read performance ranging from 18.4% to 50.7% with an average of 39% with optimization, and in the case of extremely intensive requests, improves the access performance for requests with an average of 47%.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133300231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
DoSGuard: Protecting pipelined MPSoCs against hardware Trojan based DoS attacks
Amin Malekpour, R. Ragel, A. Ignjatović, S. Parameswaran
Billions of transistors on a chip and the power wall have led embedded systems to be designed with Multiprocessor System-on-Chip (MPSoC) architectures; one such organization is the Pipelined MPSoC (PMPSoC). Because many reliable and safety-critical systems are deployed on MPSoCs, denying their service can have severe consequences. One such threat is the insertion of a hardware Trojan that performs Denial-of-Service (DoS) attacks. DoSGuard presents a novel PMPSoC architecture that continues execution in the presence of DoS Trojans in Third-Party Intellectual Property (3PIP) cores. DoSGuard deploys two methods: one detects the presence of Trojans and recovers, and the other additionally identifies the 3PIPs under attack using buffer delays. While the state of the art incurs 3× area and power overheads, DoSGuard consumes 1.5M+3 area and leakage power (where M is the number of cores in the base system) and a small dynamic power overhead (the power consumption of the monitoring system). On a cycle-accurate commercial multiprocessor simulator, DoSGuard takes 531 clock cycles to detect a DoS attack. With DoSGuard, the throughput reduction due to a DoS attack varies with the application and the monitoring interval but is negligible (< 10⁻³%) for real-world scenarios, where millions of iterations take place.
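To put the quoted overheads in perspective, the short calculation below evaluates the two cost expressions from the abstract for a few pipeline sizes. It assumes that the 3× figure of prior work means triplicating all M base cores and that 1.5M+3 is expressed in the same core-equivalent units; both assumptions are ours, for illustration only.

```python
def resource_cost_prior(m: int) -> float:
    """Prior Trojan-tolerance schemes: roughly triple the M base cores."""
    return 3.0 * m

def resource_cost_dosguard(m: int) -> float:
    """DoSGuard's reported area/leakage cost as a function of core count M."""
    return 1.5 * m + 3

for m in (4, 8, 16):
    ratio = resource_cost_dosguard(m) / resource_cost_prior(m)
    print(f"M={m:2d}: DoSGuard needs {ratio:.2f}x the resources of 3x replication")
```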
{"title":"DoSGuard: Protecting pipelined MPSoCs against hardware Trojan based DoS attacks","authors":"Amin Malekpour, R. Ragel, A. Ignjatović, S. Parameswaran","doi":"10.1109/ASAP.2017.7995258","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995258","url":null,"abstract":"Billions of transistors on a chip and the power wall made embedded systems to be designed with Multiprocessor System-on-Chip (MPSoC) architectures. One utilization of MPSoCs is the Pipelined MPSoCs (PMPSoCs). As many reliable and safety critical systems are deployed with MPSoCs, denying their service would have adverse effects. One such possibility is the insertion of a hardware Trojan that performs Denial of Service (DoS) attacks. DoSGuard present a novel PMPSoC architecture that continues its execution in the presence of DoS Trojans in Third Party Intellectual Property (3PIP) cores. DoSGuard deploys two methods; one can detect the presence of Trojans and recover, and the other can also identify the 3PIPs under attack using buffer delays. While the state of the art incurs 3× area and power overheads, DoSGuard consumes 1.5M+3 area and leakage power (M is the number of cores in the base system) and a small (the power consumption of the monitoring system) dynamic power overheads. On a cycle accurate commercial multiprocessor simulator, DoSGuard takes 531 clock cycles to detect a DoS attack. With DoSGuard the throughput reduction due to a DoS attack varies with the application and the monitoring interval but is negligible (< 10−3%) for real world scenarios, where millions of iterations take place.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129088744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Hierarchical Dataflow Model for efficient programming of clustered manycore processors
J. Hascoet, K. Desnos, J. Nezan, B. Dinechin
Programming Multiprocessor Systems-on-Chip (MPSoCs) with hundreds of heterogeneous Processing Elements (PEs), complex memory architectures, and Networks-on-Chip (NoCs) remains a challenge for embedded system designers. Dataflow Models of Computation (MoCs) are increasingly used for developing parallel applications, as their high level of abstraction eases the automation of mapping, task scheduling, and memory allocation onto MPSoCs. This paper introduces a technique for deploying hierarchical dataflow graphs efficiently onto MPSoCs. The proposed technique exploits different granularities of dataflow parallelism to generate both NoC-based communications and nested OpenMP loops. Deployment of an image processing application on a many-core MPSoC yields speedups of up to 58.7× compared to sequential execution.
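A minimal sketch of the deployment idea, assuming a toy two-level dataflow description in which top-level actors are mapped to compute clusters (communicating over the NoC) and each actor's inner graph is expanded into a data-parallel loop, the role played by nested OpenMP loops in the paper. The graph format and function below are illustrative, not the authors' toolflow.

```python
# Toy hierarchical graph: each top-level actor carries a cluster assignment,
# the number of inner firings it can run in parallel, and its successors.
hierarchical_graph = {
    "read":   {"cluster": 0, "inner_iterations": 1,  "successors": ["filter"]},
    "filter": {"cluster": 1, "inner_iterations": 64, "successors": ["write"]},
    "write":  {"cluster": 2, "inner_iterations": 1,  "successors": []},
}

def deploy(graph):
    """Emit a coarse deployment plan: NoC channels between clusters and the
    degree of intra-actor parallelism to exploit inside each cluster."""
    for name, actor in graph.items():
        for succ in actor["successors"]:
            src, dst = actor["cluster"], graph[succ]["cluster"]
            if src != dst:
                print(f"NoC channel: {name}@cluster{src} -> {succ}@cluster{dst}")
        if actor["inner_iterations"] > 1:
            print(f"{name}: parallel-for over {actor['inner_iterations']} "
                  f"firings on cluster {actor['cluster']}")

deploy(hierarchical_graph)
```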
{"title":"Hierarchical Dataflow Model for efficient programming of clustered manycore processors","authors":"J. Hascoet, K. Desnos, J. Nezan, B. Dinechin","doi":"10.1109/ASAP.2017.7995270","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995270","url":null,"abstract":"Programming Multiprocessor Systems-on-Chips (MPSoCs) with hundreds of heterogeneous Processing Elements (PEs), complex memory architectures, and Networks-on-Chips (NoCs) remains a challenge for embedded system designers. Dataflow Models of Computation (MoCs) are increasingly used for developing parallel applications as their high-level of abstraction eases the automation of mapping, task scheduling and memory allocation onto MPSoCs. This paper introduces a technique for deploying hierarchical dataflow graphs efficiently onto MPSoC. The proposed technique exploits different granularity of dataflow parallelism to generate both NoC-based communications and nested OpenMP loops. Deployment of an image processing application on a many-core MPSoC results in speedups of up to 58.7 compared to the sequential execution.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125105737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
An efficient embedded multi-ported memory architecture for next-generation FPGAs
S. N. Shahrouzi, D. Perera
In recent years, there has been a dramatic increase in the use of FPGAs to enhance the speed-performance of many real-time compute- and data-intensive applications on embedded platforms. FPGA-based designs leverage parallelism in computations to achieve high speed-performance, and parallel computations require multi-ported memories that provide any number of ports for simultaneous multiple read/write (R/W) operations. Although several multi-ported memories have been proposed in the literature, these designs become complex due to the extra logic and routing required to provide an arbitrary number of R/W ports. In this work, we introduce a novel and efficient multi-ported memory architecture built from simple dual-port BRAMs that provides an arbitrary number of R/W ports. Apart from the BRAMs, our proposed design consists only of Decision Making Modules and a counter, which simplifies the design process; the R/W operations within our architecture are also straightforward. Experiments are performed to evaluate the feasibility and efficiency of our multi-ported memory architecture, and we compare it against the most recently proposed multi-ported memory designs from the literature, implemented using LVT and XOR techniques. FPGA manufacturers could employ our multi-ported memory architecture to accelerate real-time compute/data-intensive applications on their next-generation FPGAs. Owing to its lower design complexity compared to existing designs, our simplified memory architecture would enable seamless integration into existing FPGA-based CAD tools with minimal design cost.
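For context on the prior techniques mentioned above, the sketch below is a behavioral model of the LVT (Live Value Table) approach: each write port gets its own bank replica (buildable from dual-port BRAMs), and a small table records which replica holds the live value for every address so that any number of read ports can be served. This models the baseline the authors compare against, not their proposed Decision Making Module architecture.

```python
class LVTMultiPortedMemory:
    """Behavioral model of an LVT-based multi-ported memory."""

    def __init__(self, depth: int, n_write_ports: int):
        # One full bank replica per write port (each replica maps to dual-port BRAMs).
        self.banks = [[0] * depth for _ in range(n_write_ports)]
        # Live Value Table: which replica holds the newest value for each address.
        self.lvt = [0] * depth

    def write(self, port: int, addr: int, data: int) -> None:
        self.banks[port][addr] = data
        self.lvt[addr] = port          # remember which replica is now live

    def read(self, addr: int) -> int:
        return self.banks[self.lvt[addr]][addr]

mem = LVTMultiPortedMemory(depth=16, n_write_ports=2)
mem.write(0, addr=5, data=111)         # write port 0
mem.write(1, addr=5, data=222)         # write port 1 overwrites the live value
assert mem.read(5) == 222
```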
{"title":"An efficient embedded multi-ported memory architecture for next-generation FPGAs","authors":"S. N. Shahrouzi, D. Perera","doi":"10.1109/ASAP.2017.7995263","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995263","url":null,"abstract":"In recent years, there has been a dramatic increase in utilization of FPGAs to enhance the speed-performance of many real-time compute and data intensive applications on embedded platforms. FPGA-based designs leverage parallelism in computations to achieve high speed-performance. Parallel computations require multi-ported memories to provide any number of ports for simultaneous multiple read/write (R/W) operations. Although several multi-ported memories are proposed in the literature, these designs become complex due to the extra logic and routing used for techniques/architectures to provide an arbitrary number of R/W ports. In this research work, we introduce a novel and efficient multi-ported memory architecture utilizing simple dual-port BRAMs, to provide an arbitrary number of R/W ports. Apart from the BRAMs, our proposed multi-ported memory design only consists of the Decision Making Modules and a counter, thus simplifying the design process. The R/W operations within our architecture are also straightforward. Experiments are performed to evaluate the feasibility and efficiency of our multi-ported memory architecture. We also evaluate our architecture with the most recently proposed multi-ported memory designs, implemented using LVT and XOR techniques, from the existing literature. FPGA manufacturers could employ our multi-ported memory architecture to accelerate real-time compute/data intensive applications with their next-generation FPGAs. Due to lower design complexity compared to the existing designs, our simplified memory architecture would enable seamless integration to the existing FPGA-based CAD tools with minimal design cost.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122389523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
Hardware support for embedded operating system security
Arman Pouraghily, T. Wolf, R. Tessier
Internet-connected embedded systems have limited capabilities to defend themselves against remote hacking attacks. Such attacks, however, can have a significant impact in the context of the Internet of Things, industrial control systems, smart health systems, and similar domains. Embedded systems cannot effectively utilize existing software-based protection mechanisms due to their limited processing capabilities and energy resources. We propose a novel hardware-based monitoring technique that detects whether the embedded operating system or any running application deviates from its originally programmed behavior due to an attack, and we present an FPGA-based prototype implementation that demonstrates the effectiveness of this security approach.
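The sketch below is a software model of one common way such monitoring can work, checking observed control-flow transfers against a whitelist extracted offline from the program binary. The edge set and addresses are hypothetical, and the paper's FPGA monitor may track different events entirely.

```python
# Hypothetical set of legitimate control-flow edges (source PC, destination PC)
# extracted offline from the program image.
VALID_EDGES = {(0x100, 0x140), (0x140, 0x100), (0x140, 0x200)}

def check_transfer(src_pc: int, dst_pc: int) -> bool:
    """Return True if a control-flow transfer matches the programmed behavior."""
    ok = (src_pc, dst_pc) in VALID_EDGES
    if not ok:
        print(f"ALERT: unexpected transfer {hex(src_pc)} -> {hex(dst_pc)}")
    return ok

check_transfer(0x100, 0x140)   # legitimate edge, passes silently
check_transfer(0x100, 0x666)   # injected behavior, raises an alert
```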
{"title":"Hardware support for embedded operating system security","authors":"Arman Pouraghily, T. Wolf, R. Tessier","doi":"10.1109/ASAP.2017.7995260","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995260","url":null,"abstract":"Internet-connected embedded systems have limited capabilities to defend themselves against remote hacking attacks. The potential effects of such attacks, however, can have a significant impact in the context of the Internet of Things, industrial control systems, smart health systems, etc. Embedded systems cannot effectively utilize existing software-based protection mechanisms due to limited processing capabilities and energy resources. We propose a novel hardware-based monitoring technique that can detect if the embedded operating system or any running application deviates from the originally programmed behavior due to an attack. We present an FPGA-based prototype implementation that shows the effectiveness of such a security approach.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"159 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121686759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Fast and efficient implementation of Convolutional Neural Networks on FPGA
Abhinav Podili, Chi Zhang, V. Prasanna
State-of-the-art CNN models for image recognition use deep networks with small filters instead of shallow networks with large filters, because the former require fewer weights. In light of this trend, we present a fast and efficient FPGA-based convolution engine to accelerate CNN models with small filters. The convolution engine implements the Winograd minimal filtering algorithm to reduce the number of multiplications by 38% to 55% for state-of-the-art CNNs. We exploit the parallelism of the Winograd convolution engine to scale overall performance and show that our design sustains the peak throughput of the convolution engines. We also propose a novel data layout that halves the required memory bandwidth of the design. One noteworthy feature of our Winograd convolution engine is that it hides the computation latency of the pooling layer. As a case study, we implement the VGG16 CNN model and compare it with previous approaches. Compared with the state-of-the-art reduced-precision VGG16 implementation, our implementation achieves a 1.2× improvement in throughput while using 3× fewer multipliers and 2× less on-chip memory, without impacting classification accuracy. The improvements in throughput per multiplier and throughput per unit of on-chip memory are 3.7× and 2.47×, respectively, compared with the state-of-the-art design.
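The multiplication savings come from Winograd minimal filtering. As a reference point, the 1D transform F(2,3) below computes two outputs of a 3-tap filter with 4 multiplications instead of 6, and the 2D F(2×2,3×3) form typically used for CNN layers needs 16 multiplications instead of 36, roughly the 55% reduction quoted above. This is a textbook sketch of the algorithm, not the paper's FPGA datapath.

```python
def winograd_f23(d, g):
    """F(2,3): two outputs of a 3-tap convolution from 4 inputs, 4 multiplies."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_f23(d, g):
    """Reference direct computation of the same two outputs (6 multiplies)."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]

d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.25]
out, ref = winograd_f23(d, g), direct_f23(d, g)
assert all(abs(a - b) < 1e-12 for a, b in zip(out, ref))
```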
{"title":"Fast and efficient implementation of Convolutional Neural Networks on FPGA","authors":"Abhinav Podili, Chi Zhang, V. Prasanna","doi":"10.1109/ASAP.2017.7995253","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995253","url":null,"abstract":"State-of-the-art CNN models for Image recognition use deep networks with small filters instead of shallow networks with large filters, because the former requires fewer weights. In the light of above trend, we present a fast and efficient FPGA based convolution engine to accelerate CNN models over small filters. The convolution engine implements Winograd minimal filtering algorithm to reduce the number of multiplications by 38% to 55% for state-of-the-art CNNs. We exploit the parallelism of the Winograd convolution engine to scale the overall performance. We show that our overall design sustains the peak throughput of the convolution engines. We propose a novel data layout to reduce the required memory bandwidth of our design by half. One noteworthy feature of our Winograd convolution engine is that it hides the computation latency of the pooling layer. As a case study we implement VGG16 CNN model and compare it with previous approaches. Compared with the state-of-the-art reduced precision VGG16 implementation, our implementation achieves 1.2× improvement in throughput by using 3× less multipliers and 2× less on-chip memory without impacting the classification accuracy. The improvements in throughput per multiplier and throughput per unit on-chip memory are 3.7× and 2.47× respectively, compared with the state-of-the-art design.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114994991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 57
OpenCL-based design pattern for line rate packet processing
Jehandad Khan, P. Athanas, S. Booth, John Marshall
The ever-changing nature of network technology requires a flexible platform that can evolve with it. In this work, a complete networking switch designed in OpenCL is presented, identifying several high-level constructs that form the building blocks of any network application targeting FPGAs. These include the notion of an on-chip global memory and kernels that process data continuously without host intervention. The use of OpenCL is motivated by the ability to change designs rapidly and to keep them maintainable by a wider developer community. Parts of the design that cannot be realized with current OpenCL technology are also identified, and a solution to the problem is presented.
{"title":"OpenCL-based design pattern for line rate packet processing","authors":"Jehandad Khan, P. Athanas, S. Booth, John Marshall","doi":"10.1109/ASAP.2017.7995278","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995278","url":null,"abstract":"The ever changing nature of network technology requires a flexible platform that can change as the technology evolves. In this work, a complete networking switch designed in OpenCL is presented, identifying several high-level constructs that form the building blocks of any network application targeting FPGAs. These include the notion of an on-chip global memory and kernels constantly processing data without the intervention of the host. The use of OpenCL is motivated by the ability to rapidly change designs and to be maintainable by a wider developer community. Pieces of the design that cannot be realized using current OpenCL technology are also identified and a solution to the problem is presented.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114577043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
High-performance FPGA implementation of equivariant adaptive separation via independence algorithm for Independent Component Analysis
M. Nazemi, Shahin Nazarian, Massoud Pedram
Independent Component Analysis (ICA) is a dimensionality reduction technique that can boost the efficiency of machine learning models that deal with probability density functions, e.g., Bayesian neural networks. Algorithms that implement adaptive ICA converge more slowly than their non-adaptive counterparts; however, they are capable of tracking changes in the underlying distributions of input features. This intrinsically slow convergence of adaptive methods, combined with existing hardware implementations that operate at very low clock frequencies, necessitates fundamental improvements in both algorithm and hardware design. This paper presents an algorithm that allows efficient hardware implementation of ICA. Compared to previous work, our FPGA implementation of adaptive ICA improves clock frequency by at least one order of magnitude and throughput by at least two orders of magnitude. The proposed algorithm is not limited to ICA and can be used in various machine learning problems that employ stochastic gradient descent optimization.
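For reference, the core of the EASI algorithm is a serial relative-gradient update of the separating matrix. The sketch below shows the standard textbook form of the rule with a tanh nonlinearity; the paper's hardware-oriented reformulation of this update may differ in detail, and the mixing matrix, learning rate, and source model are invented for the example.

```python
import numpy as np

def easi_step(B, x, lr=0.01, g=np.tanh):
    """One EASI update: separate y = Bx, then apply the relative-gradient
    correction B <- B - lr * (y y^T - I + g(y) y^T - y g(y)^T) B."""
    y = B @ x
    I = np.eye(B.shape[0])
    G = np.outer(y, y) - I + np.outer(g(y), y) - np.outer(y, g(y))
    return B - lr * (G @ B)

# Toy usage: drive the adaptive update with streaming, instantaneously mixed samples.
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.6], [0.4, 1.0]])     # unknown mixing matrix
B = np.eye(2)                              # separating matrix estimate
for _ in range(2000):
    s = rng.laplace(size=2)                # two independent source samples
    B = easi_step(B, A @ s, lr=0.002)
```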
{"title":"High-performance FPGA implementation of equivariant adaptive separation via independence algorithm for Independent Component Analysis","authors":"M. Nazemi, Shahin Nazarian, Massoud Pedram","doi":"10.1109/ASAP.2017.7995255","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995255","url":null,"abstract":"Independent Component Analysis (ICA) is a dimensionality reduction technique that can boost efficiency of machine learning models that deal with probability density functions, e.g. Bayesian neural networks. Algorithms that implement adaptive ICA converge slower than their nonadaptive counterparts, however, they are capable of tracking changes in underlying distributions of input features. This intrinsically slow convergence of adaptive methods combined with existing hardware implementations that operate at very low clock frequencies necessitate fundamental improvements in both algorithm and hardware design. This paper presents an algorithm that allows efficient hardware implementation of ICA. Compared to previous work, our FPGA implementation of adaptive ICA improves clock frequency by at least one order of magnitude and throughput by at least two orders of magnitude. Our proposed algorithm is not limited to ICA and can be used in various machine learning problems that use stochastic gradient descent optimization.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121924340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Design and implementation of adaptive signal processing systems using Markov decision processes
Lin Li, A. Sapio, Jiahao Wu, Yanzhou Liu, Kyunghun Lee, M. Wolf, S. Bhattacharyya
In this paper, we propose a novel framework, called the Hierarchical MDP framework for Compact System-level Modeling (HMCSM), for the design and implementation of adaptive embedded signal processing systems. The HMCSM framework applies Markov decision processes (MDPs) to enable autonomous adaptation of embedded signal processing under multidimensional constraints and optimization objectives. The framework integrates automated, MDP-based generation of optimal reconfiguration policies, dataflow-based application modeling, and embedded control software that carries out the generated reconfiguration policies. HMCSM systematically decomposes a complex, monolithic MDP into a set of separate MDPs that are connected hierarchically and operate more efficiently through this modularized structure. We demonstrate the effectiveness of our new MDP-based system design framework through experiments with an adaptive wireless communications receiver.
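Reconfiguration policies in such a flow are computed offline by solving an MDP. The sketch below runs generic value iteration on a small hypothetical two-state reconfiguration MDP; the hierarchical decomposition that HMCSM adds on top is not modeled here, and the states, actions, transition probabilities, and rewards are illustrative only.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, eps=1e-6):
    """P[a][s][s']: transition probabilities, R[s][a]: rewards.
    Returns the optimal policy (one action per state) and the state values."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.einsum("ast,t->sa", P, V)   # expected return per (s, a)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return Q.argmax(axis=1), V_new
        V = V_new

# Hypothetical MDP: state 0 = low load, state 1 = high load;
# action 0 = low-power configuration, action 1 = high-throughput configuration.
P = np.array([[[0.9, 0.1], [0.4, 0.6]],     # transitions under action 0
              [[0.8, 0.2], [0.1, 0.9]]])    # transitions under action 1
R = np.array([[ 1.0, -0.5],                 # R[s][a]: energy saved vs. work missed
              [-2.0,  0.5]])
policy, values = value_iteration(P, R)
print("reconfiguration policy:", policy)    # chosen configuration per load state
```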
{"title":"Design and implementation of adaptive signal processing systems using Markov decision processes","authors":"Lin Li, A. Sapio, Jiahao Wu, Yanzhou Liu, Kyunghun Lee, M. Wolf, S. Bhattacharyya","doi":"10.1109/ASAP.2017.7995275","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995275","url":null,"abstract":"In this paper, we propose a novel framework, called Hierarchical MDP framework for Compact System-level Modeling (HMCSM), for design and implementation of adaptive embedded signal processing systems. The HMCSM framework applies Markov decision processes (MDPs) to enable autonomous adaptation of embedded signal processing under multidimensional constraints and optimization objectives. The framework integrates automated, MDP-based generation of optimal reconfiguration policies, dataflow-based application modeling, and implementation of embedded control software that carries out the generated reconfiguration policies. HMCSM systematically decomposes a complex, monolithic MDP into a set of separate MDPs that are connected hierarchically, and that operate more efficiently through such a modularized structure. We demonstrate the effectiveness of our new MDP-based system design framework through experiments with an adaptive wireless communications receiver.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125576791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
reMinMin: A novel static energy-centric list scheduling approach based on real measurements
Achim Lösch, M. Platzner
Heterogeneous compute nodes, in the form of CPUs with attached GPU and FPGA accelerators, have gained strong interest in recent years. Applications differ in their execution characteristics and can therefore benefit from such heterogeneous resources in terms of performance or energy consumption. While performance optimization was long the only goal, research today increasingly focuses on techniques to minimize energy consumption due to rising electricity costs. This paper presents reMinMin, a novel static list scheduling approach for optimizing the total energy consumption of a set of tasks executed on a heterogeneous compute node. reMinMin is based on a new energy model that differentiates between static and dynamic energy components and covers the effects of accelerator tasks on the host CPU. The required energy values are obtained by measurements on the real computing system. To evaluate reMinMin, we compare it with two reference implementations on three task sets with different degrees of heterogeneity. In our experiments, reMinMin is consistently better than a scheduler that optimizes for dynamic energy only, which requires up to 19.43% more energy, and comes very close to optimal schedules.
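A minimal sketch of the energy-greedy MinMin selection underlying such list schedulers, assuming a per-task, per-resource energy table like the measured tables the paper builds. reMinMin's actual model additionally separates static from dynamic energy and accounts for host-CPU effects; this toy version ignores those as well as resource contention, and the task names and numbers are invented.

```python
def min_min_energy(energy):
    """energy: dict task -> dict resource -> joules.
    Repeatedly schedules the globally cheapest (task, resource) pair."""
    unscheduled = set(energy)
    schedule = []
    while unscheduled:
        # For each remaining task, find its cheapest resource ...
        best = {t: min(energy[t], key=energy[t].get) for t in unscheduled}
        # ... then commit the task whose cheapest assignment is cheapest overall.
        task = min(unscheduled, key=lambda t: energy[t][best[t]])
        schedule.append((task, best[task], energy[task][best[task]]))
        unscheduled.remove(task)
    return schedule

# Hypothetical measured energy values in joules per (task, resource) pair.
tasks = {
    "fft":    {"cpu": 4.1, "gpu": 1.8, "fpga": 0.9},
    "sort":   {"cpu": 2.0, "gpu": 2.4, "fpga": 3.1},
    "kmeans": {"cpu": 5.5, "gpu": 1.2, "fpga": 1.6},
}
print(min_min_energy(tasks))
```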
{"title":"reMinMin: A novel static energy-centric list scheduling approach based on real measurements","authors":"Achim Lösch, M. Platzner","doi":"10.1109/ASAP.2017.7995272","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995272","url":null,"abstract":"Heterogeneous compute nodes in form of CPUs with attached GPU and FPGA accelerators have strongly gained interested in the last years. Applications differ in their execution characteristics and can therefore benefit from such heterogeneous resources in terms of performance or energy consumption. While performance optimization has been the only goal for a long time, nowadays research is more and more focusing on techniques to minimize energy consumption due to rising electricity costs. This paper presents reMinMin, a novel static list scheduling approach for optimizing the total energy consumption for a set of tasks executed on a heterogeneous compute node. reMinMin bases on a new energy model that differentiates between static and dynamic energy components and covers effects of accelerator tasks on the host CPU. The required energy values are retrieved by measurements on the real computing system. In order to evaluate reMinMin, we compare it with two reference implementations on three task sets with different degrees of heterogeneity. In our experiments, MinMin is consistently better than a scheduler optimizing for dynamic energy only, which requires up to 19.43% more energy, and very close to optimal schedules.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127587051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5