
Latest publications: 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Noxim: An open, extensible and cycle-accurate network on chip simulator
V. Catania, Andrea Mineo, Salvatore Monteleone, M. Palesi, Davide Patti
Emerging on-chip communication technologies like wireless Networks-on-Chip (WiNoCs) have been proposed as candidate solutions for addressing the scalability limitations of conventional multi-hop NoC architectures. In a WiNoC, a subset of network nodes is equipped with a wireless interface that allows them to communicate over long ranges in a single hop. This paper presents Noxim, an open, configurable, extensible, cycle-accurate NoC simulator developed in SystemC that enables analysis of the performance and power figures of both conventional wired NoC and emerging WiNoC architectures.
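The single-hop advantage the abstract describes can be illustrated with a standalone Python sketch (this is not Noxim's SystemC API; the mesh size and hub placement are hypothetical): an XY-routed mesh path is compared with a wireless shortcut between hub-equipped nodes.

```python
# Illustrative sketch: multi-hop mesh routing vs. a single-hop wireless link.

def mesh_hops(src, dst):
    """XY-routing hop count between (x, y) nodes in a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def winoc_hops(src, dst, hubs):
    """Walk to the nearest wireless hub, take one radio hop to another hub,
    then walk to the destination; fall back to the wired path if shorter."""
    wired = mesh_hops(src, dst)
    via_radio = min(mesh_hops(src, h1) + 1 + mesh_hops(h2, dst)
                    for h1 in hubs for h2 in hubs if h1 != h2)
    return min(wired, via_radio)

hubs = [(0, 0), (7, 7)]                  # hypothetical hub placement
print(mesh_hops((0, 0), (7, 7)))         # 14 hops across the wired mesh
print(winoc_hops((0, 0), (7, 7), hubs))  # 1 hop over the wireless link
```

The gap between the two numbers grows linearly with mesh diameter, which is the scalability limitation wireless shortcuts target.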
DOI: 10.1109/ASAP.2015.7245728, pp. 162-163
Citations: 236
Programmable RNS lattice-based parallel cryptographic decryption
P. Martins, L. Sousa, J. Eynard, J. Bajard
Should quantum computing become viable, current public-key cryptographic schemes will no longer be secure. Since cryptosystems take many years to mature, research on post-quantum cryptography is now more important than ever. Herein, we focus on lattice-based cryptography as an alternative post-quantum approach and work to improve its efficiency. We combine several theoretical developments to produce an efficient implementation that solves the Closest Vector Problem (CVP) on Goldreich-Goldwasser-Halevi (GGH)-like cryptosystems based on the Residue Number System (RNS). Compared to a single-core optimized implementation, we achieve speed-ups of up to 5.9 and 11.2 on the GTX 780 Ti and i7 4770K devices, respectively. Finally, we show that the proposed implementation is a competitive alternative to Rivest-Shamir-Adleman (RSA).
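The RNS representation this implementation builds on can be sketched in a few lines of Python (the moduli below are small illustrative choices, not the paper's parameters): a large integer becomes a tuple of residues modulo pairwise-coprime bases, so multiplication decomposes into small, independent, hence parallelizable operations.

```python
# Minimal Residue Number System sketch: parallel channel-wise arithmetic
# plus Chinese Remainder Theorem reconstruction.
from math import prod

BASE = [13, 17, 19, 23]  # pairwise-coprime moduli (illustrative only)
M = prod(BASE)           # dynamic range of the representation

def to_rns(x):
    return [x % m for m in BASE]

def rns_mul(a, b):
    # each channel is independent: this is where GPU parallelism applies
    return [(ai * bi) % m for ai, bi, m in zip(a, b, BASE)]

def from_rns(r):
    """CRT reconstruction of the integer from its residues."""
    x = 0
    for ri, m in zip(r, BASE):
        Mi = M // m
        x += ri * Mi * pow(Mi, -1, m)  # modular inverse (Python 3.8+)
    return x % M

a, b = 1234, 321
assert from_rns(rns_mul(to_rns(a), to_rns(b))) == (a * b) % M
```

In a real lattice decryption kernel the bases are much larger and numerous, but the channel independence shown here is what maps onto GPU threads.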
DOI: 10.1109/ASAP.2015.7245723, pp. 149-153
Citations: 8
Stochastic circuit design and performance evaluation of vector quantization
Ran Wang, Jie Han, B. Cockburn, D. Elliott
Vector quantization (VQ) is a general data compression technique that has a scalable implementation complexity and potentially a high compression ratio. In this paper, a novel implementation of VQ using stochastic circuits is proposed and its performance is evaluated. The stochastic and binary designs are compared for the same compression quality, and the circuits are synthesized for an industrial 28-nm cell library. The effects of varying the sequence length of the stochastic design are studied with respect to the performance metric of throughput per area (TPA). When a shortened 512-bit encoding sequence is used to obtain a lower-quality compression, the TPA is about 2.60 times that of a binary implementation of the same quality, as measured by the L1 norm error (i.e., the first-order error). Thus, the stochastic implementation outperforms the conventional binary design in terms of TPA for a relatively low compression quality. By exploiting the progressive precision feature of a stochastic circuit, a readily scalable processing quality can be attained by simply halting the computation after different numbers of clock cycles.
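The stochastic-computing principle behind the design can be shown in a small Python model (a software sketch only; the paper's circuits are more involved): a value p in [0, 1] is encoded as a random bitstream with P(bit = 1) = p, and a single AND gate multiplies two independent streams. The 512-bit length mirrors the shortened sequence mentioned in the abstract.

```python
# Stochastic multiplication: AND two Bernoulli bitstreams, count ones.
import random

def to_stream(p, n, rng):
    """Encode probability p as an n-bit stochastic bitstream."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def stochastic_mul(p, q, n=512, seed=1):
    rng = random.Random(seed)      # seeded for reproducibility
    a = to_stream(p, n, rng)
    b = to_stream(q, n, rng)
    # AND gate plus counter: the ones-density of the result estimates p * q
    return sum(x & y for x, y in zip(a, b)) / n

est = stochastic_mul(0.5, 0.5)     # approximates 0.25; accuracy grows with n
```

Halting after fewer bits gives a coarser but still usable estimate, which is the progressive-precision property the abstract exploits.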
DOI: 10.1109/ASAP.2015.7245717, pp. 111-115
Citations: 8
Large-scale packet classification on FPGA
Shijie Zhou, Yun Qu, V. Prasanna
Packet classification is a key network function enabling a variety of network applications, such as network security, Quality of Service (QoS) routing, and other value-added services. Routers perform packet classification based on a predefined rule set. Packet classification faces two challenges: (1) the data rate of network traffic keeps increasing, and (2) rule sets are becoming very large. In this paper, we propose an FPGA-based packet classification engine for large rule sets. We present a decomposition-based approach, where each field of the packet header is searched separately. We then merge the partial search results from all the fields using a merging network. Experimental results show that our design can achieve a throughput of 147 Million Packets Per Second (MPPS) while supporting up to 256K rules on a state-of-the-art FPGA. Compared to prior work on FPGAs or multi-core processors, our design demonstrates significant performance improvements.
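The decomposition approach from the abstract can be sketched in Python (the two-field rule set is a toy example; the paper merges partial results with a hardware merging network rather than set intersection):

```python
# Decomposition-based classification: search each field independently,
# then merge the per-field match sets and pick the highest-priority rule.

RULES = [  # (rule id, (source prefix, destination port)), hypothetical
    (0, ("10.0.",  80)),
    (1, ("10.0.", 443)),
    (2, ("192.",   80)),
]

def search_src(ip):
    """Partial result for the source-address field."""
    return {rid for rid, (pfx, _) in RULES if ip.startswith(pfx)}

def search_port(port):
    """Partial result for the destination-port field."""
    return {rid for rid, (_, p) in RULES if p == port}

def classify(ip, port):
    matches = search_src(ip) & search_port(port)  # merge partial results
    return min(matches) if matches else None      # lowest id = highest priority

print(classify("10.0.1.5", 443))  # → 1
```

Because each field search is independent, the per-field engines can run in parallel, which is what makes the scheme attractive in hardware.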
DOI: 10.1109/ASAP.2015.7245738, pp. 226-233
Citations: 11
LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs
Yongchao Liu, B. Schmidt
Compressed sparse row (CSR) is a frequently used format for sparse matrix storage. However, the state-of-the-art CSR-based sparse matrix-vector multiplication (SpMV) implementations on CUDA-enabled GPUs do not exhibit very high efficiency. This has motivated the development of some alternative storage formats for GPU computing. Unfortunately, these alternatives are incompatible with most CPU-centric programs and require dynamic conversion from CSR at runtime, thus incurring significant computational and storage overheads. We present LightSpMV, a novel CUDA-compatible SpMV algorithm using the standard CSR format, which achieves high speed by benefiting from the fine-grained dynamic distribution of matrix rows over warps/vectors. In LightSpMV, two dynamic row distribution approaches have been investigated at the vector and warp levels with atomic operations and warp shuffle functions as the fundamental building blocks. We have evaluated LightSpMV using various sparse matrices and further compared it to the CSR-based SpMV subprograms in the state-of-the-art CUSP and cuSPARSE libraries. Performance evaluation reveals that on the same Tesla K40c GPU, LightSpMV is superior to both CUSP and cuSPARSE, with a speedup of up to 2.60 and 2.63 over CUSP, and up to 1.93 and 1.79 over cuSPARSE for single and double precision, respectively. LightSpMV is available at http://lightspmv.sourceforge.net.
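For reference, the CSR kernel in question is the following row-wise loop (a sequential Python sketch; LightSpMV's contribution is distributing exactly these rows dynamically over GPU warps and vectors):

```python
# Sparse matrix-vector multiply y = A @ x with A in CSR form.

def spmv_csr(row_ptr, col_idx, values, x):
    """row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y

# 3x3 matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR form:
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values  = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```

The irregular inner-loop length per row is the load-imbalance problem that static row-to-thread assignments handle poorly and dynamic distribution addresses.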
DOI: 10.1109/ASAP.2015.7245713, pp. 82-89
Citations: 41
An IEEE 754 double-precision floating-point multiplier for denormalized and normalized floating-point numbers
S. Thompson, J. Stine
This paper discusses an optimized double-precision floating-point multiplier that can handle both denormalized and normalized IEEE 754 floating-point numbers. The optimizations are discussed and compared against similar implementations; however, the main objective is remaining compliant for denormalized IEEE 754 floating-point numbers while still maintaining high-performance operation on normalized numbers.
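A short Python sketch shows what makes denormals special and why a multiplier must treat them differently: below roughly 2.2e-308 a double's exponent field is zero, the implicit leading 1 of the significand disappears, and values shrink toward zero gradually instead of being flushed.

```python
# Decode IEEE 754 double-precision bit fields to distinguish
# normalized from denormalized (subnormal) values.
import struct

def fields(x):
    """Return (sign, biased exponent, mantissa bits) of a double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

normal = 1.5
denorm = 5e-324                 # smallest positive denormal double
assert fields(normal)[1] != 0   # normalized: nonzero exponent field
assert fields(denorm)[1] == 0   # denormal: exponent field is zero
assert denorm > 0.0             # representable, not flushed to zero
print(fields(denorm))           # (0, 0, 1)
```

Hardware that assumes the implicit leading 1 mishandles exactly these exponent-field-zero cases, which is the compliance gap the multiplier closes.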
DOI: 10.1109/ASAP.2015.7245706, pp. 62-63
Citations: 7
An efficient architecture solution for low-power real-time background subtraction
H. Tabkhi, Majid Sabbagh, G. Schirner
Embedded vision is a rapidly growing market with a host of challenging algorithms. Among vision algorithms, Mixture of Gaussians (MoG) background subtraction is a frequently used kernel involving massive computation and communication. Tremendous challenges need to be resolved to meet MoG's high computation and communication demands at a power consumption low enough for embedded deployment. This paper proposes a customized architecture for a power-efficient realization of MoG background subtraction operating at Full-HD resolution. Our design process benefits from system-level design principles: an SLDL-captured specification (the result of high-level explorations) serves as the specification for architecture realization and hand-crafted RTL design. To optimize the architecture, this paper employs a set of optimization techniques including parallelism extraction, algorithm tuning, operation width sizing, and deep pipelining. The final MoG implementation consists of 77 pipeline stages operating at 148.5 MHz implemented on a Zynq-7000 SoC. Furthermore, our background subtraction solution is flexible, allowing end users to adjust algorithm parameters according to scene complexity. Our results demonstrate very high efficiency for both indoor and outdoor scenes, with 145 mW on-chip power consumption and more than 600× speedup over software execution on an ARM Cortex-A9 core.
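A heavily simplified Python sketch conveys the per-pixel idea (one Gaussian per pixel instead of the full mixture, and made-up alpha/k parameters): a pixel is foreground when it deviates from its running model by more than k standard deviations, and only background samples update the model. The paper implements the full multi-Gaussian version as a 77-stage hardware pipeline.

```python
# Single-Gaussian background test per pixel (simplified MoG stand-in).

def update_pixel(mean, var, value, alpha=0.05, k=2.5):
    """Return updated (mean, var, is_foreground) for one pixel sample."""
    foreground = (value - mean) ** 2 > (k ** 2) * var
    if not foreground:  # only background samples adapt the model
        mean = (1 - alpha) * mean + alpha * value
        var = (1 - alpha) * var + alpha * (value - mean) ** 2
    return mean, var, foreground

mean, var = 100.0, 20.0
for v in [101, 99, 100, 102]:            # static scene: stays background
    mean, var, fg = update_pixel(mean, var, v)
    assert not fg
_, _, fg = update_pixel(mean, var, 200)  # sudden bright object
assert fg
```

Every pixel runs this independently, which is why the computation parallelizes and pipelines so well in hardware.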
DOI: 10.1109/ASAP.2015.7245737, pp. 218-225
Citations: 5
How can Garbage Collection be energy efficient by dynamic offloading?
Jie Tang, Chen Liu, J. Gaudiot
Garbage Collection (GC) is still a major issue in the JVM for both mobile and cluster computing. GC offloading has been proposed to improve GC performance by delivering part or all of its operations to dedicated GC hardware. However, traditional offloading delivers work directly, without considering the phases of GC behavior, which fall into two groups: minor GC and major GC. Minor GC is fast and frequently invoked, while major GC is expensive in time but seldom takes place. Direct offloading makes the GC workload hop frequently between the main processor and the GC hardware, introducing a noticeable overhead and offsetting any possible benefits of offloading. To solve this issue, we propose to offload GC dynamically through a careful selection of profitable and harmful GC operations. We also present a case study on Apache Spark, a lightning-fast cluster computing platform. It shows that dynamic offloading can yield nearly 42.6% performance improvement with a concurrent 32.1% reduction in energy cost.
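The selection idea can be reduced to a toy cost model in Python (the latency and speedup constants are assumptions for illustration, not measurements from the paper): offload a collection only when the time saved by the faster GC engine outweighs the fixed cost of hopping to it, which naturally keeps cheap, frequent minor collections local.

```python
# Toy offload policy: is this GC phase profitable to ship to GC hardware?

OFFLOAD_LATENCY_US = 200  # assumed fixed cost of hopping to the GC engine
HW_SPEEDUP = 4.0          # assumed speedup of the dedicated GC hardware

def should_offload(gc_kind, expected_cost_us):
    """Offload long-running major collections; keep frequent, cheap
    minor collections on the main processor to avoid hop overhead."""
    saved = expected_cost_us - expected_cost_us / HW_SPEEDUP
    return gc_kind == "major" and saved > OFFLOAD_LATENCY_US

assert not should_offload("minor", 100)  # frequent and cheap: stay local
assert should_offload("major", 50_000)   # rare and expensive: offload
```

A real policy would also weigh energy, but the shape of the decision is the same: profitability per operation, not blanket offloading.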
DOI: 10.1109/ASAP.2015.7245725, pp. 156-157
Citations: 0
Mixed-signal implementation of differential decoding using binary message passing algorithms
G. Cowan, Kevin Cushon, W. Gross
This paper presents the mixed-signal circuit implementation of reduced complexity algorithms for decoding low-density parity check (LDPC) codes. Based on modified differential decoding using binary message passing (MDD-BMP), binary addition using discrete-time digital circuits is replaced by continuous-time analog-current summation. Potential degradation due to the mismatch between current sources, P/N strength mismatch and inverter-threshold mismatch is considered in behavioural simulation and shown to be tolerable. Area estimates suggest a reduction from 0.27 mm2 to 0.11 mm2 for the FG(273, 191) code. Finally, transistor level simulation of the FG(273, 191) code using TSMC 65 nm technology shows an efficiency of 0.56 pJ/bit.
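The binary-message-passing family this circuit accelerates can be sketched in software as a simple bit-flipping decoder (the parity-check matrix below is a toy 5-bit cycle code, not the paper's FG(273, 191) code, and MDD-BMP itself is more refined): each iteration computes the syndrome, then flips the bit participating in the most unsatisfied checks. The check-node additions in this loop are what the analog design replaces with continuous-time current summation.

```python
# Hard-decision bit-flipping decoder over a toy parity-check matrix.

H = [  # 5 checks x 5 bits: incidence matrix of a 5-cycle
    [1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
]

def decode(word, max_iters=10):
    w = list(word)
    n = len(w)
    for _ in range(max_iters):
        syndrome = [sum(h[j] & w[j] for j in range(n)) % 2 for h in H]
        if not any(syndrome):
            return w  # all parity checks satisfied
        # flip the bit involved in the most unsatisfied checks
        unsat = [sum(s for h, s in zip(H, syndrome) if h[j])
                 for j in range(n)]
        w[unsat.index(max(unsat))] ^= 1
    return w

assert decode([0, 0, 1, 0, 0]) == [0, 0, 0, 0, 0]  # single error corrected
```

Counting unsatisfied checks per bit is a small binary addition repeated across the whole code, which is why replacing it with an analog sum pays off in area.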
DOI: 10.1109/ASAP.2015.7245718, pp. 116-119
Citations: 1
On-demand fault-tolerant loop processing on massively parallel processor arrays
Alexandru Tanase, Michael Witterauf, J. Teich, Frank Hannig, Vahid Lari
We present a compilation-based technique for providing on-demand structural redundancy for massively parallel processor arrays. Thereby, application programmers gain the capability to trade throughput for reliability according to application requirements. To protect parallel loop computations against errors, we propose to apply the well-known fault tolerance schemes dual modular redundancy (DMR) and triple modular redundancy (TMR) to a whole region of the processor array rather than individual processing elements. At the source code level, the compiler realizes these replication schemes with a program transformation that: (1) replicates a parallel loop program two or three times for DMR or TMR, respectively, and (2) introduces appropriate voting operations whose frequency and location may be chosen from three proposed variants. Which variant to choose depends, for example, on the error resilience needs of the application or the expected soft error rates. Finally, we explore the different tradeoffs of these variants in terms of performance overheads and error detection latency.
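The TMR voting the transformation introduces can be illustrated with a minimal Python sketch (the fault model here, one replica returning a corrupted value, is hypothetical): three replicas compute each result and a majority vote masks a single faulty replica, whereas DMR with two replicas can only detect a mismatch, not correct it.

```python
# Majority voting over three replica outputs (TMR).

def tmr_vote(a, b, c):
    """Return the majority value; if all three disagree, b is returned
    arbitrarily (an unmaskable multi-replica fault)."""
    return a if a == b or a == c else b

def ok(x):
    return x * 2          # correct replica of the loop body

def faulty(x):
    return x * 2 + 1      # replica hit by a soft error (hypothetical)

data = [1, 2, 3]
results = [tmr_vote(ok(x), faulty(x), ok(x)) for x in data]
assert results == [2, 4, 6]  # the corrupted replica is outvoted
```

Choosing how often to vote (the paper's three variants) trades voting overhead against how long an error can propagate before detection.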
DOI: 10.1109/ASAP.2015.7245734, pp. 194-201
Citations: 7