
2012 IEEE Conference on High Performance Extreme Computing: Latest Publications

Graph programming model: An efficient approach for sensor signal processing
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408662
Steve Kirsch
The HPC community has struggled to find a parallel programming model or language that can efficiently expose algorithmic parallelism in a sequential program and automate the implementation of a highly efficient parallel program. A plethora of parallel programming languages have been developed along with sophisticated compilers and runtimes, but none of these approaches have been successful enough to become a de facto standard. The Graph Programming Model has the capability and efficiency to become that ubiquitous standard for the signal processing domain.
Citations: 0
An update on SIPHER (Scalable Implementation of Primitives for Homomorphic EncRyption) — FPGA implementation using Simulink
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408672
D. Cousins, K. Rohloff, Chris Peikert, R. Schantz
Accelerating the development of a practical Fully Homomorphic Encryption (FHE) scheme is the goal of the DARPA PROCEED program. For the past year, this program has had as its focus the acceleration of various aspects of the FHE concept toward practical implementation and use. FHE would be a game-changing technology to enable secure, general computation on encrypted data, e.g., on untrusted off-site hardware. However, FHE will still require several orders of magnitude improvement in computation before it will be practical for widespread use. Recent theoretical breakthroughs demonstrated the existence of FHE schemes [1, 2], and to date much progress has been made in both algorithmic and implementation improvements. Specifically our contribution to the Proceed program has been the development of FPGA based hardware primitives to accelerate the computation on encrypted data using FHE based on lattice techniques [3]. Our project, SIPHER, has been using a state of the art tool-chain developed by Mathworks to implement VHDL code for FPGA circuits directly from Simulink models. Our baseline Homomorphic Encryption prototypes are developed directly in Matlab using the fixed point toolbox to perform the required integer arithmetic. Constant improvements in algorithms require us to be able to quickly implement them in a high level language such as Matlab. We reported on our initial results at HPEC 2011 [4]. In the past year, increases in algorithm complexity have introduced several new design requirements for our FPGA implementation. This report presents new Simulink primitives that had to be developed to deal with these new requirements.
Citations: 40
Scrubbing optimization via availability prediction (SOAP) for reconfigurable space computing
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408673
Quinn Martin, A. George
Reconfigurable computing with FPGAs can be highly effective in terms of performance, adaptability, and power for accelerating space applications, but their configuration memory must be scrubbed to prevent the accumulation of single-event upsets. Many scrubbing techniques currently exist, each with different advantages, making it difficult for the system designer to choose the optimal scrubbing strategy for a given mission. This paper surveys the currently available scrubbing techniques and introduces the SOAP method for predicting system availability for various scrubbing strategies using Markov models. We then apply the method to compare hypothetical Virtex-5 and Virtex-6 systems for blind, CRC-32, and Frame ECC scrubbing strategies in LEO and HEO. We show that availability in excess of 5 nines can be obtained with modern, FPGA-based systems using scrubbing. Furthermore, we show the value of the SOAP method by observing that different scrubbing strategies are optimal for different types of missions.
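The abstract does not reproduce the paper's Markov models, so the following is only a minimal sketch of the kind of availability arithmetic involved, assuming a two-state continuous-time Markov chain in which configuration upsets arrive at rate lambda and the scrubbing strategy repairs them at rate mu, giving steady-state availability A = mu / (lambda + mu). The rates, strategy labels, and helper names are illustrative assumptions, not values or code from the paper.

```python
import math

def availability(lambda_upset, mu_scrub):
    """Steady-state availability of a two-state up/down Markov chain: A = mu / (lambda + mu)."""
    return mu_scrub / (lambda_upset + mu_scrub)

def nines(a):
    """Express availability as a 'number of nines', e.g. 0.99999 -> 5.0."""
    return -math.log10(1.0 - a)

if __name__ == "__main__":
    lam = 0.01  # assumed upset rate: one configuration upset per 100 hours
    # Assumed repair rates implied by two hypothetical scrubbing strategies.
    for label, mu in [("slow scrub", 10.0), ("fast scrub", 1000.0)]:
        a = availability(lam, mu)
        print(f"{label:12s} A = {a:.8f} ({nines(a):.1f} nines)")
```

Under this toy model, comparing scrubbing strategies reduces to comparing the repair rates they imply; a fuller model of the kind the paper describes would also capture detection coverage and scrub duration.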
Citations: 21
Parallel search of k-nearest neighbors with synchronous operations
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408667
N. Sismanis, N. Pitsianis, Xiaobai Sun
We present a new study of parallel algorithms for locating k-nearest neighbors (kNN) of each single query in a high dimensional (feature) space on a many-core processor or accelerator that favors synchronous operations, such as on a graphics processing unit. Exploiting the intimate relationships between two primitive operations, select and sort, we introduce a cohort of truncated sort algorithms for parallel kNN search. The truncated bitonic sort (TBiS) in particular has desirable data locality, synchronous concurrency and simple data and program structures. Its implementation on a graphics processing unit outperforms the other existing implementations for kNN search based on either sort or select operations. We provide algorithm analysis and experimental results.
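The abstract does not spell out the truncated bitonic sort itself, so the sketch below only illustrates the general truncated-merge idea behind bitonic top-k selection: sort fixed-size blocks, then repeatedly combine pairs of sorted blocks while keeping just the k smallest candidates via a single bitonic merge step. Function names are assumptions, the initial block sort uses Python's `sorted` for brevity where a GPU kernel would use a bitonic sorting network, and this is not the authors' implementation.

```python
import math

def bitonic_merge_ascending(x):
    """Sort a bitonic sequence x (length a power of two) into ascending order."""
    n = len(x)
    gap = n // 2
    while gap:
        for i in range(n):
            if i & gap == 0 and x[i] > x[i + gap]:
                x[i], x[i + gap] = x[i + gap], x[i]
        gap //= 2
    return x

def truncated_topk(values, k):
    """Return the k smallest values in ascending order (k must be a power of two)."""
    assert k > 0 and k & (k - 1) == 0, "k must be a power of two for the merge network"
    pad = (-len(values)) % k                      # pad so the data splits into blocks of size k
    data = list(values) + [math.inf] * pad
    # Sort each block; a GPU kernel would use a bitonic sorting network here.
    blocks = [sorted(data[i:i + k]) for i in range(0, len(data), k)]
    while len(blocks) > 1:
        a, b = blocks.pop(), blocks.pop()
        # Pointwise min of one ascending block and the other read in reverse is a
        # bitonic sequence holding the k smallest of the 2k elements ("truncation").
        candidate = [min(a[i], b[k - 1 - i]) for i in range(k)]
        blocks.append(bitonic_merge_ascending(candidate))
    return blocks[0]

if __name__ == "__main__":
    import random
    distances = [random.random() for _ in range(1000)]   # e.g. distances to one query
    assert truncated_topk(distances, 8) == sorted(distances)[:8]
    print(truncated_topk(distances, 8))
```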
Citations: 49
STINGER: High performance data structure for streaming graphs
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408680
David Ediger, R. McColl, E. J. Riedy, David A. Bader
The current research focus on “big data” problems highlights the scale and complexity of analytics required and the high rate at which data may be changing. In this paper, we present our high performance, scalable and portable software, Spatio-Temporal Interaction Networks and Graphs Extensible Representation (STINGER), that includes a graph data structure that enables these applications. Key attributes of STINGER are fast insertions, deletions, and updates on semantic graphs with skewed degree distributions. We demonstrate a process of algorithmic and architectural optimizations that enable high performance on the Cray XMT family and Intel multicore servers. Our implementation of STINGER on the Cray XMT processes over 3 million updates per second on a scale-free graph with 537 million edges.
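STINGER's actual layout is not described in the abstract; as a rough, assumed illustration of why a blocked adjacency structure supports fast streaming insertions and deletions, here is a simplified sketch in which each vertex owns a list of fixed-size edge blocks and an update touches a single slot instead of reallocating the whole neighbor list. The block size and field names are invented for illustration and are not the STINGER API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

EDGE_BLOCK_SIZE = 4   # assumed; a real implementation would use much larger blocks

@dataclass
class EdgeBlock:
    dest: List[Optional[int]] = field(default_factory=lambda: [None] * EDGE_BLOCK_SIZE)
    weight: List[int] = field(default_factory=lambda: [0] * EDGE_BLOCK_SIZE)

class BlockedGraph:
    """Toy blocked adjacency structure: an update touches one slot in one block,
    so inserts and deletes never reallocate a vertex's whole neighbor list."""

    def __init__(self, num_vertices):
        self.adj: List[List[EdgeBlock]] = [[] for _ in range(num_vertices)]

    def insert_edge(self, src, dst, w=1):
        for blk in self.adj[src]:                      # existing edge: update weight in place
            for i, d in enumerate(blk.dest):
                if d == dst:
                    blk.weight[i] += w
                    return
        for blk in self.adj[src]:                      # reuse a hole left by a deletion
            for i, d in enumerate(blk.dest):
                if d is None:
                    blk.dest[i], blk.weight[i] = dst, w
                    return
        blk = EdgeBlock()                              # all blocks full: append a new one
        blk.dest[0], blk.weight[0] = dst, w
        self.adj[src].append(blk)

    def delete_edge(self, src, dst):
        for blk in self.adj[src]:
            for i, d in enumerate(blk.dest):
                if d == dst:
                    blk.dest[i], blk.weight[i] = None, 0
                    return

    def neighbors(self, src):
        for blk in self.adj[src]:
            for i, d in enumerate(blk.dest):
                if d is not None:
                    yield d, blk.weight[i]

if __name__ == "__main__":
    g = BlockedGraph(10)
    g.insert_edge(0, 1); g.insert_edge(0, 2); g.insert_edge(0, 1)
    g.delete_edge(0, 2)
    print(list(g.neighbors(0)))        # [(1, 2)]
```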
Citations: 204
Synthetic Aperture Radar on low power multi-core Digital Signal Processor
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408665
Dan Wang, Murtaza Ali
Commercial off-the-shelf (COTS) components have recently gained popularity in Synthetic Aperture Radar (SAR) applications. The compute capabilities of these devices have advanced to a level where real-time processing of complex SAR algorithms has become feasible. In this paper, we focus on a low power multi-core Digital Signal Processor (DSP) from Texas Instruments Inc. and evaluate its capability for SAR signal processing. The specific DSP studied here is an eight-core device, codenamed TMS320C6678, that provides a peak performance of 128 GFLOPS (single precision) for only 10 watts. We describe how the basic SAR operations can be implemented efficiently in such a device. Our results indicate that a baseline SAR range-Doppler algorithm takes around 0.25 seconds for a 16 M (4K × 4K) image, achieving real-time performance.
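The implementation details are not in the abstract, so the following is a schematic numpy sketch of two core steps of a basic range-Doppler chain: frequency-domain range compression with a matched filter, followed by an azimuth FFT into the range-Doppler domain. Range cell migration correction and azimuth compression, which a focused image requires, are omitted, and the chirp parameters and array sizes are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

def range_doppler_sketch(raw, range_chirp):
    """raw: (num_pulses, num_range_samples) complex baseband samples.
    range_chirp: replica of the transmitted chirp (len <= num_range_samples).
    Returns range-compressed data in the range-Doppler domain; RCMC and
    azimuth compression, needed for a focused image, are omitted here."""
    n_rng = raw.shape[1]
    # Range compression: multiply by the conjugate chirp spectrum (matched filter).
    chirp_spec = np.conj(np.fft.fft(range_chirp, n=n_rng))
    range_compressed = np.fft.ifft(np.fft.fft(raw, axis=1) * chirp_spec[None, :], axis=1)
    # Azimuth FFT along slow time moves the data into the range-Doppler domain.
    return np.fft.fft(range_compressed, axis=0)

if __name__ == "__main__":
    # Illustrative sizes only; the paper reports a 4K x 4K image in about 0.25 s on the DSP.
    pulses, samples = 256, 512
    t = np.arange(128) / 128.0
    chirp = np.exp(1j * np.pi * 100.0 * t ** 2)       # assumed linear FM chirp replica
    rng = np.random.default_rng(0)
    raw = rng.standard_normal((pulses, samples)) + 1j * rng.standard_normal((pulses, samples))
    print(range_doppler_sketch(raw, chirp).shape)     # (256, 512)
```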
Citations: 26
A third generation many-core processor for secure embedded computing systems
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408657
John Irza, Michael Doerr, Michael Solka
As compute-intensive products proliferate, there is an ever growing need to provide security features to detect tampering, identify cloned or counterfeit hardware, and deter cybersecurity threats. This paper describes the security features of the third generation 100-core HyperX™ processor which addresses these needs. Programmable security barriers allow the processor to implement a red-black System on Chip solution. The implementation of Physically Unclonable Functions (PUFs), encryption/decryption engines, a secure boot controller, and anti-tamper features enable the engineer to realize a secure embedded computing solution in an ultra-low power, many-core, C programmable processor-memory network.
Citations: 3
Accelerating fully homomorphic encryption using GPU
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408660
Wei Wang, Yin Hu, Lianmu Chen, Xinming Huang, B. Sunar
As a major breakthrough, in 2009 Gentry introduced the first plausible construction of a fully homomorphic encryption (FHE) scheme. FHE allows the evaluation of arbitrary functions directly on encrypted data on untrusted servers. In 2010, Gentry and Halevi presented the first FHE implementation on an IBM x3500 server. However, this implementation remains impractical due to the high latency of encryption and recryption. The Gentry-Halevi (GH) FHE primitives utilize multi-million-bit modular multiplications and additions, which are time-consuming tasks for a general purpose computer. In the GH-FHE implementation, the most computationally intensive arithmetic operation is modular multiplication. In this paper, the million-bit modular multiplication is computed in two steps. For large number multiplication, Strassen's FFT-based algorithm is employed and accelerated on a graphics processing unit (GPU) through its massive parallelism. Subsequently, the Barrett modular reduction algorithm is applied to implement modular reduction. As an experimental study, we implement the GH-FHE primitives for the small setting with a dimension of 2048 on an NVIDIA C2050 GPU. The experimental results show speedup factors of 7.68, 7.4 and 6.59 for encryption, decryption and recrypt respectively, when compared with the existing CPU implementation.
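The abstract names two textbook building blocks, FFT-based large-number multiplication and Barrett modular reduction, without giving code. The sketch below shows Barrett reduction over Python's arbitrary-precision integers (which hide the multi-word arithmetic a GPU kernel must implement explicitly), with the built-in multiplication standing in for Strassen's FFT multiplier; function names and the 2048-bit test modulus are illustrative assumptions.

```python
def barrett_setup(m):
    """Precompute the Barrett constant mu = floor(4**k / m) for a modulus m < 2**k."""
    k = m.bit_length()
    return k, (1 << (2 * k)) // m

def barrett_reduce(x, m, k, mu):
    """Reduce 0 <= x < m*m modulo m using shifts and multiplies instead of division."""
    q = ((x >> (k - 1)) * mu) >> (k + 1)   # estimate of x // m, low by at most 2
    r = x - q * m
    while r >= m:                          # at most two correction subtractions
        r -= m
    return r

def modmul(a, b, m, k, mu):
    """Modular multiplication: the product a*b is the expensive step that the paper
    accelerates on the GPU with Strassen's FFT-based multiplier; Python's built-in
    big-integer multiply stands in for it here."""
    return barrett_reduce(a * b, m, k, mu)

if __name__ == "__main__":
    import random
    m = random.getrandbits(2048) | (1 << 2047) | 1     # illustrative 2048-bit odd modulus
    k, mu = barrett_setup(m)
    a, b = random.randrange(m), random.randrange(m)
    assert modmul(a, b, m, k, mu) == (a * b) % m
    print("Barrett reduction agrees with (a * b) % m")
```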
Citations: 115
Driving big data with big compute
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408678
C. Byun, W. Arcand, David Bestor, Bill Bergeron, M. Hubbell, J. Kepner, A. McCabe, P. Michaleas, J. Mullen, David O'Gwynn, Andrew Prout, A. Reuther, Antonio Rosa, Charles Yee
Big Data (as embodied by Hadoop clusters) and Big Compute (as embodied by MPI clusters) provide unique capabilities for storing and processing large volumes of data. Hadoop clusters make distributed computing readily accessible to the Java community and MPI clusters provide high parallel efficiency for compute-intensive workloads. Bringing the big data and big compute communities together is an active area of research. The LLGrid team has developed and deployed a number of technologies that aim to provide the best of both worlds. LLGrid MapReduce allows the map/reduce parallel programming model to be used quickly and efficiently in any language on any compute cluster. D4M (Dynamic Distributed Dimensional Data Model) provides a high-level distributed array interface to the Apache Accumulo database. The accessibility of these technologies is assessed by measuring the effort required to use them, typically a few lines of code. The performance is assessed by measuring the insert rate into the Accumulo database. Using these tools, a database insert rate of 4M inserts/second has been achieved on an 8-node cluster.
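The LLGrid MapReduce and D4M interfaces are not shown in the abstract; purely as a generic illustration of the "few lines of code" map/reduce idiom it refers to (not the LLGrid or D4M API), here is a word-count map/reduce over a standard Python multiprocessing pool.

```python
from multiprocessing import Pool
from collections import Counter
from functools import reduce

def map_phase(chunk):
    """Map step: count words in one chunk of lines (one worker's share of the input)."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

def reduce_phase(a, b):
    """Reduce step: merge two partial word counts."""
    a.update(b)
    return a

if __name__ == "__main__":
    lines = ["big data big compute", "big compute drives big data"] * 1000
    chunks = [lines[i::4] for i in range(4)]   # split the input across 4 workers
    with Pool(processes=4) as pool:
        partials = pool.map(map_phase, chunks)
    totals = reduce(reduce_phase, partials, Counter())
    print(totals.most_common(3))
```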
Citations: 48
Fast functional simulation with a dynamic language
Pub Date: 2012-09-01 | DOI: 10.1109/HPEC.2012.6408664
C. Steele, J. Bonn
Simulation of large computational systems-on-a-chip (SoCs) is increasingly challenging as the number and complexity of components are scaled up. With the ubiquity of programmable components in computational SoCs, fast functional instruction-set simulation (ISS) is increasingly important. Much ISS has been done with straightforward functional models of a non-pipelined fetch-decode-execute iteration written in a low-to-mid-level C-family static language, delivering mid-level efficiency. Some ISS programs, such as QEMU, perform dynamic binary translation to allow software emulation to reach more usable speeds. This relatively complex methodology has not been widely adopted for system modeling. We demonstrate a fresh approach to ISS that achieves performance comparable to a fast dynamic binary translator by exploiting recent advances in just-in-time (JIT) compilers for dynamic languages, such as JavaScript and Lua, together with a specific programming idiom inspired by pipelined processor design. We believe that this approach is relatively accessible to system designers familiar with C-family functional simulator coding styles, and may be generally useful for fast modeling of complex SoC components.
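The paper's Lua/JavaScript idiom is not reproduced in the abstract; for concreteness, here is a minimal non-pipelined fetch-decode-execute loop for a made-up accumulator ISA, the baseline style of functional ISS the abstract describes. The ISA, encoding, and names are illustrative assumptions; a tracing JIT for a dynamic language would specialize exactly this kind of hot interpreter loop.

```python
# Toy functional ISS: a made-up accumulator ISA with four instructions.
# Assumed encoding: each instruction is a (opcode, operand) pair.
LOADI, ADDI, JNZ, HALT = range(4)

def run(program, max_steps=10_000):
    """Non-pipelined fetch-decode-execute loop over the instruction stream."""
    acc, pc, steps = 0, 0, 0
    while steps < max_steps:
        op, arg = program[pc]              # fetch
        if op == LOADI:                    # decode + execute
            acc, pc = arg, pc + 1
        elif op == ADDI:
            acc, pc = acc + arg, pc + 1
        elif op == JNZ:                    # jump to absolute target if acc != 0
            pc = arg if acc != 0 else pc + 1
        elif op == HALT:
            return acc
        steps += 1
    raise RuntimeError("step budget exhausted")

if __name__ == "__main__":
    # Count down from 5 to 0: the loop body starts at pc=1.
    prog = [(LOADI, 5), (ADDI, -1), (JNZ, 1), (HALT, 0)]
    print(run(prog))   # 0
```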
Citations: 2