2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP): Latest Papers

Towards secure cryptographic software implementation against side-channel power analysis attacks
Pei Luo, Liwei Zhang, Yunsi Fei, A. Ding
Side-channel attacks are a real threat to many embedded cryptographic systems. A commonly used algorithmic countermeasure, random masking, incurs large execution delay and resource overhead. Another countermeasure, operation shuffling (permutation), can mitigate side-channel leakage effectively with minimal overhead. In this paper, we aim to implement operation shuffling in cryptographic algorithms automatically to resist side-channel power analysis attacks. We design a tool to detect independence among statements at the source code level and devise an algorithm for automatic operation shuffling. We test our algorithm on the new SHA-3 standard, Keccak. Results show that the tool effectively implements operation shuffling and significantly reduces side-channel leakage, and can therefore guide automatic secure cryptographic software implementations against differential power analysis attacks.
DOI: https://doi.org/10.1109/ASAP.2015.7245722; pages 144-148; published 2015-07-27
Citations: 12
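
The operation shuffling described in this abstract can be illustrated with a minimal sketch. It assumes the independent statements have already been identified (the paper's detection tool is not reproduced here) and uses the five column parities of Keccak's θ step, which do not depend on one another, as the example. The function name and the returned `order` list are hypothetical and exist only for illustration.

```python
import secrets


def column_parities_shuffled(state):
    """Compute the five column parities of a 5x5 Keccak-style state
    (C[x] = A[x][0] ^ ... ^ A[x][4]) in a fresh random order on every call."""
    order = list(range(5))
    # Fisher-Yates shuffle driven by a cryptographic random number generator.
    for i in range(4, 0, -1):
        j = secrets.randbelow(i + 1)
        order[i], order[j] = order[j], order[i]

    parities = [0] * 5
    for x in order:  # the five computations are independent, so any order is valid
        acc = 0
        for y in range(5):
            acc ^= state[x][y]
        parities[x] = acc
    # `order` is returned only so the example can be inspected; a real
    # implementation would keep it secret.
    return parities, order
```

The result is the same for every execution order, while the moment at which each lane is processed changes from run to run, which is what spreads the leakage over time.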
Custom FPGA-based soft-processors for sparse graph acceleration
Nachiket Kapre
FPGA-based soft processors customized for operations on sparse graphs can deliver significant performance improvements over conventional organizations (ARMv7 CPUs) for bulk synchronous sparse graph algorithms. We develop a stripped-down soft processor ISA to implement specific repetitive operations on graph nodes and edges that are commonly observed in sparse graph computations. In the processing core, we provide hardware support for rapidly fetching and processing state of local graph nodes and edges through spatial address generators and zero-overhead loop iterators. We interconnect a 2D array of these lightweight processors with a packet-switched network-on-chip to enable fine-grained operand routing along the graph edges and provide custom send/receive instructions in the soft processor. We develop the processor RTL using Vivado High-Level Synthesis and also provide an assembler and compilation flow to configure the processor instruction and data memories. We outperform a Microblaze (100MHz on Zedboard) and an NIOS-II/f (100MHz on DE2-115) by 6× (single processor design) as well as the ARMv7 dual-core CPU on the Zynq SoCs by as much as 10× on the Xilinx ZC706 board (100 processor design) across a range of matrix datasets.
DOI: https://doi.org/10.1109/ASAP.2015.7245698; pages 9-16; published 2015-07-27
Citations: 32
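
As a rough software picture of the bulk synchronous sparse graph computation this abstract refers to, the sketch below runs one superstep over a graph stored in compressed sparse row (CSR) form. The CSR layout and the weighted-sum update are assumptions chosen for illustration; the abstract does not specify the soft processor's data layout or instruction set.

```python
def bsp_superstep(row_ptr, col_idx, weights, values):
    """One bulk-synchronous step: every node reads only the previous
    superstep's values and produces a new value (here a weighted sum over
    its incoming edges, i.e. a sparse matrix-vector product in CSR form)."""
    new_values = [0.0] * len(values)
    for node in range(len(values)):
        acc = 0.0
        # Edges of `node` occupy the slice row_ptr[node] .. row_ptr[node + 1].
        for e in range(row_ptr[node], row_ptr[node + 1]):
            acc += weights[e] * values[col_idx[e]]
        new_values[node] = acc
    # Returning a fresh list plays the role of the barrier: updated state
    # becomes visible only once every node has been processed.
    return new_values
```

The inner loop over a node's edges is the kind of repetitive, address-driven work that the abstract says is handled by address generators and zero-overhead loop iterators.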
Hardware acceleration of Private Information Retrieval protocols using GPUs
Mihai Maruseac, Gabriel Ghinita, Ming Ouyang, R. Rughinis
Private Information Retrieval (PIR) protocols allow users to search for data items stored at an untrusted server without disclosing the search attributes to the server. Several computational PIR protocols provide cryptographic-strength guarantees for the privacy of users, building upon well-known hard mathematical problems such as the factorisation of large integers. Unfortunately, the computationally intensive nature of these solutions results in significant performance overhead, preventing their adoption in practice. In this paper, we employ graphics processing units (GPUs) to speed up the cryptographic operations required by PIR. We identify the challenges that arise when using GPUs for PIR and propose solutions to address them. To the best of our knowledge, this is the first work to use GPUs for efficient private information retrieval, and an important first step towards GPU-based acceleration of a broader range of secure data operations. Our experimental evaluation shows that GPUs improve performance by more than an order of magnitude.
DOI: https://doi.org/10.1109/ASAP.2015.7245719; pages 120-127; published 2015-07-27
Citations: 1
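
To make the idea of computational PIR concrete, here is a toy sketch of one classical construction based on quadratic residuosity, whose security rests on the hardness of factoring N = p*q. The abstract does not say which protocol the paper accelerates, so this choice is an assumption, and the tiny primes used in any demonstration offer no security. All function names are hypothetical.

```python
import math
import random


def is_qr(x, p, q):
    """True if x is a quadratic residue modulo N = p*q (Euler's criterion
    modulo each prime factor; only the client knows p and q)."""
    return pow(x, (p - 1) // 2, p) == 1 and pow(x, (q - 1) // 2, q) == 1


def make_query(num_cols, target_col, p, q, rng):
    """Client: one number per column; a non-residue only at target_col."""
    n = p * q
    query = []
    for col in range(num_cols):
        while True:
            r = rng.randrange(2, n)
            if math.gcd(r, n) != 1:
                continue
            if col != target_col:
                query.append(r * r % n)  # a square is always a residue
                break
            # Non-residue modulo both primes, so that without p and q it
            # is hard to tell apart from a residue.
            if pow(r, (p - 1) // 2, p) == p - 1 and pow(r, (q - 1) // 2, q) == q - 1:
                query.append(r)
                break
    return query


def answer(db_bits, query, n):
    """Server: per row, multiply q_j (bit 1) or q_j squared (bit 0) over all columns."""
    replies = []
    for row in db_bits:
        z = 1
        for bit, qj in zip(row, query):
            z = z * (qj if bit else qj * qj) % n
        replies.append(z)
    return replies


def decode(replies, target_row, p, q):
    """Client: the wanted bit is 1 exactly when the reply is a non-residue."""
    return 0 if is_qr(replies[target_row], p, q) else 1
```

The server-side work is a long chain of modular multiplications over the whole database, which is the kind of cost the abstract describes as the obstacle to practical adoption.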
Atomic stream computation unit based on micro-thread level parallelism
Nasim Farahini, A. Hemani
The increasing demand for higher image resolution and communication bandwidth requires streaming applications to deal with ever larger datasets. Further, with technology scaling, the cost of moving data is falling more slowly than the cost of computing. These trends motivate the proposed micro-architectural reorganization of stream processors: the stream computation is divided into functional computation, address constraint computation, and address generation, and independent, distributed micro-threads are deployed to implement them. This scheme is an alternative to parallelizing them at the instruction level. The proposed scheme has two benefits: more efficient sequencer logic, and energy savings in address generation and transportation. These benefits are quantified for a set of streaming applications and show average improvements of 39% in the silicon efficiency of the sequencer logic and 23% in total computational efficiency.
DOI: https://doi.org/10.1109/ASAP.2015.7245700; pages 25-29; published 2015-07-27
Citations: 2
Application-set driven exploration for custom processor architectures
M. A. Arslan, F. Gruian, K. Kuchcinski
Custom architectures are often adopted as more efficient alternatives to general purpose processors in terms of performance and power. However, the design of such architectures requires experts both in hardware and the application domain. In this paper we propose a method for speeding up the design space exploration. Our method, based on Pareto points, identifies sets of solutions in terms of scalar units and vector units of certain length, fulfilling the throughput constraints for each application in a given set. Architectures can then be selected by combining these solutions, as starting points for a more thorough, model-based evaluation.
DOI: https://doi.org/10.1109/ASAP.2015.7245710; pages 70-71; published 2015-07-27
Citations: 2
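
The Pareto-point selection mentioned in this abstract can be sketched as a simple dominance filter. The representation below, tuples of costs to minimise such as (number of scalar units, vector length), is an assumption for illustration; the abstract does not give the paper's cost model or how throughput constraints are checked.

```python
def pareto_front(points):
    """Return the points not dominated by any other point, where each point
    is a tuple of costs to minimise, e.g. (scalar_units, vector_length)."""
    front = []
    for p in points:
        dominated = any(
            # `other` dominates p if it is no worse in every cost and differs from p.
            all(o <= c for o, c in zip(other, p)) and other != p
            for other in points
        )
        if not dominated:
            front.append(p)
    return front
```

For example, given candidates that all meet an application's throughput constraint, the filter keeps only those configurations for which no cheaper alternative exists in every dimension.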
Balance power leakage to fight against side-channel analysis at gate level in FPGAs
Xin Fang, Pei Luo, Yunsi Fei, M. Leeser
Side-channel attacks are a serious threat to the security of embedded cryptographic systems, and various countermeasures have been devised to mitigate the leakage. Power-balancing technologies such as wave dynamic differential logic (WDDL) aim to balance power consumption by introducing differential logic. However, differences in routing length lead to differences in wire capacitance, which weakens the power-balancing countermeasure. In this paper, we further balance the power of differential signals by manipulating the lower-level primitives and placement constraints on a Field Programmable Gate Array (FPGA). We choose the Advanced Encryption Standard (AES) as the encryption algorithm and apply the Hamming weight model to demonstrate the amount of leakage for different implementations. Results show that our method not only efficiently mitigates side-channel leakage but also saves FPGA logic block resources and dynamic power consumption.
DOI: https://doi.org/10.1109/ASAP.2015.7245724; pages 154-155; published 2015-07-27
Citations: 5
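
The reasoning behind differential logic and the Hamming weight model can be shown with a small idealised sketch: if every bit travels together with its complement, the number of 1 bits on the wires is constant whatever the data. The helper names and the 8-bit width are assumptions for illustration, and the model ignores exactly the wire-capacitance mismatch that the abstract identifies as the practical weakness.

```python
def hamming_weight(value):
    """Number of 1 bits; the Hamming weight model treats power as proportional to it."""
    return bin(value).count("1")


def dual_rail(value, width=8):
    """Encode each bit b as the pair (b, not b), as differential logic does.
    The true bits occupy the upper half of the result, the complements the lower half."""
    mask = (1 << width) - 1
    return ((value & mask) << width) | (~value & mask)
```

Under this model a single-rail byte leaks between 0 and 8 units depending on its value, whereas its dual-rail encoding always has weight 8.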
Multi-task support for security-enabled embedded processors
Tedy Thomas, Arman Pouraghily, Kekai Hu, R. Tessier, T. Wolf
Embedded systems require low overhead security approaches to ensure that they are protected from attacks. In this paper, we propose a hardware-based approach to secure the operation of an embedded processor instruction-by-instruction, where deviations from expected program behavior are detected within the execution of an instruction. These security-enabled embedded processors provide effective defenses against common attacks, such as stack smashing. Previous work in this area has focused on monitoring a single task on a CPU while here we present a novel hardware monitoring system that can monitor multiple active tasks in an operating-system-based platform. The hardware monitor is able to track context switches that occur in the operating system and ensure that monitoring is performed continuously, thus ensuring system security. We present the design of our system and results obtained from a prototype implementation of the system on an Altera DE4 FPGA board. We demonstrate in hardware that applications can be monitored at the instruction level without execution slowdown and stack smashing attacks can be defeated using our system.
DOI: https://doi.org/10.1109/ASAP.2015.7245721; pages 136-143; published 2015-07-27
Citations: 7
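
A software model can illustrate the multi-task monitoring idea from this abstract: the monitor checks each executed instruction against the expected behaviour of the current task and saves or restores its position when the operating system switches tasks. The class, its graph format (address mapped to the set of addresses allowed next), and the rule that a task's first instruction is accepted unchecked are all hypothetical simplifications, not the paper's hardware design.

```python
class MultiTaskMonitor:
    """Toy per-task control-flow monitor with context-switch support."""

    def __init__(self, graphs):
        # graphs: task id -> {address: set of addresses allowed to execute next}
        self.graphs = graphs
        self.saved = {}    # last accepted address of each suspended task
        self.task = None   # task currently being monitored
        self.prev = None   # last accepted address of the current task

    def context_switch(self, new_task):
        """Save the outgoing task's monitoring state and restore the incoming one."""
        if self.task is not None:
            self.saved[self.task] = self.prev
        self.task = new_task
        self.prev = self.saved.get(new_task)

    def check(self, address):
        """Return True if executing `address` next matches the expected behaviour."""
        allowed = self.graphs[self.task].get(self.prev, set())
        ok = self.prev is None or address in allowed
        if ok:
            self.prev = address
        return ok
```

Because each task's position is kept across switches, a jump to an unexpected address (as in a stack smashing attack) is flagged even when other tasks ran in between.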
Mixed-length SIMD code generation for VLIW architectures with multiple native vector-widths
Erkan Diken, M. O'Riordan, Roel Jordans, L. Józwiak, H. Corporaal, D. Moloney
The degree of data-level parallelism (DLP) is not fixed; it varies with the computational characteristics of each application. In contrast, most processors today include single-width SIMD (vector) hardware to exploit DLP. Single-width SIMD architectures may not be optimal for applications with varying DLP, and they may cause performance and energy inefficiency. We propose using VLIW processors with multiple native vector widths to better serve applications with changing DLP. SHAVE is an example of such a VLIW processor and provides hardware support for native 32-bit and 128-bit wide vector operations. This paper investigates and implements mixed-length SIMD code generation support for the SHAVE processor. More specifically, we target generating 32-bit and 128/64-bit SIMD code for the native 32-bit and 128-bit wide vector units of the SHAVE processor. In this way, we improve the performance of compiler-generated SIMD code by reducing the number of overhead operations and by increasing SIMD hardware utilization. Experimental results demonstrate that our methodology, implemented in the compiler, improves the performance of synthetic benchmarks by up to 47%.
DOI: https://doi.org/10.1109/ASAP.2015.7245732; pages 181-188; published 2015-07-27
Citations: 5
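
One way to picture the benefit of multiple vector widths is a greedy plan that covers a loop's elements with the widest operation that still fits, so a short remainder goes to a narrower vector operation rather than to padding or scalar cleanup. This is an illustrative planning sketch under assumed widths of 128, 64, and 32 bits; it is not the compiler algorithm from the paper, and the function name is hypothetical.

```python
def plan_vector_ops(num_elements, element_bits, widths=(128, 64, 32)):
    """Greedily cover num_elements with the widest vector operation that fits.

    Returns (plan, leftover): plan is a list of (width_bits, count) pairs and
    leftover is the number of elements no listed width could absorb."""
    plan = []
    remaining = num_elements
    for width in widths:
        lanes = width // element_bits  # elements processed by one operation
        if lanes == 0:
            continue  # element type is wider than this vector width
        count, remaining = divmod(remaining, lanes)
        if count:
            plan.append((width, count))
    return plan, remaining
```

For 30 elements of 8 bits each, the plan uses one 128-bit, one 64-bit, and one 32-bit operation and leaves 2 elements over, instead of padding a second 128-bit operation.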
Accelerating bootstrapping in FHEW using GPUs
M. Lee, Yongje Lee, J. Cheon, Y. Paek
GPU usage is no longer limited to graphics workloads, and a wide variety of applications take advantage of the flexibility of GPUs to accelerate computation. One of the most prominent emerging applications is fully homomorphic encryption (FHE), which enables arbitrary computations on encrypted data. Despite much research effort, FHE cannot yet be considered practical because of its enormous computational cost, especially in the bootstrapping procedure. In this paper, we use GPUs to accelerate the recently proposed fast bootstrapping method in the FHEW scheme, as a case study of an FHE scheme. To optimize it, we explored the reference code and profiled it to find candidates for acceleration. Based on the profiling results, combined with a more flexible tradeoff method, we optimized the bootstrapping algorithm in FHEW using the GPU and CUDA's programming model. The empirical results show that, after optimization, bootstrapping an FHEW ciphertext takes less than 0.11 seconds.
DOI: https://doi.org/10.1109/ASAP.2015.7245720; pages 128-135; published 2015-07-27
Citations: 10
Comparative analysis of OpenCL vs. HDL with image-processing kernels on Stratix-V FPGA
K. Hill, S. Craciun, A. George, H. Lam
Application development with hardware description languages (HDLs) such as VHDL or Verilog involves numerous productivity challenges, limiting the potential impact of reconfigurable computing (RC) with FPGAs in high-performance computing. Major challenges with HDL design include steep learning curves, large and complex codes, long compilation times, and a lack of development standards across platforms. A relative newcomer to RC, the Open Computing Language (OpenCL) reduces productivity hurdles by providing a platform-independent, C-based programming language. In this study, we conduct a performance and productivity comparison between three image-processing kernels (Canny edge detector, Sobel filter, and SURF feature extractor) developed using Altera's SDK for OpenCL and traditional VHDL. Our results show that the VHDL designs used resources more efficiently (59% to 70% less logic); however, both OpenCL and VHDL designs resulted in similar timing constraints (255 MHz < fmax < 325 MHz). Furthermore, we observed a 6× increase in productivity when using OpenCL development tools, as well as the ability to efficiently port the same OpenCL designs without change to three different RC platforms, with similar performance in terms of frequency and resource utilization.
{"title":"Comparative analysis of OpenCL vs. HDL with image-processing kernels on Stratix-V FPGA","authors":"K. Hill, S. Craciun, A. George, H. Lam","doi":"10.1109/ASAP.2015.7245733","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245733","url":null,"abstract":"Application development with hardware description languages (HDLs) such as VHDL or Verilog involves numerous productivity challenges, limiting the potential impact of reconfigurable computing (RC) with FPGAs in high-performance computing. Major challenges with HDL design include steep learning curves, large and complex codes, long compilation times, and lack of development standards across platforms. A relative newcomer to RC, the Open Computing Language (OpenCL) reduces productivity hurdles by providing a platform-independent, C-based programming language. In this study, we conduct a performance and productivity comparison between three image-processing kernels (Canny edge detector, Sobel filter, and SURF feature-extractor) developed using Altera's SDK for OpenCL and traditional VHDL. Our results show that VHDL designs achieved a more efficient use of resources (59% to 70% less logic), however, both OpenCL and VHDL designs resulted in similar timing constraints (255MHz < fmax < 325MHz). Furthermore, we observed a 6× increase in productivity when using OpenCL development tools, as well as the ability to efficiently port the same OpenCL designs without change to three different RC platforms, with similar performance in terms of frequency and resource utilization.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"1 1","pages":"189-193"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81013115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 48