Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing: Latest Publications
Investigating automatic vectorization for real-time 3D scene understanding
A. Nica, E. Vespa, Pablo González de Aledo Marugán, P. Kelly
Simultaneous Localization And Mapping (SLAM) is the problem of building a representation of a geometric space while simultaneously estimating the observer's location within the space. While this seems to be a chicken-and-egg problem, several algorithms have appeared in the last decades that approximately and iteratively solve this problem. SLAM algorithms are tailored to the available resources, hence aimed at balancing the precision of the map with the constraints that the computational platform imposes and the desire to obtain real-time results. Working with KinectFusion, an established SLAM implementation, we explore in this work the vectorization opportunities present in this scenario, with the goal of using the CPU to its full potential. Using ISPC, an automatic vectorization tool, we produce a partially vectorized version of KinectFusion. Along the way we explore a number of optimization strategies, among which tiling to exploit ray-coherence and outer loop vectorization, obtaining up to 4x speed-up over the baseline on an 8-wide vector machine.
DOI: 10.1145/3178433.3178438 · Published: 2018-02-24 · Citations: 0
A Data Layout Transformation for Vectorizing Compilers
Arsène Pérard-Gayot, Richard Membarth, P. Slusallek, Simon Moll, Roland Leißa, Sebastian Hack
Modern processors are often equipped with vector instruction sets. Such instructions operate on multiple elements of data at once, and greatly improve performance for specific applications. A programmer has two options to take advantage of these instructions: writing manually vectorized code, or using an auto-vectorizing compiler. In the latter case, the programmer only has to place annotations to instruct the auto-vectorizing compiler to vectorize a particular piece of code. Thanks to auto-vectorization, the source program remains portable, and the programmer can focus on the task at hand instead of the low-level details of intrinsics programming. However, the performance of the vectorized program strongly depends on the precision of the analyses performed by the vectorizing compiler. In this paper, we improve the precision of these analyses by selectively splitting stack-allocated variables of a structure or aggregate type. Without this optimization, automatic vectorization slows the execution down compared to the scalar, non-vectorized code. When this optimization is enabled, we show that the vectorized code can be as fast as hand-optimized, manually vectorized implementations.
DOI: 10.1145/3178433.3178440 · Published: 2018-02-24 · Citations: 2
Small SIMD Matrices for CERN High Throughput Computing
F. Lemaitre, Benjamin Couturier, L. Lacassagne
System tracking is an old problem and has been heavily optimized throughout the past. However, in High Energy Physics, many small systems are tracked in real-time using Kalman filtering and no implementation satisfying those constraints currently exists. In this paper, we present a code generator used to speed up Cholesky Factorization and Kalman Filter for small matrices. The generator is easy to use and produces portable and heavily optimized code. We focus on current SIMD architectures (SSE, AVX, AVX512, Neon, SVE, Altivec and VSX). Our Cholesky factorization outperforms any existing libraries: from 3× to 10× faster than MKL. The Kalman Filter is also faster than existing implementations, and achieves 4·10⁹ iter/s on a 2x24C Intel Xeon.
DOI: 10.1145/3178433.3178434 · Published: 2018-02-24 · Citations: 4
MIPP: a Portable C++ SIMD Wrapper and its use for Error Correction Coding in 5G Standard
Adrien Cassagne, Olivier Aumage, Denis Barthou, Camille Leroux, C. Jégo
Error correction code (ECC) processing has so far been performed on dedicated hardware for previous generations of mobile communication standards, to meet latency and bandwidth constraints. As the 5G mobile standard, and its associated channel coding algorithms, are now being specified, modern CPUs are progressing to the point where software channel decoders can viably be contemplated. A key aspect in reaching this transition point is to get the most of CPUs SIMD units on the decoding algorithms being pondered for 5G mobile standards. The nature and diversity of such algorithms requires highly versatile programming tools. This paper demonstrates the virtues and versatility of our MIPP SIMD wrapper in implementing a high performance portfolio of key ECC decoding algorithms.
DOI: 10.1145/3178433.3178435 · Published: 2018-02-24 · Citations: 14
SIMDization of Small Tensor Multiplication Kernels for Wide SIMD Vector Processors
Christopher I. Rodrigues, Amarin Phaosawasdi, Peng Wu
Developers often rely on automatic vectorization to speed up fine-grained data-parallel code. However, for loop nests where the loops are shorter than the processor's SIMD width, automatic vectorization performs poorly. Vectorizers attempt to vectorize a single short loop, using (at best) a fraction of the processor's SIMD capacity. It is not straightforward to vectorize multiple nested loops together because they typically have memory accesses with multiple strides, which conventional methods cannot profitably vectorize. We present a solution in the context of compiling small tensor multiplication. Our compiler vectorizes several inner loops in order to utilize wide vector parallelism. To handle complicated strides, we devise a vectorizable form of loop tiling. The compiler transforms loops to improve memory locality, then caches tiles of data in vector registers. Strided access patterns are transformed into permute instructions. We show that our compiler is able to significantly speed up many small tensor multiplication algorithms. It judges 13.5% of a randomly generated sample of algorithms to be profitable to vectorize. On these, it generates code 1.55x as fast on average as that produced by GCC's state-of-the-art vectorizer, with a maximum speedup of 10x. We discuss potential extensions to vectorize more general algorithms.
DOI: 10.1145/3178433.3178436 · Published: 2018-02-24 · Citations: 3
Vectorization of a spectral finite-element numerical kernel
S. Jubertie, F. Dupros, F. D. Martin
In this paper, we present an optimized implementation of the Finite-Element Methods numerical kernel for SIMD vectorization. A typical application is the modelling of seismic wave propagation. In this case, the computations at the element level are generally based on nested loops where the memory accesses are non-contiguous. Moreover, the back and forth from the element level to the global level (e.g., assembly phase) seriously hinders automatic vectorization by compilers and efficient reuse of data at the cache memory levels. This is particularly true when the problem under study relies on an unstructured mesh. The application proxies used for our experiments were extracted from the EFISPEC code that implements the spectral finite-element method to solve the elastodynamic equations. We underline that the intra-node performance may be further improved. Additionally, we show that standard compilers such as GNU GCC, Clang and Intel ICC are unable to perform automatic vectorization even when the nested loops were reorganized or when SIMD pragmas were added. Due to the irregular memory access pattern, we introduce a dedicated strategy to squeeze the maximum performance out of the SIMD units. Experiments are carried out on Intel Broadwell and Skylake platforms that respectively offer AVX2 and AVX-512 SIMD units. We believe that our vectorization approach may be generic enough to be adapted to other codes.
DOI: 10.1145/3178433.3178441 · Published: 2018-02-24 · Citations: 5
Usuba: Optimizing & Trustworthy Bitslicing Compiler
Darius Mercadier, Pierre-Évariste Dagand, L. Lacassagne, Gilles Muller
Bitslicing is a programming technique commonly used in cryptography that consists in implementing a combinational circuit in software. It results in a massively parallel program immune to cache-timing attacks by design. However, writing a program in bitsliced form requires extreme minutia. This paper introduces Usuba, a synchronous dataflow language producing bitsliced C code. Usuba is both a domain-specific language -- providing syntactic support for the implementation of cryptographic algorithms -- as well as a domain-specific compiler -- taking advantage of well-defined semantics invariants to perform various optimizations before handing the generated code to an (optimizing) C compiler. On the Data Encryption Standard (DES) algorithm, we show that Usuba outperforms a reference, hand-tuned implementation by 15% (using Intel's 64-bit general-purpose registers and depending on the underlying C compiler) whilst our implementation also transparently supports modern SIMD extensions (SSE, AVX, AVX-512), other architectures (ARM Neon, IBM Altivec) as well as multicore processors through an OpenMP backend.
DOI: 10.1145/3178433.3178437 · Published: 2018-02-24 · Citations: 9
Ikra-Cpp: A C++/CUDA DSL for Object-Oriented Programming with Structure-of-Arrays Layout
M. Springer, H. Masuhara
Structure of Arrays (SOA) is a well-studied data layout technique for SIMD architectures. Previous work has shown that it can speed up applications in high-performance computing by several factors compared to a traditional Array of Structures (AOS) layout. However, most programmers are used to AOS-style programming, which is more readable and easier to maintain. We present Ikra-Cpp, an embedded DSL for object-oriented programming in C++/CUDA. Ikra-Cpp's notation is very close to standard AOS-style C++ code, but data is laid out as SOA. This gives programmers the performance benefit of SOA and the expressiveness of AOS-style object-oriented programming at the same time. Ikra-Cpp is well integrated with C++ and lets programmers use C++ notation and syntax for classes, fields, member functions, constructors and instance creation.
DOI: 10.1145/3178433.3178439 · Published: 2018-02-24 · Citations: 7