
WPMVP '14 Latest Publications

High level transforms for SIMD and low-level computer vision algorithms
Pub Date: 2014-02-16 DOI: 10.1145/2568058.2568067
L. Lacassagne, D. Etiemble, A. Zahraee, A. Dominguez, P. Vezolle
This paper presents a review of algorithmic transforms, called High Level Transforms, for IBM, Intel and ARM SIMD multicore processors, used to accelerate the implementation of low-level image processing algorithms. We show that these optimizations provide a significant acceleration. A first evaluation of the 512-bit SIMD Xeon Phi is also presented. We stress that the combination of optimizations leading to the best execution time cannot be predicted, and thus systematic benchmarking is mandatory. Once the best configuration is found for each architecture, a comparison of these performances is presented. The Harris point detection operator is selected as being representative of low-level image processing and computer vision algorithms. Being composed of five convolutions, it is more complex than a simple filter and offers more opportunities to combine optimizations. The presented work can scale across a wide range of codes using 2D stencils and convolutions.
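
For readers unfamiliar with the operator, the sketch below gives a scalar C++ outline of the kind of 2D stencil and Harris-response computation such transforms target. The image layout, border handling, function names and the constant kappa are illustrative assumptions, not code from the paper.

```cpp
// Scalar sketch of the building blocks behind a Harris-style operator:
// a 3x3 convolution (2D stencil) and the per-pixel Harris response
// R = det(M) - k * trace(M)^2 computed from smoothed gradient products.
#include <cstddef>

// 3x3 convolution over a row-major single-channel image; borders are skipped.
void conv3x3(const float* in, float* out, int w, int h, const float k[9]) {
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            float s = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    s += k[(dy + 1) * 3 + (dx + 1)] * in[(y + dy) * w + (x + dx)];
            out[y * w + x] = s;
        }
}

// Point-wise Harris response from smoothed gradient products Sxx, Syy, Sxy.
void harris_response(const float* sxx, const float* syy, const float* sxy,
                     float* r, std::size_t n, float kappa = 0.04f) {
    for (std::size_t i = 0; i < n; ++i) {
        float det   = sxx[i] * syy[i] - sxy[i] * sxy[i];
        float trace = sxx[i] + syy[i];
        r[i] = det - kappa * trace * trace;
    }
}
```
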
{"title":"High level transforms for SIMD and low-level computer vision algorithms","authors":"L. Lacassagne, D. Etiemble, A. Zahraee, A. Dominguez, P. Vezolle","doi":"10.1145/2568058.2568067","DOIUrl":"https://doi.org/10.1145/2568058.2568067","url":null,"abstract":"This paper presents a review of algorithmic transforms called High Level Transforms for IBM, Intel and ARM SIMD multicore processors to accelerate the implementation of low level image processing algorithms. We show that these optimizations provide a significant acceleration. A first evaluation of 512-bit SIMD Xeon- Phi is also presented. We focus on the point that the combination of optimizations leading to the best execution time cannot be predicted, and thus, systematic benchmarking is mandatory. Once the best configuration is found for each architecture, a comparison of these performances is presented. The Harris points detection operator is selected as being representative of low level image processing and computer vision algorithms. Being composed of five convolutions, it is more complex than a simple filter and enables more opportunities to combine optimizations. The presented work can scale across a wide range of codes using 2D stencils and convolutions.","PeriodicalId":411100,"journal":{"name":"WPMVP '14","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129752055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 25
A SIMD programming model for dart, javascript, and other dynamically typed scripting languages
Pub Date: 2014-02-16 DOI: 10.1145/2568058.2568066
J. McCutchan, Haitao Feng, Nicholas D. Matsakis, Zachary R. Anderson, P. Jensen
It has not been possible to take advantage of the SIMD co-processors available in all x86 and most ARM processors shipping today from dynamically typed scripting languages. Web browsers have become a mainstream platform for delivering large and complex applications with feature sets and performance comparable to native applications, and programmers must choose between Dart and JavaScript when writing web programs. This paper introduces an explicit SIMD programming model for Dart and JavaScript, and we show that it can be compiled to efficient x86/SSE or ARM/Neon code by both the Dart and JavaScript virtual machines, achieving a 300%-600% speed increase across a variety of benchmarks. The result of this work is that more sophisticated and performant applications can be built to run in web browsers. The ideas introduced in this paper can also be used in other dynamically typed scripting languages to provide a similarly performant interface to SIMD co-processors.
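
As a rough C++ analogue of the explicit 4-lane value type such a programming model exposes, consider the sketch below. The type and method names are invented for illustration; they do not reproduce the actual Dart or JavaScript SIMD API.

```cpp
// Illustrative 4-lane float vector with value semantics and lane-wise
// operators, in the spirit of an explicit SIMD programming model.
#include <array>
#include <cstdio>

struct Float32x4 {
    std::array<float, 4> lane;

    static Float32x4 splat(float v) { return {{v, v, v, v}}; }

    friend Float32x4 operator+(Float32x4 a, Float32x4 b) {
        return {{a.lane[0] + b.lane[0], a.lane[1] + b.lane[1],
                 a.lane[2] + b.lane[2], a.lane[3] + b.lane[3]}};
    }
    friend Float32x4 operator*(Float32x4 a, Float32x4 b) {
        return {{a.lane[0] * b.lane[0], a.lane[1] * b.lane[1],
                 a.lane[2] * b.lane[2], a.lane[3] * b.lane[3]}};
    }
};

int main() {
    Float32x4 a{{1.0f, 2.0f, 3.0f, 4.0f}};
    Float32x4 b = Float32x4::splat(0.5f);
    Float32x4 c = a * b + Float32x4::splat(1.0f);  // one 4-wide expression
    std::printf("%g %g %g %g\n", c.lane[0], c.lane[1], c.lane[2], c.lane[3]);
}
```
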
{"title":"A SIMD programming model for dart, javascript,and other dynamically typed scripting languages","authors":"J. McCutchan, Haitao Feng, Nicholas D. Matsakis, Zachary R. Anderson, P. Jensen","doi":"10.1145/2568058.2568066","DOIUrl":"https://doi.org/10.1145/2568058.2568066","url":null,"abstract":"It has not been possible to take advantage of the SIMD co-processors available in all x86 and most ARM processors shipping today in dynamically typed scripting languages. Web browsers have become a mainstream platform to deliver large and complex applications with feature sets and performance comparable to native applications, programmers must choose between Dart and JavaScript when writing web programs. This paper introduces an explicit SIMD programming model for Dart and JavaScript, we show that it can be compiled to efficient x86/SSE or ARM/Neon code by both Dart and JavaScript virtual machines achieving a 300%-600% speed increase across a variety of benchmarks. The result of this work is that more sophisticated and performant applications can be built to run in web browsers. The ideas introduced in this paper can also be used in other dynamically typed scripting languages to provide a similarly performant interface to SIMD co-processors.","PeriodicalId":411100,"journal":{"name":"WPMVP '14","volume":"160 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113997529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
OpenCL framework for ARM processors with NEON support
Pub Date: 2014-02-16 DOI: 10.1145/2568058.2568064
Gangwon Jo, W. J. Jeon, Wookeun Jung, Gordon Taft, Jaejin Lee
State-of-the-art ARM processors provide multiple cores and SIMD instructions. OpenCL is a promising programming model for utilizing such parallel processing capability because of its SPMD programming model and built-in vector support. Moreover, it provides portability between multicore ARM processors and accelerators in embedded systems. In this paper, we introduce the design and implementation of an efficient OpenCL framework for multicore ARM processors. Computational tasks in a program are implemented as OpenCL kernels and run on all CPU cores in parallel by our OpenCL framework. Vector operations and built-in functions in OpenCL kernels are optimized using the NEON SIMD instruction set. We evaluate our OpenCL framework using 37 benchmark applications. The results show that our approach is effective and promising.
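
To illustrate the kind of lowering described above, the hedged sketch below shows a hand-written NEON loop for an element-wise multiply with a scalar fallback. It only illustrates NEON intrinsics usage; it is not code from the framework, and the function name is an invention.

```cpp
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Element-wise multiply of two float arrays, 4 lanes at a time on NEON,
// with a scalar loop serving as remainder handling and non-NEON fallback.
void vec_mul(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);        // load 4 floats
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vmulq_f32(va, vb));    // 4 multiplies at once
    }
#endif
    for (; i < n; ++i)
        out[i] = a[i] * b[i];
}
```
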
{"title":"OpenCL framework for ARM processors with NEON support","authors":"Gangwon Jo, W. J. Jeon, Wookeun Jung, Gordon Taft, Jaejin Lee","doi":"10.1145/2568058.2568064","DOIUrl":"https://doi.org/10.1145/2568058.2568064","url":null,"abstract":"The state-of-the-art ARM processors provide multiple cores and SIMD instructions. OpenCL is a promising programming model for utilizing such parallel processing capability because of its SPMD programming model and built-in vector support. Moreover, it provides portability between multicore ARM processors and accelerators in embedded systems. In this paper, we introduce the design and implementation of an efficient OpenCL framework for multicore ARM processors. Computational tasks in a program are implemented as OpenCL kernels and run on all CPU cores in parallel by our OpenCL framework. Vector operations and built-in functions in OpenCL kernels are optimized using the NEON SIMD instruction set. We evaluate our OpenCL framework using 37 benchmark applications. The result shows that our approach is effective and promising.","PeriodicalId":411100,"journal":{"name":"WPMVP '14","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130191727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
Exploring the vectorization of python constructs using pythran and boost SIMD
Pub Date: 2014-02-16 DOI: 10.1145/2568058.2568060
S. Guelton, J. Falcou, Pierrick Brunet
The Python language is highly dynamic, most notably due to late binding. As a consequence, programs using Python typically run an order of magnitude slower than their C counterparts. It is also a high-level language whose semantics can be made more static, without much change from the user's point of view, in the case of mathematical applications. In that case, the language provides several vectorization opportunities that are studied in this paper and evaluated in the context of Pythran, an ahead-of-time compiler that turns Python modules into C++ meta-programs.
{"title":"Exploring the vectorization of python constructs using pythran and boost SIMD","authors":"S. Guelton, J. Falcou, Pierrick Brunet","doi":"10.1145/2568058.2568060","DOIUrl":"https://doi.org/10.1145/2568058.2568060","url":null,"abstract":"The Python language is highly dynamic, most notably due to late binding. As a consequence, programs using Python typically run an order of magnitude slower than their C counterpart. It is also a high level language whose semantic can be made more static without much change from a user point of view in the case of mathematical applications. In that case, the language provides several vectorization opportunities that are studied in this paper, and evaluated in the context of Pythran, an ahead-of-time compiler that turns Python module into C++ meta-programs.","PeriodicalId":411100,"journal":{"name":"WPMVP '14","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122032679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Sierra: a SIMD extension for C++
Pub Date: 2014-02-16 DOI: 10.1145/2568058.2568062
Roland Leißa, Immanuel Haffner, Sebastian Hack
Nowadays, SIMD hardware is omnipresent in computers. Nonetheless, many software projects hardly make use of SIMD instructions: applications are usually written in general-purpose languages like C++. However, general-purpose languages provide only poor abstractions for SIMD programming, enforcing an error-prone, assembly-like programming style. Data-parallel languages are an alternative. They indeed offer more convenience when targeting SIMD architectures, but they introduce their own set of problems. In particular, programmers are often unwilling to port their working C++ code to a new programming language. In this paper we present Sierra, a SIMD extension for C++. It combines the full power of C++ with an intuitive and effective way to address SIMD hardware. With Sierra, the programmer can write efficient, portable and maintainable code. It is particularly easy to enhance existing code to run efficiently on SIMD machines. In contrast to prior approaches, the programmer has explicit control over the involved vector lengths.
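
For contrast with Sierra's language-level approach, the sketch below shows the intrinsics-level SSE style the abstract characterizes as error-prone and assembly-like: a saxpy loop hand-vectorized over 4-wide float lanes. It is purely illustrative and not taken from the paper.

```cpp
#include <cstddef>
#include <immintrin.h>

// y[i] = a * x[i] + y[i], hand-vectorized with 128-bit SSE intrinsics.
void saxpy_sse(float a, const float* x, float* y, std::size_t n) {
    __m128 va = _mm_set1_ps(a);                      // broadcast a to 4 lanes
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);             // unaligned 4-wide loads
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
    for (; i < n; ++i)                               // scalar tail
        y[i] = a * x[i] + y[i];
}
```
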
{"title":"Sierra: a SIMD extension for C++","authors":"Roland Leißa, Immanuel Haffner, Sebastian Hack","doi":"10.1145/2568058.2568062","DOIUrl":"https://doi.org/10.1145/2568058.2568062","url":null,"abstract":"Nowadays, SIMD hardware is omnipresent in computers. Nonetheless, many software projects make hardly use of SIMD instructions: Applications are usually written in general-purpose languages like C++. However, general-purpose languages only provide poor abstractions for SIMD programming enforcing an error-prone, assembly-like programming style. An alternative are data-parallel languages. They indeed offer more convenience to target SIMD architectures but introduce their own set of problems. In particular, programmers are often unwilling to port their working C++ code to a new programming language.\u0000 In this paper we present Sierra: a SIMD extension for C++. It combines the full power of C++ with an intuitive and effective way to address SIMD hardware. With Sierra, the programmer can write efficient, portable and maintainable code. It is particularly easy to enhance existing code to run efficiently on SIMD machines.\u0000 In contrast to prior approaches, the programmer has explicit control over the involved vector lengths.","PeriodicalId":411100,"journal":{"name":"WPMVP '14","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133562331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
Writing scalable SIMD programs with ISPC
Pub Date: 2014-02-16 DOI: 10.1145/2568058.2568065
James C. Brodman, Dmitry Babokin, I. Filippov, P. Tu
Modern processors contain many resources for parallel execution. In addition to having multiple cores, processors can also contain vector functional units that are capable of performing a single operation on multiple inputs in parallel. Taking advantage of this vector hardware is essential to obtaining peak performance on a machine, but it is often challenging for programmers to do so. This paper presents a performance study of compiling several benchmarks from the domains of computer graphics, financial modeling, and high-performance computing for different vector instruction sets using the Intel SPMD Program Compiler (ispc), an alternative to compiler auto-vectorization of scalar code or hand-written vector code with intrinsics. ispc is both a language and a compiler that produces high-quality code for SIMD CPU vector extensions such as Intel Streaming SIMD Extensions (SSE), Intel Advanced Vector Extensions (AVX), or ARM NEON. We present the results of compiling the same ispc source program for various targets. The performance of the resulting ispc versions is compared to that of scalar C++ code, and we also examine the scalability of the benchmarks when targeting wider vector units.
{"title":"Writing scalable SIMD programs with ISPC","authors":"James C. Brodman, Dmitry Babokin, I. Filippov, P. Tu","doi":"10.1145/2568058.2568065","DOIUrl":"https://doi.org/10.1145/2568058.2568065","url":null,"abstract":"Modern processors contain many resources for parallel execution. In addition to having multiple cores, processors can also contain vector functional units that are capable of performing a single operation on multiple inputs in parallel. Taking advantage of this vector hardware is essential to obtaining peak performance on a machine, but it is often challenging for programmers to do so.\u0000 This paper presents a performance study of compiling several benchmarks from the domains of computer graphics, financial modeling, and high-performance computing for different vector instruction sets using the Intel SPMD Program Compiler, an alternative to compiler autovectorization of scalar code or handwriting vector code with intrinsics. ispc is both a language and compiler that produces high quality code for SIMD CPU vector extensions such as Intel Streaming SIMD Extensions (SSE), Intel Advanced Vector Extensions (AVX), or ARM NEON. We present the results of compiling the same ispc source program for various targets. The performance of the resulting ispc versions is compared to that of scalar C++ code, and we also examine the scalability of the benchmarks when targeting wider vector units.","PeriodicalId":411100,"journal":{"name":"WPMVP '14","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134490474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
SIMDizing pairwise sums: a summation algorithm balancing accuracy with throughput
Pub Date: 2014-02-16 DOI: 10.1145/2568058.2568070
Barnaby Dalton, Amy Wang, Bob Blainey
Implementing summation when accuracy and throughput need to be balanced is a challenging endeavour. We present experimental results that give a sense of when to start worrying and of the expense of the various solutions that exist. We also present a new algorithm based on pairwise summation that achieves 89% of the throughput of the fastest summation algorithms when the data is not resident in L1 cache, while surpassing the accuracy of significantly slower compensated sums such as Kahan summation and Kahan-Babuska, which are typically used when accuracy is important.
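
As background, a minimal scalar sketch of plain pairwise summation is shown below: the input is split recursively and the halves are summed independently, which keeps rounding-error growth near O(log n) instead of the O(n) of a naive left-to-right sum, and the independent partial sums map naturally onto SIMD lanes. The base-case size is an arbitrary illustration, not the paper's tuning.

```cpp
#include <cstddef>

// Pairwise (cascade) summation: recursively sum the two halves of the array.
double pairwise_sum(const double* x, std::size_t n) {
    if (n <= 8) {                       // small base case: plain loop
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += x[i];
        return s;
    }
    std::size_t half = n / 2;           // split and sum halves independently
    return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
}
```
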
{"title":"SIMDizing pairwise sums: a summation algorithm balancing accuracy with throughput","authors":"Barnaby Dalton, Amy Wang, Bob Blainey","doi":"10.1145/2568058.2568070","DOIUrl":"https://doi.org/10.1145/2568058.2568070","url":null,"abstract":"Implementing summation when accuracy and throughput need to be balanced is a challenging endevour. We present experimental results that provide a sense when to start worrying and the expense of the various solutions that exist. We also present a new algorithm based on pairwise summation that achieves 89% of the throughput of the fastest summation algorithms when the data is not resident in L1 cache while eclipsing the accuracy of signifigantly slower compensated sums like Kahan summation and Kahan-Babuska that are typically used when accuracy is important.","PeriodicalId":411100,"journal":{"name":"WPMVP '14","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129920242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Simple, portable and fast SIMD intrinsic programming: generic simd library
Pub Date: 2014-02-16 DOI: 10.1145/2568058.2568059
Haichuan Wang, Peng Wu, Ilie Gabriel Tanase, M. Serrano, J. Moreira
Using SIMD (Single Instruction Multiple Data) is a cost-effective way to exploit data parallelism on modern processors. Most processor vendors today provide SIMD engines, such as AltiVec/VSX for POWER, SSE/AVX for Intel processors, and NEON for ARM. While high-level SIMD programming models are rapidly evolving, for many SIMD developers the most effective way to get performance out of SIMD is still to program directly with vendor-provided SIMD intrinsics. However, intrinsics programming is both tedious and error-prone and, worst of all, introduces non-portable code. This paper presents the Generic SIMD Library (https://github.com/genericsimd/generic_simd/), an open-source, portable C++ interface that provides an abstraction of short vectors and overloads most C/C++ operators for short vectors. The library provides several mappings from platform-specific intrinsics to the generic SIMD intrinsic interface, so that code developed with the library is portable across different SIMD platforms. We have evaluated the library with several applications from the multimedia, data analytics and math domains. Compared with platform-specific intrinsics code, using the Generic SIMD Library results in fewer lines of code (a 22% reduction on average) and achieves performance similar to the platform-specific intrinsics versions.
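
The sketch below illustrates the underlying idea: a short-vector type whose overloaded operators are backed by platform intrinsics chosen at compile time, with a scalar fallback. The type and names are invented for illustration and do not reproduce the Generic SIMD Library's actual interface.

```cpp
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// A 4-float vector whose operator+ dispatches to SSE when available and
// falls back to a portable scalar loop otherwise.
struct vec4f {
    float v[4];
    friend vec4f operator+(const vec4f& a, const vec4f& b) {
        vec4f r;
#if defined(__SSE__)
        _mm_storeu_ps(r.v, _mm_add_ps(_mm_loadu_ps(a.v), _mm_loadu_ps(b.v)));
#else
        for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];  // scalar fallback
#endif
        return r;
    }
};
```
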
{"title":"Simple, portable and fast SIMD intrinsic programming: generic simd library","authors":"Haichuan Wang, Peng Wu, Ilie Gabriel Tanase, M. Serrano, J. Moreira","doi":"10.1145/2568058.2568059","DOIUrl":"https://doi.org/10.1145/2568058.2568059","url":null,"abstract":"Using SIMD (Single Instruction Multiple Data) is a cost-effective way to explore data parallelism on modern processors. Most processor vendors today provide SIMD engines, such as Altivec/VSX for POWER, SSE/AVX for Intel processors, and NEON for ARM. While high-level SIMD programming models are rapidly evolving, for many SIMD developers, the most effective way to get the performance out of SIMD is still by programming directly via vendor-provided SIMD intrinsics. However, intrinsics programming is both tedious and error-prone, and worst of all, introduces non-portable codes.\u0000 This paper presents the Generic SIMD Library (https://github.com/genericsimd/generic_simd/), an open-source, portable C++ interface that provides an abstraction of short vectors and overloads most C/C++ operators for short vectors. The library provides several mappings from platform-specific intrinsics to the generic SIMD intrinsic interface so that codes developed based on the library are portable across different SIMD platforms.\u0000 We have evaluated the library with several applications from the multimedia, data analytics and math domains. Compared with platform-specific intrinsics codes, using Generic SIMD Library results in less line-of-code, a 22% reduction on average, and achieves similar performance as platform-specific intrinsics versions.","PeriodicalId":411100,"journal":{"name":"WPMVP '14","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128514666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
Vector seeker: a tool for finding vector potential
Pub Date: 2014-02-16 DOI: 10.1145/2568058.2568069
G. C. Evans, S. Abraham, B. Kuhn, D. Padua
The importance of vector instructions is growing in modern computers. Almost all architectures include some form of vector instructions, and these extensions tend to grow wider with newer designs. To take advantage of the performance that these systems offer, it is imperative that programs use these instructions, and yet they do not always do so. The tools that exploit these extensions require programmer assistance, either by hand coding or by providing hints to the compiler. We present Vector Seeker, a tool to help investigate vector parallelism in existing codes. Vector Seeker runs alongside the execution of a program to optimistically measure the vector parallelism that is present. Besides describing Vector Seeker, the paper also evaluates its effectiveness using two applications from the Petascale Application Collaboration Teams (PACT) and eight applications from Media Bench II. These results are compared to known results from manual vectorization studies. Finally, we use the tool to automatically analyze codes from Numerical Recipes and TSVC and compare the results with the automatic vectorization algorithms of Intel's ICC.
{"title":"Vector seeker: a tool for finding vector potential","authors":"G. C. Evans, S. Abraham, B. Kuhn, D. Padua","doi":"10.1145/2568058.2568069","DOIUrl":"https://doi.org/10.1145/2568058.2568069","url":null,"abstract":"The importance of vector instructions is growing in modern computers. Almost all architectures include some form of vector instructions and the tendency is for the size of the instructions to grow with newer designs. To take advantage of the performance that these systems offer, it is imperative that programs use these instructions, and yet they do not always do so. The tools to take advantage of these extensions require programmer assistance either by hand coding or providing hints to the compiler.\u0000 We present Vector Seeker, a tool to help investigate vector parallelism in existing codes. Vector Seeker runs with the execution of a program to optimistically measure the vector parallelism that is present. Besides describing Vector Seeker, the paper also evaluates its effectiveness using two applications from Petascale Application Collaboration Teams (PACT) and eight applications from Media Bench II. These results are compared to known results from manual vectorization studies. Finally, we use the tool to automatically analyze codes from Numerical Recipes and TSVC and then compare the results with the automatic vectorization algorithms of Intel's ICC.","PeriodicalId":411100,"journal":{"name":"WPMVP '14","volume":"149 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127515763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Comparing the performance of different x86 SIMD instruction sets for a medical imaging application on modern multi- and manycore chips
Pub Date: 2014-01-29 DOI: 10.1145/2568058.2568068
Johannes Hofmann, Jan Treibig, G. Hager, G. Wellein
Single Instruction, Multiple Data (SIMD) vectorization is a major driver of performance in current architectures and is mandatory for achieving good performance with codes that are limited by instruction throughput. We investigate the efficiency of different SIMD-vectorized implementations of the RabbitCT benchmark. RabbitCT performs 3D image reconstruction by back projection, a vital operation in computed tomography applications. The underlying algorithm is a challenge for vectorization because, apart from a streaming part, it also contains a bilinear interpolation requiring scattered access to image data. We analyze the performance of SSE (128 bit), AVX (256 bit), AVX2 (256 bit), and IMCI (512 bit) implementations on recent Intel x86 systems. A special emphasis is put on the vector gather implementation on the Intel Haswell and Knights Corner microarchitectures. Finally, we discuss why GPU implementations perform much better for this specific algorithm.
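
To make the vectorization obstacle concrete, here is a scalar sketch of a bilinear interpolation step: each output sample needs four neighbouring pixels at a data-dependent position, which is exactly the scattered (gather) access pattern mentioned above. Bounds handling is omitted; this is an illustration, not the RabbitCT reference implementation.

```cpp
#include <cmath>
#include <cstddef>

// Bilinear sample of a row-major image at a fractional coordinate (x, y),
// assumed to be in-bounds. The four neighbours are a data-dependent gather.
float bilinear(const float* img, std::size_t width, float x, float y) {
    std::size_t x0 = static_cast<std::size_t>(std::floor(x));
    std::size_t y0 = static_cast<std::size_t>(std::floor(y));
    float fx = x - static_cast<float>(x0);          // fractional offsets
    float fy = y - static_cast<float>(y0);
    const float* p = img + y0 * width + x0;         // four scattered neighbours
    float top    = p[0]     * (1.0f - fx) + p[1]         * fx;
    float bottom = p[width] * (1.0f - fx) + p[width + 1] * fx;
    return top * (1.0f - fy) + bottom * fy;
}
```
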
{"title":"Comparing the performance of different x86 SIMD instruction sets for a medical imaging application on modern multi- and manycore chips","authors":"Johannes Hofmann, Jan Treibig, G. Hager, G. Wellein","doi":"10.1145/2568058.2568068","DOIUrl":"https://doi.org/10.1145/2568058.2568068","url":null,"abstract":"Single Instruction, Multiple Data (SIMD) vectorization is a major driver of performance in current architectures, and is mandatory for achieving good performance with codes that are limited by instruction throughput. We investigate the efficiency of different SIMD-vectorized implementations of the RabbitCT benchmark. RabbitCT performs 3D image reconstruction by back projection, a vital operation in computed tomography applications. The underlying algorithm is a challenge for vectorization because it consists, apart from a streaming part, also of a bilinear interpolation requiring scattered access to image data. We analyze the performance of SSE (128 bit), AVX (256 bit), AVX2 (256 bit), and IMCI (512 bit) implementations on recent Intel x86 systems. A special emphasis is put on the vector gather implementation on Intel Haswell and Knights Corner microarchitectures. Finally we discuss why GPU implementations perform much better for this specific algorithm.","PeriodicalId":411100,"journal":{"name":"WPMVP '14","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132594387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 46