
Proceedings of the 2018 International Symposium on Code Generation and Optimization: Latest Publications

Lightweight detection of cache conflicts
Probir Roy, S. Song, S. Krishnamoorthy, Xu Liu
In memory hierarchies, caches play an important role in reducing average memory access latency, and minimizing cache misses can yield significant performance gains. As set-associative caches are widely used in modern architectures, capacity and conflict cache misses co-exist, and these two types of cache misses require different optimization strategies. While cache misses are commonly studied using cache simulators, state-of-the-art simulators usually slow a program's execution by factors of hundreds to thousands. Moreover, simulators have difficulty modeling complex real hardware. To overcome these limitations, measurement methods have been proposed that directly monitor program execution on real hardware via performance monitoring units. However, existing measurement-based tools either focus on capacity cache misses or do not distinguish capacity from conflict cache misses. In this paper, we design and implement CCProf, a lightweight measurement-based profiler that identifies conflict cache misses and associates them with program source code and data structures. CCProf incurs moderate runtime overhead that is at least an order of magnitude lower than that of simulators. In an evaluation on a number of representative programs, CCProf guides optimizations of cache conflict misses and obtains nontrivial speedups.
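The capacity/conflict distinction the abstract relies on can be illustrated with a toy model (this is not CCProf's measurement-based algorithm): by the classic definition, a miss in a direct-mapped cache that would have hit in a fully-associative LRU cache of the same capacity is a conflict miss. A minimal Python sketch with a made-up address trace:

```python
from collections import OrderedDict

def classify_misses(addresses, num_sets, line_size=1):
    """Toy classifier: a miss in a direct-mapped cache that would have hit
    in a fully-associative LRU cache of equal capacity is a conflict miss;
    the rest are capacity or cold misses."""
    direct = {}          # set index -> cached line tag (direct-mapped)
    lru = OrderedDict()  # fully-associative LRU with the same capacity
    conflict = capacity_or_cold = 0
    for addr in addresses:
        line = addr // line_size
        set_idx, tag = line % num_sets, line // num_sets
        hit_fa = line in lru              # would the ideal cache have hit?
        if hit_fa:
            lru.move_to_end(line)
        else:
            lru[line] = True
            if len(lru) > num_sets:       # both caches hold num_sets lines
                lru.popitem(last=False)
        if direct.get(set_idx) == tag:    # direct-mapped hit: nothing to count
            continue
        direct[set_idx] = tag
        if hit_fa:
            conflict += 1                 # only the set mapping caused it
        else:
            capacity_or_cold += 1
    return conflict, capacity_or_cold

# Lines 0 and 8 alias to set 0 of an 8-set cache and thrash each other:
print(classify_misses([0, 8, 0, 8, 0, 8], num_sets=8))  # (4, 2)
```

The thrashing trace shows the pattern CCProf targets: the working set fits in the cache, yet conflict misses dominate because two lines map to the same set.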
Citations: 9
Poker: permutation-based SIMD execution of intensive tree search by path encoding
Feng Zhang, Jingling Xue
We propose POKER, a permutation-based vectorization approach for vectorizing multiple queries over B+-trees. Our key insight is to combine vector loads and path-encoding-based permutations to alleviate memory latency while keeping the number of key comparisons needed for a query to a minimum. Implemented as a C++ template library, POKER represents a general-purpose solution for vectorizing queries over indexing trees on multi-core processors equipped with SIMD units. For a set of five representative benchmarks, each evaluated with 24 configurations, POKER outperforms the state-of-the-art by 2.11x with a single thread and by 2.28x with eight threads, on average, on an Intel Broadwell processor that supports 256-bit AVX2.
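POKER's path-encoding permutations are not reproduced here, but the lockstep idea behind SIMD tree search, where all lanes advance one comparison level per round, can be sketched in plain Python over a sorted key array (a stand-in for the tree); the keys and queries are illustrative:

```python
def batched_search(keys, queries):
    """Advance all queries in lockstep, one comparison level per round,
    mimicking how SIMD lanes move together through a search tree."""
    lo = [0] * len(queries)
    hi = [len(keys)] * len(queries)
    for _ in range(len(keys).bit_length()):   # enough rounds for any lane
        for i, q in enumerate(queries):       # one "SIMD" round across lanes
            if lo[i] < hi[i]:
                mid = (lo[i] + hi[i]) // 2
                if keys[mid] < q:
                    lo[i] = mid + 1
                else:
                    hi[i] = mid
    return [keys[lo[i]] if lo[i] < len(keys) and keys[lo[i]] == q else None
            for i, q in enumerate(queries)]

keys = [1, 3, 5, 7, 9, 11, 13, 15]
print(batched_search(keys, [5, 8, 15, 0]))  # [5, None, 15, None]
```

Each outer iteration corresponds to one vectorized step in which every lane performs the same comparison work, which is what keeps SIMD utilization high.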
Citations: 1
May-happen-in-parallel analysis with static vector clocks
Qing Zhou, Lian Li, Lei Wang, Jingling Xue, Xiaobing Feng
May-Happen-in-Parallel (MHP) analysis computes whether two statements in a multi-threaded program may execute concurrently or not. It works as a basis for many analyses and optimization techniques of concurrent programs. This paper proposes a novel approach for MHP analysis, by statically computing vector clocks. Static vector clocks extend the classic vector clocks algorithm to handle the complex control flow structures in static analysis, and we have developed an efficient context-sensitive algorithm to compute them. To the best of our knowledge, this is the first attempt to compute vector clocks statically. Using static vector clocks, we can drastically improve the efficiency of existing MHP analyses, without loss of precision: the performance speedup can be up to 1828X, with a much smaller memory footprint (reduced by up to 150X). We have implemented our analysis in a static data race detector, and experimental results show that our MHP analysis can help remove up to 88% of spurious data race pairs.
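Given vector clocks for two statements (however they are computed), the MHP query itself is a simple incomparability check: a happens before b iff a's clock is componentwise less than or equal to b's and the clocks differ. A minimal sketch, assuming clocks are plain tuples indexed by thread:

```python
def happens_before(va, vb):
    """a -> b iff a's clock is componentwise <= b's and the clocks differ."""
    return all(x <= y for x, y in zip(va, vb)) and va != vb

def may_happen_in_parallel(va, vb):
    """Two statements may execute concurrently iff neither ordering holds,
    i.e. their vector clocks are incomparable."""
    return not happens_before(va, vb) and not happens_before(vb, va)

# Thread 0's second step vs. thread 1's first step after a fork: unordered.
print(may_happen_in_parallel((2, 0), (1, 1)))   # True
# A statement vs. one it transitively precedes: ordered.
print(may_happen_in_parallel((1, 0), (2, 1)))   # False
```

The paper's contribution lies in computing such clocks statically and context-sensitively; the query above is the standard dynamic-vector-clock comparison it builds on.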
Citations: 9
Dominance-based duplication simulation (DBDS): code duplication to enable compiler optimizations
David Leopoldseder, Lukas Stadler, Thomas Würthinger, J. Eisl, Doug Simon, H. Mössenböck
Compilers perform a variety of advanced optimizations to improve the quality of the generated machine code. However, optimizations that depend on the data flow of a program are often limited by control-flow merges. Code duplication can solve this problem by hoisting, i.e. duplicating, instructions from merge blocks to their predecessors. However, finding optimization opportunities enabled by duplication is a non-trivial task that requires compile-time-intensive analysis. This imposes a challenge on modern (just-in-time) compilers: Duplicating instructions tentatively at every control-flow merge is not feasible because excessive duplication leads to uncontrolled code growth and compile-time increases. Therefore, compilers need to find out whether a duplication is beneficial enough to be performed. This paper proposes a novel approach to determine which duplication operations should be performed to increase performance. The approach is based on a duplication simulation that enables a compiler to evaluate different success metrics per potential duplication. Using this information, the compiler can then select the most promising candidates for optimization. We show how to map duplication candidates into an optimization cost model that allows us to trade off between different success metrics including peak performance, code size and compile time. We implemented the approach on top of the GraalVM and evaluated it with the benchmarks Java DaCapo, Scala DaCapo, JavaScript Octane and a micro-benchmark suite, in terms of performance, compilation time and code size increase. We show that our optimization can reach peak performance improvements of up to 40% with a mean peak performance increase of 5.89%, while it generates a mean code size increase of 9.93% and mean compile time increase of 18.44%.
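The enabling effect of duplication can be seen in a toy example, far simpler than the paper's simulation-based cost model: a shared merge block computing `x * 2 + 1` cannot be constant-folded because `x` is a phi of its predecessors, but per-predecessor copies each see a constant and fold to a literal. A Python sketch with hypothetical values:

```python
# Shared merge block: the value x flows in from two predecessors via a phi,
# so `x * 2 + 1` must remain a runtime computation.
def merge_block(x):
    return x * 2 + 1

def original(cond):
    return merge_block(3 if cond else 5)

# After duplication, each predecessor owns a copy of the block with a known
# constant input, which a compiler folds to the literals 7 and 11.
FOLDED = {True: 7, False: 11}

def duplicated(cond):
    return FOLDED[cond]

assert all(original(c) == duplicated(c) for c in (True, False))
```

DBDS's job is to predict, before committing to the code growth, that this particular duplication unlocks the fold while others would not.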
Citations: 36
CVR: efficient vectorization of SpMV on x86 processors
Biwei Xie, Jianfeng Zhan, Xu Liu, Wanling Gao, Zhen Jia, Xiwen He, Lixin Zhang
Sparse Matrix-vector Multiplication (SpMV) is an important computation kernel widely used in HPC and data centers. The irregularity of SpMV is a well-known challenge that limits SpMV's parallelism under vectorization. Existing work achieves limited locality and vectorization efficiency with large preprocessing overheads. To address this issue, we present the Compressed Vectorization-oriented sparse Row (CVR), a novel SpMV representation targeting efficient vectorization. CVR simultaneously processes multiple rows of the input matrix to increase cache efficiency and separates them into multiple SIMD lanes so as to take advantage of the vector processing units in modern processors. Our method is insensitive to the sparsity and irregularity of SpMV, and is thus able to deal with various scale-free and HPC matrices. We implement and evaluate CVR on an Intel Knights Landing processor and compare it with five state-of-the-art approaches using 58 scale-free and HPC sparse matrices. Experimental results show that CVR achieves speedups of up to 1.70× (1.33× on average) and up to 1.57× (1.10× on average) over the best existing approaches for scale-free and HPC sparse matrices, respectively. Moreover, CVR typically incurs the lowest preprocessing overhead compared with state-of-the-art approaches.
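For background, the baseline that CVR reorganizes is the classic CSR SpMV kernel, whose indirect gather `x[col_idx[k]]` is the irregular access described above. A plain-Python reference version (not the CVR layout itself), with a small made-up matrix:

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for A in CSR form; the gather x[col_idx[k]] is the
    irregular access that vectorization-oriented layouts must reorganize."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# A = [[10, 0, 2],
#      [ 0, 3, 0],
#      [ 0, 0, 4]]
vals, cols, rowptr = [10.0, 2.0, 3.0, 4.0], [0, 2, 1, 2], [0, 2, 3, 4]
print(spmv_csr(vals, cols, rowptr, [1.0, 1.0, 1.0]))  # [12.0, 3.0, 4.0]
```

Because row lengths vary, naively assigning one row per SIMD lane leaves lanes idle; CVR's row interleaving is designed to keep all lanes busy regardless of row-length skew.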
Citations: 62
Resilient decentralized Android application repackaging detection using logic bombs
Qiang Zeng, Lannan Luo, Zhiyun Qian, Xiaojiang Du, Zhoujun Li
Application repackaging is a severe threat to Android users and the market. Existing countermeasures mostly detect repackaging based on app similarity measurement and rely on a central party to perform detection, which is unscalable and imprecise. We instead consider building the detection capability into apps, such that user devices are leveraged to detect repackaging in a decentralized fashion. The main challenge is how to protect the repackaging detection code from attacks. We propose a creative use of logic bombs, which are regularly used in malware, to overcome this challenge. A novel bomb structure is invented and used: the trigger conditions are constructed to exploit the differences between the attacker and users, such that a bomb that lies dormant on the attacker side will be activated on one of the user devices, while the repackaging detection code, which is packed as the bomb payload, is kept inactive until the trigger conditions are satisfied. Moreover, the repackaging detection code is woven into the original app code and encrypted; thus, attacks that modify or delete suspicious code will corrupt the app itself. We have implemented a prototype, named BombDroid, that builds repackaging detection into apps through bytecode instrumentation, and the evaluation shows that the technique is effective, efficient, and resilient to various adversary analyses including symbolic execution, multi-path exploration, and program slicing.
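The core bomb mechanism, a payload that stays unreadable until the trigger holds, can be sketched by deriving the decryption key from the trigger value itself, so only an environment observing the expected value recovers the detection code. A simplified Python illustration (BombDroid's actual construction operates on Dalvik bytecode and is more involved; the device fingerprint and payload string here are hypothetical):

```python
import hashlib

def xor_stream(data: bytes, key: bytes) -> bytes:
    """Toy stream cipher: XOR data against a repeating key."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def make_bomb(payload: bytes, trigger_value: str) -> bytes:
    """Encrypt the payload under a key derived from the trigger value, so the
    payload is opaque to static analysis of the shipped app."""
    return xor_stream(payload, hashlib.sha256(trigger_value.encode()).digest())

def try_detonate(blob: bytes, observed_value: str) -> bytes:
    """Decryption succeeds only where the trigger condition actually holds."""
    return xor_stream(blob, hashlib.sha256(observed_value.encode()).digest())

blob = make_bomb(b"run_repackaging_check()", "user-device-42")
assert try_detonate(blob, "user-device-42") == b"run_repackaging_check()"
```

An analyst who never supplies the right trigger value, e.g. one running the app in an emulator, sees only ciphertext, which is what defeats the code-inspection attacks listed above.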
Citations: 27
CUDAAdvisor: LLVM-based runtime profiling for modern GPUs
Du Shen, S. Song, Ang Li, Xu Liu
General-purpose GPUs have been widely utilized to accelerate parallel applications. Given a relatively complex programming model and fast architecture evolution, producing efficient GPU code is nontrivial. A variety of simulation and profiling tools have been developed to aid GPU application optimization and architecture design. However, existing tools are either limited by insufficient insights or lacking in support across different GPU architectures, runtime and driver versions. This paper presents CUDAAdvisor, a profiling framework to guide code optimization in modern NVIDIA GPUs. CUDAAdvisor performs various fine-grained analyses based on the profiling results from GPU kernels, such as memory-level analysis (e.g., reuse distance and memory divergence), control flow analysis (e.g., branch divergence) and code-/data-centric debugging. Unlike prior tools, CUDAAdvisor supports GPU profiling across different CUDA versions and architectures, including CUDA 8.0 and Pascal architecture. We demonstrate several case studies that derive significant insights to guide GPU code optimization for performance improvement.
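One of the memory-level metrics mentioned, reuse distance, has a compact definition: the number of distinct addresses touched between two consecutive accesses to the same address. A reference implementation in Python (CUDAAdvisor derives this from instrumented GPU kernel traces; the trace here is made up):

```python
def reuse_distances(trace):
    """Reuse distance of an access: number of distinct addresses touched
    since the previous access to the same address (inf on first use)."""
    out, last_seen = [], {}
    for i, addr in enumerate(trace):
        if addr in last_seen:
            out.append(len(set(trace[last_seen[addr] + 1 : i])))
        else:
            out.append(float("inf"))
        last_seen[addr] = i
    return out

print(reuse_distances(["a", "b", "c", "a", "b"]))  # [inf, inf, inf, 2, 2]
```

Small reuse distances indicate cache-friendly access; large ones flag the kernels where data-layout or tiling changes pay off.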
Citations: 37
Biological computation (keynote)
Sara-Jane Dunn
Unlike engineered systems, living cells self-generate, self-organise and self-repair, they undertake massively parallel operations with slow and noisy components in a noisy environment, they sense and actuate at molecular scales, and most intriguingly, they blur the line between software and hardware. Understanding this biological computation presents a huge challenge to the scientific community. Yet the ultimate destination and prize at the culmination of this scientific journey is the promise of revolutionary and transformative technology: the rational design and implementation of biological function, or more succinctly, the ability to program life.
Citations: 0
Synthesizing programs that expose performance bottlenecks
Luca Della Toffola, Michael Pradel, T. Gross
Software often suffers from performance bottlenecks, e.g., because some code has a higher computational complexity than expected or because a code change introduces a performance regression. Finding such bottlenecks is challenging for developers and for profiling techniques because both rely on performance tests to execute the software, which are often not available in practice. This paper presents PerfSyn, an approach for synthesizing test programs that expose performance bottlenecks in a given method under test. The basic idea is to repeatedly mutate a program that uses the method to systematically increase the amount of work done by the method. We formulate the problem of synthesizing a bottleneck-exposing program as a combinatorial search and show that it can be effectively and efficiently addressed using well known graph search algorithms. We evaluate the approach with 147 methods from seven Java code bases. PerfSyn automatically synthesizes test programs that expose 22 bottlenecks. The bottlenecks are due to unexpectedly high computational complexity and due to performance differences between different versions of the same code.
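PerfSyn's graph-search formulation is not reproduced here, but the underlying signal, work growing super-linearly as the synthesized usage of a method grows, can be sketched with an operation counter and a crude doubling test. The method names, counter protocol, and threshold are illustrative assumptions, not PerfSyn's API:

```python
def count_ops(method, n):
    """Run `method` on a size-n workload and return its operation count."""
    counter = {"ops": 0}
    method(n, counter)
    return counter["ops"]

def looks_superlinear(method, sizes=(100, 200, 400)):
    """Doubling the input of a linear method roughly doubles the work; a
    per-doubling growth ratio well above 2 hints at a complexity bottleneck."""
    counts = [count_ops(method, n) for n in sizes]
    return all(b / a > 3 for a, b in zip(counts, counts[1:]))

def quadratic_method(n, c):          # hypothetical method under test
    for _ in range(n):
        for _ in range(n):
            c["ops"] += 1

def linear_method(n, c):
    for _ in range(n):
        c["ops"] += 1

print(looks_superlinear(quadratic_method), looks_superlinear(linear_method))
# True False
```

PerfSyn automates the hard part that this sketch assumes away: finding the sequence of program mutations that makes the workload grow in the first place.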
Cited by: 28
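The mutate-and-search loop the PerfSyn abstract describes can be sketched as a plain breadth-first search over candidate programs. This is a simplified, illustrative analogue (all names are hypothetical): the real system mutates Java programs, measures actual executions, and uses richer, feedback-guided search rather than exhaustive BFS.

```python
from collections import deque

def synthesize_bottleneck(seed, mutate, cost, budget=200):
    """Search over program mutations, keeping the candidate that
    maximizes the measured cost of the method under test.
    seed   -- initial program that calls the method under test
    mutate -- returns mutated variants of a program
    cost   -- measured work done by the method (e.g., running time)
    budget -- cap on the number of candidates explored"""
    best, best_cost = seed, cost(seed)
    frontier = deque([seed])
    explored = 0
    while frontier and explored < budget:
        prog = frontier.popleft()
        for cand in mutate(prog):   # apply each mutation operator
            explored += 1
            c = cost(cand)          # run the candidate, measure the method
            if c > best_cost:
                best, best_cost = cand, c
            frontier.append(cand)
            if explored >= budget:
                break
    return best, best_cost
```

As a toy usage, treating a "program" as a growing call sequence and its length as the cost, the search monotonically grows the work performed until the exploration budget is exhausted.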
High performance stencil code generation with Lift
Bastian Hagedorn, Larisa Stoltzfus, Michel Steuwer, S. Gorlatch, Christophe Dubach
Stencil computations are widely used from physical simulations to machine-learning. They are embarrassingly parallel and perfectly fit modern hardware such as Graphic Processing Units. Although stencil computations have been extensively studied, optimizing them for increasingly diverse hardware remains challenging. Domain Specific Languages (DSLs) have raised the programming abstraction and offer good performance. However, this places the burden on DSL implementers who have to write almost full-fledged parallelizing compilers and optimizers. Lift has recently emerged as a promising approach to achieve performance portability and is based on a small set of reusable parallel primitives that DSL or library writers can build upon. Lift’s key novelty is in its encoding of optimizations as a system of extensible rewrite rules which are used to explore the optimization space. However, Lift has mostly focused on linear algebra operations and it remains to be seen whether this approach is applicable for other domains. This paper demonstrates how complex multidimensional stencil code and optimizations such as tiling are expressible using compositions of simple 1D Lift primitives. By leveraging existing Lift primitives and optimizations, we only require the addition of two primitives and one rewrite rule to do so. Our results show that this approach outperforms existing compiler approaches and hand-tuned codes.
DOI: 10.1145/3168824 · Published: 2018-02-24
Cited by: 4
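The Lift abstract's claim that stencils are expressible as compositions of simple 1D primitives can be illustrated with a plain-Python analogue. The names follow the paper's `slide`/`map` vocabulary, but this is only a sketch of the functional semantics; Lift itself compiles such compositions, via rewrite rules, to optimized GPU code.

```python
def slide(size, step, xs):
    """Lift-style slide primitive: overlapping windows of length
    `size`, advancing by `step` elements."""
    return [xs[i:i + size] for i in range(0, len(xs) - size + 1, step)]

def stencil_1d(f, size, xs):
    """A 1D stencil is map(f) composed with slide(size, 1)."""
    return [f(window) for window in slide(size, 1, xs)]

# Three-point averaging stencil on a zero-padded input.
data = [0] + [1, 2, 3, 4, 5] + [0]
out = stencil_1d(lambda w: sum(w) / len(w), 3, data)
# out == [1.0, 2.0, 3.0, 4.0, 3.0]
```

Multidimensional stencils then follow by nesting: sliding along each dimension in turn yields the 2D neighborhoods that a tiled GPU implementation operates on, which is the composition the paper exploits.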