
Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems: Latest Publications

Exhaustive analysis of thread-level speculation
Clark Verbrugge, Christopher J. F. Pickett, Alexander Krolik, Allan Kielstra
Thread-level Speculation (TLS) is a technique for automatic parallelization. The complexity of even prototype implementations, however, limits the ability to explore and compare the wide variety of possible design choices, and also makes understanding performance characteristics difficult. In this work we build a general analytical model of the method-level variant of TLS which we can use for determining program speedup under a wide range of TLS designs. Our approach is exhaustive, and using either simple brute force or more efficient dynamic programming implementations we are able to show how performance is strongly limited by program structure, as well as core choices in speculation design, irrespective of and complementary to the impact of data-dependencies. These results provide new, high-level insight into where and how thread-level speculation can and should be applied in order to produce practical speedup.
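The abstract does not reproduce the analytical model itself, so the following C++ sketch is only a rough, hypothetical illustration of the kind of computation such a model might perform for method-level speculation: an idealized speedup estimate over a call tree, assuming unbounded speculative threads, in-order commit, zero speculation overhead, and no dependence violations. The Method structure, the timings, and the program shape are all illustrative assumptions, not taken from the paper.

```cpp
// Minimal sketch (not the paper's model): an idealized analytical estimate of
// method-level speculation speedup over a call tree. Assumptions: unbounded
// speculative threads, in-order commit, zero speculation overhead, and no
// dependence violations. All names and numbers are illustrative.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Method {
    // local[i] is the non-call work between call i-1 and call i, so the
    // invariant is local.size() == calls.size() + 1.
    std::vector<double> local;
    std::vector<const Method*> calls;

    double seq_time() const {
        double t = 0.0;
        for (double w : local) t += w;
        for (const Method* c : calls) t += c->seq_time();
        return t;
    }

    // At each call site the caller's continuation is forked speculatively, so
    // the callee and the rest of the caller run in parallel; the method is
    // done when its slowest in-flight piece commits.
    double spec_time() const {
        double t = local.back();  // work after the last call
        for (int i = static_cast<int>(calls.size()) - 1; i >= 0; --i)
            t = local[i] + std::max(calls[i]->spec_time(), t);
        return t;
    }
};

int main() {
    Method leaf{{4.0}, {}};
    Method mid{{1.0, 1.0}, {&leaf}};
    Method root{{1.0, 1.0, 1.0}, {&mid, &leaf}};
    std::printf("sequential: %.1f  speculative: %.1f  speedup: %.2f\n",
                root.seq_time(), root.spec_time(),
                root.seq_time() / root.spec_time());  // 13.0, 6.0, 2.17
}
```

Even under these generous assumptions the recursion shows how program structure alone bounds the achievable speedup: a dominant callee or a long tail of local work after the last call caps the benefit regardless of data dependences, which is the kind of structural limit the abstract refers to.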
{"title":"Exhaustive analysis of thread-level speculation","authors":"Clark Verbrugge, Christopher J. F. Pickett, Alexander Krolik, Allan Kielstra","doi":"10.1145/3002125.3002127","DOIUrl":"https://doi.org/10.1145/3002125.3002127","url":null,"abstract":"Thread-level Speculation (TLS) is a technique for automatic parallelization. The complexity of even prototype implementations, however, limits the ability to explore and compare the wide variety of possible design choices, and also makes understanding performance characteristics difficult. In this work we build a general analytical model of the method-level variant of TLS which we can use for determining program speedup under a wide range of TLS designs. Our approach is exhaustive, and using either simple brute force or more efficient dynamic programming implementations we are able to show how performance is strongly limited by program structure, as well as core choices in speculation design, irrespective of and complementary to the impact of data-dependencies. These results provide new, high-level insight into where and how thread-level speculation can and should be applied in order to produce practical speedup.","PeriodicalId":106508,"journal":{"name":"Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131657640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A divide-and-conquer parallel pattern implementation for multicores
M. Danelutto, T. D. Matteis, G. Mencagli, M. Torquati
Divide-and-Conquer (DaC) is a sequential programming paradigm which models a large class of algorithms used in real-life applications. Although it is suitable for extracting parallelism in a straightforward way, the parallel implementation of DaC algorithms still requires the programmer to have some expertise in parallel programming tools. In this paper we aim at providing non-expert programmers with a high-level solution for fast prototyping of parallel DaC programs on multicores with minimal programming effort. Following the rationale of the parallel design pattern methodology, we design a C++11-compliant template interface for developing parallel DaC programs. The interface is implemented using different back-end frameworks (i.e. OpenMP, Intel TBB and FastFlow), supporting source code reuse and a certain amount of performance portability. Experiments on a 24-core Intel server show the effectiveness of our approach: with a reduced programming effort the programmer easily prototypes parallel versions with performance comparable to hand-written parallelizations.
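The paper's actual template interface is not reproduced in the abstract; the sketch below is a minimal, hypothetical illustration of what a C++11 divide-and-conquer skeleton parameterized by divide, combine, base-test, and base-case functors might look like, with std::async standing in for the OpenMP, Intel TBB, and FastFlow back-ends mentioned above. The function name, signature, and cut-off depth are illustrative assumptions, not the paper's API.

```cpp
// Minimal sketch (not the paper's interface): a generic C++11 DaC skeleton.
// std::async stands in for the OpenMP / Intel TBB / FastFlow back-ends; the
// cut-off depth is a simple grain-control knob after which recursion is
// sequential. All names are illustrative.
#include <cstddef>
#include <future>
#include <iostream>
#include <vector>

template <typename Problem, typename Result,
          typename Divide, typename Combine, typename IsBase, typename BaseCase>
Result divide_and_conquer(const Problem& p, Divide divide, Combine combine,
                          IsBase is_base, BaseCase base_case, int depth = 3) {
    if (is_base(p)) return base_case(p);
    std::vector<Problem> subs = divide(p);
    std::vector<Result> partial(subs.size());
    if (depth > 0) {
        // Spawn one asynchronous task per sub-problem down to the cut-off.
        std::vector<std::future<Result>> futs;
        for (const Problem& s : subs)
            futs.push_back(std::async(std::launch::async, [&, s] {
                return divide_and_conquer<Problem, Result>(
                    s, divide, combine, is_base, base_case, depth - 1);
            }));
        for (std::size_t i = 0; i < futs.size(); ++i) partial[i] = futs[i].get();
    } else {
        // Below the cut-off: plain sequential recursion.
        for (std::size_t i = 0; i < subs.size(); ++i)
            partial[i] = divide_and_conquer<Problem, Result>(
                subs[i], divide, combine, is_base, base_case, 0);
    }
    return combine(partial);
}

int main() {
    using Vec = std::vector<long>;
    Vec v(1 << 20, 1);
    long sum = divide_and_conquer<Vec, long>(
        v,
        [](const Vec& x) {  // divide: split in half (copies kept for brevity)
            return std::vector<Vec>{Vec(x.begin(), x.begin() + x.size() / 2),
                                    Vec(x.begin() + x.size() / 2, x.end())};
        },
        [](const std::vector<long>& r) { return r[0] + r[1]; },  // combine
        [](const Vec& x) { return x.size() <= 1024; },            // base test
        [](const Vec& x) {  // base case: sequential sum
            long s = 0;
            for (long e : x) s += e;
            return s;
        });
    std::cout << "sum = " << sum << "\n";  // expected: 1048576
}
```

The cut-off depth plays the role of grain control: beyond roughly 2^depth concurrent tasks the skeleton falls back to sequential recursion, which is one common way such patterns avoid oversubscribing the cores.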
{"title":"A divide-and-conquer parallel pattern implementation for multicores","authors":"M. Danelutto, T. D. Matteis, G. Mencagli, M. Torquati","doi":"10.1145/3002125.3002128","DOIUrl":"https://doi.org/10.1145/3002125.3002128","url":null,"abstract":"Divide-and-Conquer (DaC) is a sequential programming paradigm which models a large class of algorithms used in real-life applications. Although suitable to extract parallelism in a straightforward way, the parallel implementation of DaC algorithms still requires some expertise in parallel programming tools by the programmer. In this paper we aim at providing to non-expert programmers a high-level solution for fast prototyping parallel DaC programs on multicores with minimal programming effort. Following the rationale of parallel design pattern methodology, we design a C++11-compliant template interface for developing parallel DaC programs. The interface is implemented using different back-end frameworks (i.e. OpenMP, Intel TBB and FastFlow) supporting source code reuse and a certain amount of performance portability. Experiments on a 24-core Intel server show the effectiveness of our approach: with a reduced programming effort the programmer easily prototypes parallel versions with performance comparable with hand-made parallelizations.","PeriodicalId":106508,"journal":{"name":"Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133822339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Parallel evaluation of a DSP algorithm using julia
Pjotr Kourzanov
The rapid pace of innovation in industrial research labs requires fast algorithm evaluation cycles. The use of multi-core hardware and distributed clusters is essential to achieve reasonable turnaround times for high-load simulations. Julia's support for both, as well as its pervasive multiple dispatch, makes it very attractive for high-performance technical computing. Our experiments in speeding up a Digital Signal Processing (DSP) Intellectual Property (IP) model simulation for a Wireless LAN (WLAN) product confirm this. We augment the standard SystemC High-Level Synthesis (HLS) tool-flow with an interactive worksheet supporting performance visualization and rapid design space exploration cycles.
{"title":"Parallel evaluation of a DSP algorithm using julia","authors":"Pjotr Kourzanov","doi":"10.1145/3002125.3002126","DOIUrl":"https://doi.org/10.1145/3002125.3002126","url":null,"abstract":"Rapid pace of innovation in industrial research labs requires fast algorithm evaluation cycles. The use of multi-core hardware and distributed clusters is essential to achieve reasonable turnaround times for high-load simulations. Julia’s support for these as well as its pervasive multiple dispatch make it very attractive for high-performance technical computing. Our experiments in speeding up a Digital Signal Processing (DSP) Intellectual Property (IP) model simulation for a Wireless LAN (WLAN) product confirm this. We augment standard SystemC High-Level Synthesis (HLS) tool-flow by an interactive worksheet supporting performance visualization and rapid design space exploration cycles.","PeriodicalId":106508,"journal":{"name":"Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130673770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Reducing parallelizing compilation time by removing redundant analysis
Jixin Han, Rina Fujino, Ryota Tamura, Mamoru Shimaoka, Hiroki Mikami, M. Takamura, Sachio Kamiya, Kazuhiko Suzuki, Takahiro Miyajima, K. Kimura, H. Kasahara
Parallelizing compilers equipped with powerful compiler optimizations are essential tools to fully exploit the performance of today's computer systems. These optimizations are supported by both highly sophisticated program analysis techniques and aggressive program restructuring techniques. However, the compilation time of such powerful compilers grows ever larger for real commercial applications because of these strong program analysis techniques. In this paper, we propose a compilation-time reduction technique for parallelizing compilers. The basic idea is based on the observation that parallelizing compilers apply multiple program analysis passes and restructuring passes to a source program, but not every analysis pass has to be applied to the whole source program. Thus, there is an opportunity to reduce compilation time by removing redundant program analysis. We describe techniques for removing redundant program analysis that take into account the inter-procedural propagation of analysis update information. We implement the proposed technique in the OSCAR automatic multigrain parallelizing compiler and evaluate it on three proprietary large-scale programs. The proposed technique removes, on average, 37.7% of program analysis time for basic analyses, including def-use analysis and dependence calculation, and 51.7% for pointer analysis.
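The OSCAR implementation is not shown in the abstract; the following C++ sketch is a hypothetical illustration of the underlying idea only: cache per-procedure analysis results, let a restructuring pass mark the procedures it modified, propagate that mark to callers over the call graph (a stand-in for the inter-procedural propagation of analysis update information), and re-run the expensive analysis only where it is still needed. All class and function names are illustrative, not taken from the compiler.

```cpp
// Minimal sketch (not the OSCAR implementation): skip re-analysis of
// procedures whose analysis inputs have not changed since the last pass.
// A restructuring pass marks what it rewrites; dirtiness is propagated to
// callers so their summaries are recomputed too. All names are illustrative.
#include <iostream>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

struct Procedure {
    std::string name;
    std::vector<std::string> callers;  // reverse call-graph edges
};

class AnalysisCache {
public:
    void mark_modified(const std::string& proc) { dirty_.insert(proc); }

    // Propagate "needs re-analysis" from modified procedures to their callers.
    void propagate(const std::unordered_map<std::string, Procedure>& cg) {
        std::vector<std::string> work(dirty_.begin(), dirty_.end());
        while (!work.empty()) {
            std::string p = work.back();
            work.pop_back();
            for (const std::string& caller : cg.at(p).callers)
                if (dirty_.insert(caller).second) work.push_back(caller);
        }
    }

    // Run the (expensive) analysis only where required; reuse cached results
    // everywhere else.
    template <typename Analysis>
    void analyze(const std::unordered_map<std::string, Procedure>& cg,
                 Analysis run_analysis) {
        for (const auto& kv : cg) {
            if (dirty_.count(kv.first) || !results_.count(kv.first))
                results_[kv.first] = run_analysis(kv.second);
            // otherwise the cached summary is still valid and analysis is skipped
        }
        dirty_.clear();
    }

    const std::string& result(const std::string& proc) const {
        return results_.at(proc);
    }

private:
    std::set<std::string> dirty_;
    std::unordered_map<std::string, std::string> results_;  // per-proc summaries
};

int main() {
    std::unordered_map<std::string, Procedure> cg = {
        {"main", {"main", {}}},
        {"f",    {"f", {"main"}}},
        {"g",    {"g", {"f"}}},
        {"h",    {"h", {"main"}}}};

    AnalysisCache cache;
    auto def_use = [](const Procedure& p) { return "def-use summary of " + p.name; };

    cache.analyze(cg, def_use);  // first pass: analyzes main, f, g and h
    cache.mark_modified("g");    // a restructuring pass rewrote g
    cache.propagate(cg);         // g's callers f and main become dirty as well
    cache.analyze(cg, def_use);  // re-analyzes g, f, main; h is skipped
    std::cout << cache.result("f") << "\n";
}
```

In a real compiler the "summary" would be def-use sets, dependence information, or points-to data rather than a string, and the modification marks would come from the restructuring passes themselves.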
{"title":"Reducing parallelizing compilation time by removing redundant analysis","authors":"Jixin Han, Rina Fujino, Ryota Tamura, Mamoru Shimaoka, Hiroki Mikami, M. Takamura, Sachio Kamiya, Kazuhiko Suzuki, Takahiro Miyajima, K. Kimura, H. Kasahara","doi":"10.1145/3002125.3002129","DOIUrl":"https://doi.org/10.1145/3002125.3002129","url":null,"abstract":"Parallelizing compilers equipped with powerful compiler optimizations are essential tools to fully exploit performance from today's computer systems. These optimizations are supported by both highly sophisticated program analysis techniques and aggressive program restructuring techniques. However, the compilation time for such powerful compilers becomes larger and larger for real commercial application due to these strong program analysis techniques. In this paper, we propose a compilation time reduction technique for parallelizing compilers. The basic idea of the proposed technique is based on an observation that parallelizing compilers applies multiple program analysis passes and restructuring passes to a source program but all program analysis passes do not have to be applied to the whole source program. Thus, there is an opportunity for compilation time reduction by removing redundant program analysis. We describe the removing redundant program analysis techniques considering the inter-procedural propagation of annalysis update information in this paper. We implement the proposed technique into OSCAR automatically multigrain parallelizing compiler. We then evaluate the proposed technique by using three proprietary large scale programs. The proposed technique can remove 37.7% of program analysis time on average for basic analysis includes def-use analysis and dependence calculation, and 51.7% for pointer analysis, respectively.","PeriodicalId":106508,"journal":{"name":"Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128750301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems
A. Jannesari, Yukinori Sato, Stefan Winter
{"title":"Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems","authors":"A. Jannesari, Yukinori Sato, Stefan Winter","doi":"10.1145/3002125","DOIUrl":"https://doi.org/10.1145/3002125","url":null,"abstract":"","PeriodicalId":106508,"journal":{"name":"Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125998720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0