
Proceedings of the ACM International Conference on Computing Frontiers: Latest Papers

A non von Neumann continuum computer architecture for scalability beyond Moore's law
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903486
M. Brodowicz, T. Sterling
A strategic challenge confronting the continued advance of high performance computing (HPC) to extreme scale is the approaching near-nanoscale semiconductor technology and the end of Moore's Law. This paper introduces the foundations of an innovative class of parallel architecture reversing many of the conventional architecture directions, but benefiting from substantial prior art of previous decades. The Continuum Computer Architecture, or CCA, eschews traditional von Neumann-derived processing logic, instead employing structures composed of fine-grain cells (fontons) that combine functional units, memory, and network. The paper describes how CCA systems of various scales may be organized and implemented using currently available technology. As programming of such systems substantially differs from established practices, a still experimental ParalleX execution model is introduced to be used as a guide for the implementation of related software stack layers, ranging from the operating system to application level constructs. Finally, the HPX-5 runtime system, an advanced implementation of ParalleX core components, is presented as an intermediate software methodology for CCA system computation resource management.
Citations: 3
Power and clock gating modelling in coarse grained reconfigurable systems
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2911713
Tiziana Fanni, Carlo Sau, P. Meloni, L. Raffo, F. Palumbo
Power reduction is one of the biggest challenges in modern systems and tends to become a severe issue in complex scenarios. To provide high performance and flexibility, designers often opt for coarse-grained reconfigurable (CGR) systems. Nevertheless, these systems require specific attention to the power problem, since a large set of resources may be underutilized while computing a certain task. This paper focuses on this issue. Targeting CGR devices, we propose a way to model power and clock gating costs in advance on the basis of the functional, technological and architectural parameters of the baseline CGR system. The proposed flow guides designers towards optimal implementations, saving designer effort and time.
Citations: 14
Optimizing sparse matrix computations through compiler-assisted programming
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903157
K. Rietveld, H. Wijshoff
Existing high-performance implementations of sparse matrix codes are intricate and result in large code bases. In fact, a single floating-point operation requires 400 to 600 lines of additional code to "prepare" this operation. This imbalance severely obscures code development, thereby complicating maintenance and portability. In this paper, we propose a drastically different approach in order to continue to effectively handle these codes. We propose to only specify the essence of the computation on the level of individual matrix elements. All additional source code to embed these computations is then generated and optimized automatically by the compiler. This approach is far superior to existing library approaches and allows the code that performs scatter/gather operations, matrix reordering, matrix data structure handling, handling of fill-in, etc., to be generated automatically. Experiments show that very efficient data structures can be generated and the resulting codes can be very competitive.
Citations: 5
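As an illustration of the bookkeeping that the sparse matrix abstract above describes, the following is a minimal sketch of a sparse matrix-vector product over a compressed sparse row (CSR) layout. CSR is used here only because it is a common format; the abstract does not say which data structures the compiler generates, and the function names `dense_to_csr` and `csr_matvec` are hypothetical.

```python
def dense_to_csr(rows):
    """Convert a dense list-of-lists matrix into CSR arrays.

    Returns (values, col_idx, row_ptr). This is hand-written storage
    handling, shown only for illustration (assumed format, not the paper's).
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in rows:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)   # keep only the nonzero entries
                col_idx.append(j)  # remember the column of each nonzero
        row_ptr.append(len(values))  # end of this row in `values`
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # The one essential multiply-add; x is gathered through col_idx.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y
```

Only the multiply-add inside the inner loop is the computation itself; the remaining lines exist to manage the storage format, which is the kind of imbalance the abstract points to.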
Breadth first search vectorization on the Intel Xeon Phi
Pub Date : 2016-04-11 DOI: 10.1145/2903150.2903180
Mireya Paredes, G. Riley, M. Luján
Breadth First Search (BFS) is a building block for graph algorithms and has recently been used for large scale analysis of information in a variety of applications including social networks, graph databases and web searching. Due to its importance, a number of different parallel programming models and architectures have been exploited to optimize the BFS. However, due to the irregular memory access patterns and the unstructured nature of the large graphs, its efficient parallelization is a challenge. The Xeon Phi is a massively parallel architecture available as an off-the-shelf accelerator, which includes a powerful 512-bit vector unit with optimized scatter and gather functions. Given its potential benefits, work related to graph traversing on this architecture is an active area of research. We present a set of experiments in which we explore architectural features of the Xeon Phi and how best to exploit them in a top-down BFS algorithm; the techniques can also be applied to the current state-of-the-art hybrid (top-down plus bottom-up) algorithms. We focus on the exploitation of the vector unit by developing an improved highly vectorized OpenMP parallel algorithm, using vector intrinsics, and understanding the use of data alignment and prefetching. In addition, we investigate the impact of hyperthreading and thread affinity on performance, a topic that appears under-researched in the literature. As a result, we achieve what we believe is the fastest published top-down BFS algorithm on the version of Xeon Phi used in our experiments. The vectorized BFS top-down source code presented in this paper is available on request as free-to-use software.
Citations: 12
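For readers unfamiliar with the term, the top-down BFS named in the abstract above can be sketched as a level-synchronous traversal that expands the current frontier one level at a time. This is a plain sequential sketch under stated assumptions: it does not reproduce the paper's OpenMP parallelization, vector intrinsics, alignment or prefetching, and the function name `bfs_top_down` and the adjacency-list input are hypothetical.

```python
def bfs_top_down(adj, source):
    """Level-synchronous top-down BFS over an adjacency list.

    adj[u] is the list of neighbours of vertex u. Returns a parent list in
    which unreached vertices keep the value -1 and the source is its own
    parent. Sequential illustration only (assumed interface).
    """
    parent = [-1] * len(adj)
    parent[source] = source
    frontier = [source]
    while frontier:
        next_frontier = []
        for u in frontier:
            # Top-down step: each frontier vertex inspects its neighbours.
            for v in adj[u]:
                if parent[v] == -1:      # first visit claims the vertex
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier         # advance to the next level
    return parent
```

The reads of `parent[v]` follow the graph's edge structure rather than a regular stride, which gives a sense of the irregular memory access pattern the abstract mentions.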
Towards co-designed optimizations in parallel frameworks: a MapReduce case study
Pub Date : 2016-03-31 DOI: 10.1145/2903150.2903162
Colin Barrett, Christos Kotselidis, M. Luján
The explosion of Big Data was followed by the proliferation of numerous complex parallel software stacks whose aim is to tackle the challenges of data deluge. A drawback of such a multi-layered hierarchical deployment is the inability to maintain and delegate vital semantic information between layers in the stack. Software abstractions increase the semantic distance between an application and its generated code. However, parallel software frameworks contain inherent semantic information that general purpose compilers are not designed to exploit. This paper presents a case study demonstrating how the specific semantic information of the MapReduce paradigm can be exploited on multicore architectures. MR4J has been implemented in Java and evaluated against hand-optimized C and C++ equivalents. The initial observed results led to the design of a semantically aware optimizer that runs automatically without requiring modification to application code. The optimizer is able to speed up the execution time of MR4J by up to 2.0x. The introduced optimization not only improves the performance of the generated code during the map phase, but also reduces the pressure on the garbage collector. This demonstrates how semantic information can be harnessed without sacrificing sound software engineering practices when using parallel software frameworks.
Citations: 3
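The MapReduce paradigm referred to in the abstract above can be sketched in a few lines. This is a generic single-threaded illustration, not MR4J's API (MR4J is a Java framework whose interface is not described in the abstract); the names `map_reduce`, `word_mapper` and `count_reducer` are hypothetical.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Run a minimal map, group-by-key, reduce pipeline (assumed interface)."""
    groups = defaultdict(list)
    for item in inputs:
        # Map phase: each input item emits zero or more (key, value) pairs.
        for key, value in mapper(item):
            groups[key].append(value)
    # Reduce phase: combine all values collected for each key.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.split():
        yield word.lower(), 1


def count_reducer(key, values):
    """Sum the counts emitted for one word."""
    return sum(values)
```

Calling `map_reduce(["a b a", "B c"], word_mapper, count_reducer)` counts words across both lines. The many short-lived intermediate pairs emitted by the mapper are the sort of allocation that a framework-aware optimizer might target; whether MR4J's optimizer works this way is an assumption, since the abstract states only that map-phase performance improves and garbage collector pressure drops.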
Proceedings of the ACM International Conference on Computing Frontiers
DOI: 10.1145/2903150
Citations: 8