
Latest Publications from the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Optimizing Memory Access in TCF Processors with Compute-Update Operations
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00100
M. Forsell, J. Roivainen, J. Träff
The thick control flow (TCF) model is a data parallel abstraction of the thread model. It merges homogeneous threads (called fibers) flowing through the same control path into entities (called TCFs) with a single control flow and multiple data flows. Fibers of a TCF are executed synchronously with respect to each other, and their number can be altered dynamically at runtime. Multiple TCFs can be executed in parallel to support control parallelism. In our previous work, we outlined a special architecture, TPA (Thick control flow Processor Architecture), for executing TCF programs efficiently and showed that designing algorithms with the TCF model often leads to increased performance and simplified programs due to higher abstraction, eliminated loops, and reduced redundant program elements. Compute-update memory operations, such as multioperations and atomic instructions, are known to speed up parallel algorithms performing reductions and synchronizations. In this paper, we propose special compute-update memory operations for TCF processors to optimize iterative exclusive inter-fiber memory access patterns. Acceleration is achieved, e.g., in matrix addition and log-prefix style patterns, in which multiple target locations can interchange data without the reloads between instructions that slow down execution. Our solution is based on modified active memory units and special memory operations that can send their reply value to a fiber other than the one initiating the access. We implement these operations in our TPA processor at minimal hardware cost and show that the expected speedups are achieved. Programming examples are given.
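The central mechanism here is a memory operation that updates a location and routes the reply value to a fiber other than the initiator, so dependent fibers need no reload between instructions. The following minimal Python sketch simulates that semantics in software; the class, method names, and mailbox scheme are illustrative assumptions, since the real TPA realizes this in hardware active memory units.

```python
# Toy software simulation of a compute-update memory operation whose reply
# is delivered to a fiber other than the initiator (names are hypothetical;
# the actual TPA implements this in modified active memory units).

class ActiveMemoryUnit:
    def __init__(self, size, num_fibers):
        self.mem = [0] * size
        self.mailbox = [[] for _ in range(num_fibers)]  # one reply queue per fiber

    def compute_update(self, addr, op, operand, reply_fiber):
        """Apply 'op' to mem[addr] and send the previous value to 'reply_fiber'."""
        old = self.mem[addr]
        self.mem[addr] = op(old, operand)
        self.mailbox[reply_fiber].append(old)  # reply routed to another fiber

# Fibers 0..3 each add their contribution to a shared counter; each reply
# (the running prefix) is forwarded to the *next* fiber, not the initiator.
amu = ActiveMemoryUnit(size=1, num_fibers=4)
for fiber in range(4):
    amu.compute_update(addr=0, op=lambda a, b: a + b, operand=fiber + 1,
                       reply_fiber=(fiber + 1) % 4)

print(amu.mem[0])    # 10 = 1 + 2 + 3 + 4
print(amu.mailbox)   # each fiber received the prefix computed before its turn
```

In this toy run each fiber's update returns the running prefix to the next fiber, mirroring the log-prefix style data interchange the abstract mentions.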
Citations: 0
Data Parallel Large Sparse Deep Neural Network on GPU
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00170
Naw Safrin Sattar, Shaikh Arifuzzaman
Sparse Deep Neural Network (DNN) is an emerging research area, since deploying deep neural networks with limited resources is very challenging. In this work, we provide a scalable solution to the Sparse DNN Challenge (a challenge posed by MIT/IEEE/Amazon GraphChallenge.org) by designing data parallelism on GPUs. We provide a solution based on Python TensorFlow, as it is a widely used tool for deep learning in different scientific applications. We use the datasets provided by GraphChallenge, derived from the MNIST handwritten digits. We use the synthetic DNNs from RadiX-Net with varying numbers of neurons and layers. We implement a data parallel version of Sparse DNN using TensorFlow on GPU. Our solution shows up to 4.7× speedup over the baseline serial MATLAB implementation given in GraphChallenge. In addition, our TensorFlow GPU implementation demonstrates a 3-fold speedup over our TensorFlow CPU implementation.
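As a rough illustration of the approach the abstract describes (a sparse DNN expressed in Python TensorFlow), a single sparse layer can be written with TensorFlow's sparse-dense matmul. The shapes, values, and ReLU activation below are illustrative assumptions, not the challenge code.

```python
# Sketch of one sparse DNN layer in TensorFlow (illustrative shapes and
# values; the actual solution uses the RadiX-Net synthetic DNNs).
import tensorflow as tf

# Sparse weight matrix W (num_in x num_out) as a tf.sparse.SparseTensor.
weights = tf.sparse.SparseTensor(
    indices=[[0, 0], [1, 1], [2, 0]],   # nonzero coordinates, row-major order
    values=[0.5, -1.0, 2.0],
    dense_shape=[3, 2],
)

# Dense activations Y (batch x num_in); under data parallelism each GPU
# would hold a slice of the batch.
activations = tf.constant([[1.0, 2.0, 3.0],
                           [4.0, 5.0, 6.0]])

def sparse_layer(y, w_sp, bias):
    """Compute ReLU(Y @ W + b) with a sparse W."""
    # sparse_dense_matmul keeps the sparse operand first, so compute
    # (W^T @ Y^T)^T to get the batch-major result.
    z = tf.transpose(
        tf.sparse.sparse_dense_matmul(w_sp, y, adjoint_a=True, adjoint_b=True))
    return tf.nn.relu(z + bias)

print(sparse_layer(activations, weights, bias=tf.constant([0.1, 0.1])))
```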
Citations: 6
Machine Learning-Based Prefetching for SCM Main Memory System
Pub Date : 2020-05-01 DOI: 10.1109/ipdpsw50202.2020.00133
Mayuko Koezuka, Yusuke Shirota, S. Shirai, Tatsunori Kanai
Demand for in-memory processing of large-scale data is expanding, and expectations for storage-class memories (SCMs) are increasing accordingly. SCM achieves low standby power and higher density compared to DRAM. However, SCM is relatively slower than DRAM and requires more dynamic power. Therefore, it is necessary to improve the speed and reduce the power usage of SCM by performing memory-hierarchy control, such as power-efficient prefetch control, according to application memory access characteristics. However, such memory-hierarchy control is complicated, making it difficult to determine an optimal memory control. Therefore, we propose an auto-tuning framework for dynamically predicting optimal memory control for an SCM main memory system using machine learning based on system-level time-series performance data. In this paper, we describe the application of the proposed framework to prefetch control and evaluate the feasibility of power-efficient prefetch control. The results confirm automatic generation of prediction models reflecting domain knowledge of computer systems, allowing high-speed, low-power, real-time memory control.
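The paper's framework is not reproduced here, but the core idea, predicting a prefetch-control setting from windows of system-level time-series performance data, can be sketched as a small supervised-learning loop. Everything below (the feature names, the decision-tree model, the two-way on/off decision, the labeling rule) is an illustrative assumption.

```python
# Illustrative sketch: learn a prefetch on/off policy from windows of
# system-level performance counters (features, model, and labels are
# assumptions for illustration, not the paper's actual framework).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Each sample: [cache_miss_rate, mem_bandwidth_util, access_regularity],
# averaged over one monitoring window.
features = rng.random((200, 3))

# Assumed labeling rule for the sketch: prefetching pays off when accesses
# are regular and bandwidth is not already saturated.
labels = ((features[:, 2] > 0.5) & (features[:, 1] < 0.7)).astype(int)

model = DecisionTreeClassifier(max_depth=3).fit(features, labels)

# At runtime, the latest counter window selects the next control setting.
window = np.array([[0.3, 0.4, 0.9]])  # regular accesses, spare bandwidth
print("enable prefetch" if model.predict(window)[0] else "disable prefetch")
```

A shallow tree is used here deliberately: an interpretable model makes it easier to check that the learned policy reflects domain knowledge, which is the property the abstract highlights.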
Citations: 0
Teaching Cloud Computing: Motivations, Challenges and Tools
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00062
C. Anglano, M. Canonico, Marco Guazzone
Teaching Cloud Computing is becoming crucial, since this recent computing paradigm is used in many fields and is changing the way we use applications and technology. As a matter of fact, most of the applications we use every day through the web are based on cloud services. Unfortunately, the difficulty of setting up a real testbed for students and, at the same time, the lack of easy, open, and collaborative educational material that is freely available make teaching Cloud Computing a hard task. In this paper we discuss the state of the art concerning teaching Cloud Computing, and we propose educational materials and tools that make Cloud Computing easy to use even for students and educators without any computer science skills.
Citations: 2
Workshop 14: iWAPT Automatic Performance Tuning
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00132
I. Chung, K. Komatsu
iWAPT (International Workshop on Automatic Performance Tuning) is a series of workshops that focus on research and techniques related to performance sustainability issues. The series provides an opportunity for researchers and users of automatic performance tuning (AT) technologies to exchange ideas and experiences acquired when applying such technologies to improve the performance of algorithms, libraries, and applications; in particular, on cutting edge computing platforms. Topics of interest include performance modeling; adaptive algorithms; autotuned numerical algorithms; libraries and scientific applications; empirical compilation; automated code generation; frameworks and theories of AT and software optimization; autonomic computing; and context-aware computing.
Citations: 0
Exploring Chapel Productivity Using Some Graph Algorithms
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00114
R. Barrett, Jeanine E. Cook, Stephen L. Olivier, O. Aaziz, Chris Jenkins, C. Vaughan
A broad set of data science and engineering questions may be organized as graphs, providing a powerful means for describing relational data. Although experts now routinely compute graph algorithms on huge, unstructured graphs using high performance computing (HPC) or cloud resources, this practice hasn’t yet broken into the mainstream. Such computations require great expertise, yet users often need rapid prototyping and development to quickly customize existing code. Toward that end, we are exploring the use of the Chapel programming language as a means of making some important graph analytics more accessible, examining the breadth of characteristics that would make for a productive programming environment, one that is expressive, performant, portable, and robust.
Citations: 1
SpiderWeb - High Performance FPGA NoC
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00025
M. Langhammer, Gregg Baeckler, Sergey Gribok
In this paper we introduce SpiderWeb, a new methodology for building high-speed soft networks on FPGAs. There are many reasons why greater internal bandwidth is an increasingly important issue for FPGAs. Compute density is growing rapidly on FPGA, from historical precisions such as single-precision floating point to the massively parallel low-precision operations required by machine learning inference. It is difficult for current FPGA fabrics, with designs developed using standard methods and tool flows, to provide a reliable way of generating wide and/or high-speed data distribution buses. In contrast, SpiderWeb uses a specific NoC generation methodology that provides predictable area and performance for these structures, with area and speed accurately known before compile time. The generated NoCs can be incorporated into large, complex designs, implemented with standard design flows, without compromising the routability of the system.
Citations: 5
Workshop 13: PDSEC Parallel and Distributed Scientific and Engineering Computing
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00122
R. Couturier, P. Strazdins, E. Aubanel, S. Roller, L. Yang, T. Rauber, G. Rünger
The technological trends in HPC system evolution indicate an increasing burden placed on application developers due to the management of the unprecedented complexity levels of hardware and its associated performance characteristics. Many existing scientific application codes are unlikely to perform well on future systems without major modifications or even complete rewrites. In the future, it will be necessary to utilize, in concert, many characteristics such as multiple levels of parallelism, many lightweight cores, complex memory hierarchies, novel I/O technology, power capping, system-wide temporal/spatial performance heterogeneity, and reliability concerns. The parallel and distributed computing (PDC) community has developed new programming models, algorithms, libraries, and tools to meet these challenges in order to accommodate productive code development and effective system use. However, the scientific application community still needs to identify the benefits through practical evaluations. Thus, the focus of this workshop is on methodologies and experiences used in scientific and engineering applications and algorithms to achieve sustainable code development for better productivity, application performance, and reliability.
Citations: 0
Scalable Deep Learning Inference: Algorithmic Approach
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00166
Minsik Cho
Large-scale deep learning training has made significant progress in the last few years: more powerful systems and accelerators are delivered (e.g., the Summit cluster), innovative training mechanisms are designed (e.g., sophisticated hyper-parameter tuning), and advanced communication techniques are exercised (e.g., async-SGD). However, deep learning inference has rather limited options when it comes to scaling up the model density per device. Quantization to lower precision can be helpful, along with sparsification techniques such as pruning and compression, yet both are constrained by the underlying hardware architecture and by their efficacy.
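As one concrete instance of the "quantization to lower precision" the abstract mentions, a symmetric int8 post-training weight quantizer can be written in a few lines. This is a generic textbook scheme shown for illustration, not the talk's specific method.

```python
# Generic symmetric int8 post-training weight quantization (a textbook
# scheme for illustration, not the specific approach of the talk).
import numpy as np

def quantize_int8(w):
    """Map float weights to int8 with a single per-tensor scale."""
    scale = np.max(np.abs(w)) / 127.0   # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
# Round-trip error is bounded by half a quantization step (~scale / 2).
print("max abs error:", np.max(np.abs(w - dequantize(q, s))))
```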
Citations: 0
In-Depth Optimization with the OpenACC-to-FPGA Framework on an Arria 10 FPGA
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00084
Jacob Lambert, Seyong Lee, J. Vetter, A. Malony
The reconfigurable computing paradigm that uses field programmable gate arrays (FPGAs) has received renewed interest in the high-performance computing field due to FPGAs’ unique combination of performance and energy efficiency. However, difficulties in programming and optimizing FPGAs have prevented them from being widely accepted as general-purpose computing devices. In accelerator-based heterogeneous computing, portability across diverse heterogeneous devices is also an important issue, but the unique architectural features of FPGAs make this difficult to achieve. To address these issues, a directive-based, high-level FPGA programming and optimization framework was previously developed. In this work, the previously developed optimizations were combined holistically using the directive-based approach to show that each individual benchmark requires a unique set of optimizations to maximize performance. The relationships between FPGA resource usage and runtime performance were also explored.
Citations: 5