
2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW): Latest Publications

Pinocchio: A Blockchain-Based Algorithm for Sensor Fault Tolerance in Low Trust Environment
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00077
Chen Zeng, Yifan Wang, Fan Liang, Xiaohui Peng
With the advancement of the IoT market and technology, sensor-equipped devices belonging to different interest groups have more opportunities to cooperate on a shared assignment. Ensuring agreement on correctly sensed data in such a low-trust environment is a new challenge in today's market. To bridge the gap, we propose Pinocchio, a blockchain-based algorithm that tolerates faults in a WSN (Wireless Sensor Network) and defends sensed data against malicious data-tampering and masquerade attacks. Compared to other distributed approaches to sensor fault tolerance, Pinocchio greatly reduces the message complexity of the entire network to $O(N)$ and that of a single node to $O(1)$. Considering the possible waste of resources caused by vicious competition for hash power in blockchain-based approaches, we design the Geppetto algorithm to supervise and control hash power in a distributed manner, and its effectiveness is demonstrated by experiments.
Citations: 5
Analyzing and Understanding the Impact of Interconnect Performance on HPC, Big Data, and Deep Learning Applications: A Case Study with InfiniBand EDR and HDR
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00147
Amit Ruhela, Shulei Xu, K. V. Manian, H. Subramoni, D. Panda
Communication interfaces of High Performance Computing (HPC) systems, Cloud middleware, and Deep Learning (DL) frameworks have been continually evolving to meet the ever-increasing communication demands placed on them by HPC, Cloud, and DL applications. Modern high-performance interconnects such as InfiniBand EDR and InfiniBand HDR are capable of delivering 100 Gbps and 200 Gbps, respectively. However, no previous study has demonstrated how much benefit an end user in the HPC, Cloud, and DL computing domains can expect from using newer generations of these interconnects over older ones. In this paper, we evaluate the InfiniBand EDR and HDR high-performance interconnects over the PCIe Gen3 interface with HPC, Cloud, and DL workloads. Our comprehensive analysis, done at different levels, provides a broad view of the impact these modern interconnects have on the performance of HPC, Cloud, and DL applications. The results of our experiments show that the latest InfiniBand HDR interconnect gives the best performance for all three computing domains.
Citations: 2
Real-time Automatic Modulation Classification using RFSoC
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00021
Stephen Tridgell, D. Boland, P. Leong, R. Kastner, Alireza Khodamoradi, Siddhartha
The computational complexity of deep learning has led to research efforts to reduce the computation required. The use of low precision is particularly effective on FPGAs, as they are not restricted to byte-addressable operations. Very low precision activations and weights can, however, have a significant impact on accuracy. This work demonstrates that, by exploiting throughput matching, higher precision on certain layers can be used to recover this accuracy. This is applied to the domain of automatic modulation classification for radio signals, leveraging the RF capabilities offered by the Xilinx ZCU111 RFSoC platform. The implemented networks achieve high-speed real-time performance with a classification latency of $\approx 8\,\mu$s and an operational throughput of 488k classifications per second. On the open-source RadioML dataset, we demonstrate how to recover 4.3% in accuracy with the same hardware usage using our technique.
Citations: 10
Performance Evaluation of Pipelined Communication Combined with Computation in OpenCL Programming on FPGA
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00083
N. Fujita, Ryohei Kobayashi, Y. Yamaguchi, Tomohiro Ueno, K. Sano, T. Boku
In recent years, many High Performance Computing (HPC) researchers have been attracted to using Field Programmable Gate Arrays (FPGAs) for HPC applications. Thanks to their I/O capabilities, FPGAs can be used for communication as well as computation. The difficulty of FPGA development has kept HPC scientists from using FPGAs in their applications; however, High Level Synthesis (HLS) allows them to do so at a reasonable cost. In this study, we propose the Communication Integrated Reconfigurable CompUting System (CIRCUS), which makes the high-speed interconnection of FPGAs usable from OpenCL. CIRCUS builds a single fused pipeline combining computation and communication, which hides the communication latency by completely overlapping the two. In this paper, we present the details of the implementation and the evaluation results using two benchmarks: a pingpong benchmark and an allreduce benchmark.
Citations: 17
Autonomous Task Dropping Mechanism to Achieve Robustness in Heterogeneous Computing Systems
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00013
Ali Mokhtari, Chavit Denninnart, M. Salehi
Robustness of a distributed computing system is defined as the ability to maintain its performance in the presence of uncertain parameters. Uncertainty is a key problem in heterogeneous (and even homogeneous) distributed computing systems that perturbs system robustness. Notably, the performance of these systems is perturbed by uncertainty in both task execution time and arrival. Accordingly, our goal is to make the system robust against these uncertainties. Considering task execution time as a random variable, we use probabilistic analysis to develop an autonomous proactive task dropping mechanism to attain our robustness goal. Specifically, we provide a mathematical model that identifies the optimality of a task dropping decision, so that the system robustness is maximized. Then, we leverage the mathematical model to develop a task dropping heuristic that achieves the system robustness within a feasible time complexity. Although the proposed model is generic and can be applied to any distributed system, we concentrate on heterogeneous computing (HC) systems that have a higher degree of exposure to uncertainty than homogeneous systems. Experimental results demonstrate that the autonomous proactive dropping mechanism can improve the system robustness by up to 20%.
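The probabilistic reasoning in this abstract can be illustrated with a small sketch. It is not the authors' mathematical model or heuristic: it assumes discrete execution-time distributions given as dictionaries, a single machine queue processed in order, and a fixed hypothetical `threshold` below which a task is dropped. The names `convolve`, `success_probability`, and `drop_unlikely_tasks` are invented for illustration.

```python
from collections import defaultdict


def convolve(pmf_a, pmf_b):
    """PMF of the sum of two independent discrete random variables."""
    out = defaultdict(float)
    for ta, pa in pmf_a.items():
        for tb, pb in pmf_b.items():
            out[ta + tb] += pa * pb
    return dict(out)


def success_probability(completion_pmf, deadline):
    """Probability that the completion time does not exceed the deadline."""
    return sum(p for t, p in completion_pmf.items() if t <= deadline)


def drop_unlikely_tasks(queue, threshold):
    """queue: list of (name, exec_time_pmf, deadline) in execution order.

    Assumption of this sketch: a kept task always runs to completion,
    so its completion-time PMF becomes the start-time PMF of the next task.
    """
    start = {0: 1.0}  # the machine is idle at time 0
    kept, dropped = [], []
    for name, exec_pmf, deadline in queue:
        completion = convolve(start, exec_pmf)
        if success_probability(completion, deadline) < threshold:
            dropped.append(name)  # dropping leaves the start PMF unchanged
        else:
            kept.append(name)
            start = completion
    return kept, dropped


# Hypothetical workload: task "b" cannot meet its deadline behind "a".
queue = [
    ("a", {2: 0.5, 4: 0.5}, 5),
    ("b", {3: 1.0}, 4),
    ("c", {1: 1.0}, 6),
]
kept, dropped = drop_unlikely_tasks(queue, threshold=0.5)
```

Dropping "b" here also helps "c": its chance of success is computed from the completion time of "a" alone, which is the kind of effect a proactive dropping decision tries to exploit.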
Citations: 14
PDCunplugged: A Free Repository of Unplugged Parallel Distributed Computing Activities
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00060
Suzanne J. Matthews
Integrating parallel and distributed computing (PDC) topics into core computing courses is a topic of increasing interest for educators. However, there is a question of how best to introduce PDC to undergraduates. Several educators have proposed the use of “unplugged activities”, such as role-playing dramatizations and analogies, to introduce PDC concepts. Yet unplugged activities for PDC are widely scattered and often difficult to find, making it challenging for educators to create and incorporate unplugged interventions in their classrooms. The PDCunplugged project seeks to rectify these issues by providing a free repository where educators can find and share unplugged activities related to PDC. The existing curation contains nearly forty unique unplugged activities collected from thirty years of the PDC literature and from all over the Internet, and maps each activity to relevant CS2013 PDC knowledge units and TCPP PDC topic areas. Learn more about the project at pdcunplugged.org.
Citations: 6
Importance of Selecting Data Layouts in the Tsunami Simulation Code
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00140
Takumi Kishitani, K. Komatsu, Masayuki Sato, A. Musa, Hiroaki Kobayashi
Exploiting memory performance is one of the keys to accelerating memory-intensive applications. One way to improve memory performance is to make memory accesses efficient. Since the memory access pattern changes depending on the data layout, choosing the appropriate data layout is necessary for effective memory access. This paper focuses on tsunami simulation as one of the high performance computing applications that require high memory performance. To examine the performance variance due to data layouts, several data layouts are applied to the tsunami simulation. The evaluation results show that the performance of the tsunami simulation is sensitive to the input data, the computing system, and the data layout. The execution time of the tsunami simulation with an array of structures is much longer than with a discrete array or a structure of arrays. The discrete array and the structure of arrays do not perform well in every case; their performance changes with the computing system and the input data. Based on these observations, this paper indicates the importance of data layout selection for exploiting memory performance.
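To make the layouts named in this abstract concrete, the following minimal sketch stores the same cells as an array of structures (interleaved records) and as a structure of arrays (one contiguous array per field). The field names `h`, `u`, `v` and the grid size are assumptions chosen for illustration; they are not taken from the tsunami code, and Python is used only to show byte offsets, not to measure performance.

```python
import struct
from array import array

FIELDS = ("h", "u", "v")        # hypothetical per-cell quantities
RECORD = struct.Struct("<3d")   # one cell = three packed 8-byte doubles
N = 4                           # number of cells in this toy example

cells = [(float(i), 10.0 + i, 20.0 + i) for i in range(N)]

# Array of structures: the fields of each cell are interleaved in memory.
aos = b"".join(RECORD.pack(*cell) for cell in cells)

# Structure of arrays: each field lives in its own contiguous array.
soa = {name: array("d", (cell[k] for cell in cells))
       for k, name in enumerate(FIELDS)}


def aos_offset(i, field):
    """Byte offset of `field` of cell `i` inside the interleaved buffer."""
    return i * RECORD.size + FIELDS.index(field) * 8


def sum_field_aos(field):
    """Sweep one field in the AoS buffer; every access skips the other fields."""
    return sum(struct.unpack_from("<d", aos, aos_offset(i, field))[0]
               for i in range(N))


def sum_field_soa(field):
    """Sweep one field in the SoA layout; accesses are contiguous."""
    return sum(soa[field])


# Distance in bytes between consecutive values of the same field.
aos_stride = aos_offset(1, "h") - aos_offset(0, "h")
soa_stride = soa["h"].itemsize
```

Both sweeps return the same sum, but a loop over a single field touches one value per record in the first layout and consecutive values in the second. A discrete-array layout would look like the per-field arrays without the enclosing container.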
Citations: 2
Kronecker Graph Generation with Ground Truth for 4-Cycles and Dense Structure in Bipartite Graphs
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00052
Trevor Steil, Scott McMillan, G. Sanders, R. Pearce, Benjamin W. Priest
We demonstrate that nonstochastic Kronecker graph generators produce massive-scale bipartite graphs with ground truth global and local properties, and we discuss their use for validation of graph analytics. Given two small connected scale-free graphs with adjacency matrices $A$ and $B$, their Kronecker product graph [1] has adjacency matrix $C=A\otimes B$. We first demonstrate that having one factor $A$ non-bipartite (alternatively, adding all self loops to a bipartite $A$) with the other factor $B$ bipartite ensures that $\mathcal{G}_C$ is bipartite and connected. Formulas for ground truth of many graph properties (including degree, diameter, and eccentricity) carry over directly from the general case presented in previous work [2], [3]. However, the analysis of higher-order structure and dense structure is different in bipartite graphs, as no odd-length cycles exist (including triangles) and the densest possible structures are bicliques. We derive formulas to give ground truth for 4-cycles (a.k.a. squares or butterflies) at every vertex and edge in $\mathcal{G}_C$. Additionally, we demonstrate that bipartite communities (dense vertex subsets) in the factors $A, B$ yield dense bipartite communities in the Kronecker product $C$. We also discuss interesting properties of Kronecker product graphs revealed by the formulas and their impact on using them as benchmarks with ground truth for various complex analytics. For example, for connected $A$ and $B$ of nontrivial size, $\mathcal{G}_C$ has 4-cycles at vertices/edges associated with vertices/edges in $A$ and $B$ that have none, making it difficult to generate graphs with ground truth bipartite generalizations of truss decomposition (e.g., the k-wing decomposition of [4]).
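The claim about bipartiteness and connectivity of the product can be checked on toy factors. The sketch below is only an illustration on tiny hand-picked graphs, not the authors' generator: `kron` builds the Kronecker product of two dense 0/1 adjacency matrices, and `analyze` (a hypothetical helper) reports whether a graph is connected and whether it is bipartite using breadth-first 2-coloring.

```python
from collections import deque


def kron(A, B):
    """Kronecker product of two square matrices given as lists of lists."""
    n, m = len(A), len(B)
    return [[A[i // m][j // m] * B[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]


def analyze(adj):
    """Return (connected, bipartite) for an undirected adjacency matrix."""
    n = len(adj)
    color = [None] * n
    components, bipartite = 0, True
    for start in range(n):
        if color[start] is not None:
            continue
        components += 1
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if not adj[u][v]:
                    continue
                if color[v] is None:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    bipartite = False  # an odd cycle (or self loop) exists
    return components == 1, bipartite


TRIANGLE = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]  # connected, non-bipartite
EDGE = [[0, 1], [1, 0]]                       # connected, bipartite

product_mixed = analyze(kron(TRIANGLE, EDGE))  # non-bipartite x bipartite
product_both = analyze(kron(EDGE, EDGE))       # bipartite x bipartite
```

With these factors, the triangle times a single edge yields a 6-cycle, which is both connected and bipartite, whereas the product of two single edges falls apart into two disjoint edges. This matches the role the abstract assigns to the non-bipartite factor.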
Citations: 0
Communication Avoiding 2D Stencil Implementations over PaRSEC Task-Based Runtime
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00127
Yu Pei, Qinglei Cao, G. Bosilca, P. Luszczek, V. Eijkhout, J. Dongarra
Stencil computations and general sparse matrix-vector products (SpMV) are key components in many algorithms, such as geometric multigrid or Krylov solvers. However, their low arithmetic intensity means that memory bandwidth and network latency are the performance-limiting factors. The current architectural trend favors computation over bandwidth, worsening the already unfavorable imbalance. Previous work approached stencil kernel optimization either by improving memory bandwidth usage or by providing a Communication Avoiding (CA) scheme that minimizes network latency in repeated sparse vector multiplication by replicating remote work in order to delay communications on the critical path. Focusing on minimizing the communication bottleneck in distributed stencil computation, in this study we combine a CA scheme with the computation and communication overlap that is inherent in a dataflow task-based runtime system such as PaRSEC to demonstrate their combined benefits. We implemented the 2D five-point stencil (Jacobi iteration) in PETSc, and over PaRSEC in two flavors: full communications (base-PaRSEC) and CA-PaRSEC, both of which operate directly on a 2D compute grid. Our results on two clusters, NaCL and Stampede2, indicate that we can achieve a 2X speedup over the standard SpMV solution implemented in PETSc, and in certain cases when kernel execution does not dominate the execution time, the CA-PaRSEC version achieved up to 57% and 33% speedup over the base-PaRSEC implementation on NaCL and Stampede2, respectively.
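The idea of replicating remote work to delay communication can be sketched in a few lines. This is a serial toy under stated assumptions, not the PaRSEC or PETSc implementation: the grid size, the initial values, the tile bounds, and the helper names `jacobi_step`, `sweeps`, and `tile_with_halo` are all hypothetical. A tile copied together with a halo of width `s` can run `s` Jacobi sweeps on its own, and its interior still matches the result of sweeping the whole grid, so a halo exchange is needed only every `s` steps.

```python
def jacobi_step(g):
    """One five-point Jacobi sweep; the outermost ring is left unchanged."""
    n, m = len(g), len(g[0])
    new = [row[:] for row in g]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.25 * (g[i - 1][j] + g[i + 1][j]
                                + g[i][j - 1] + g[i][j + 1])
    return new


def sweeps(g, steps):
    """Apply `steps` Jacobi sweeps and return the resulting grid."""
    for _ in range(steps):
        g = jacobi_step(g)
    return g


def tile_with_halo(g, r0, r1, c0, c1, s):
    """Copy rows r0..r1-1 and cols c0..c1-1 plus a halo of width s."""
    return [row[c0 - s:c1 + s] for row in g[r0 - s:r1 + s]]


n, s = 12, 2                      # grid size and halo width (assumed)
r0, r1, c0, c1 = 4, 8, 4, 8       # tile owned by one hypothetical process
grid = [[float((7 * i + 3 * j) % 5) for j in range(n)] for i in range(n)]

# Reference: s sweeps over the whole grid (communication at every step).
reference = sweeps(grid, s)

# Communication-avoiding: one halo copy, then s purely local sweeps.
# The halo ring goes stale, but the error needs s steps to reach the tile.
local = sweeps(tile_with_halo(grid, r0, r1, c0, c1, s), s)

max_diff = max(abs(local[s + i][s + j] - reference[r0 + i][c0 + j])
               for i in range(r1 - r0) for j in range(c1 - c0))
```

The price is redundant arithmetic in the halo region, which is why such a scheme is most attractive when kernel execution does not dominate the run time, as the abstract notes.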
Citations: 1
EduPar-20 Keynote Speaker
Pub Date : 2020-05-01 DOI: 10.1109/ipdpsw50202.2020.00054
Martin Langhammer
The fields of computer and information science and engineering (CISE) are central to nearly all of society’s needs, opportunities, and challenges. The US National Science Foundation (NSF) was created 70 years ago with a broad mission to promote the progress of science and to catalyze societal and economic benefits. NSF, largely through its CISE directorate which has an annual budget of more than $1B, accounts for over 85% of federally-funded, academic, fundamental computer science research in the US. My talk will give an overview of NSF/CISE research, education, and research infrastructure programs, and relate them to the technical and societal trends and topics that will impact their future trajectory. My talk will highlight opportunity areas for education and workforce development across the computing and information sciences, with a particular emphasis on parallelism and advanced computing and information topics.
Citations: 0