
Workshop Proceedings of the 49th International Conference on Parallel Processing: Latest Publications

Symmetric Tokens based Group Mutual Exclusion
A. Aravind
The group mutual exclusion (GME) problem is a generalization of the mutual exclusion problem. The problem is fundamental to parallel and distributed processing, as it is inherent in several applications in the modern multicore-integrated cloud era of distributed computing. This paper proposes a First-Come-First-Served (FCFS) GME algorithm for n threads that uses only atomic read/write operations. The proposed algorithm has three key features: (i) its simplicity; (ii) its complexity in both space (shared variable requirement) and time (remote memory references (RMR)) in cache-coherent (CC) models; and (iii) it settles the open problem posed in 2001.
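As background for the abstract above, the sketch below illustrates only the semantics of the GME problem: threads requesting the same session may share the critical section, while threads requesting a conflicting session must wait. It is a lock-based baseline under assumed names (gme_enter, gme_exit) and pthread synchronization, not the paper's algorithm, which uses only atomic read/write operations and guarantees FCFS.

    /* Illustrative baseline for GME semantics only: threads that request the same
     * session may occupy the critical section together, while a different session
     * must wait.  Lock-based sketch, NOT the paper's read/write-only FCFS algorithm. */
    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  free_room = PTHREAD_COND_INITIALIZER;
    static int current_session = -1;   /* -1 means no session is active */
    static int occupants = 0;

    void gme_enter(int session)
    {
        pthread_mutex_lock(&lock);
        while (current_session != -1 && current_session != session)
            pthread_cond_wait(&free_room, &lock);   /* a conflicting session is active */
        current_session = session;
        occupants++;
        pthread_mutex_unlock(&lock);
    }

    void gme_exit(void)
    {
        pthread_mutex_lock(&lock);
        if (--occupants == 0) {
            current_session = -1;                   /* room empties: any session may enter */
            pthread_cond_broadcast(&free_room);
        }
        pthread_mutex_unlock(&lock);
    }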
Citations: 1
Fast Modeling of Network Contention in Batch Point-to-point Communications by Packet-level Simulation with Dynamic Time-stepping
Zhang Yang, Jintao Peng, Qingkai Liu
Network contention has long been one of the root causes of performance loss in large-scale parallel applications. With the increasing importance of performance modeling to both large-scale application optimization and application-system co-design, the conflict between speed and accuracy in contention modeling is becoming prominent. Cycle-accurate network simulators are often too slow for large-scale applications, while point-to-point analytical models are not accurate enough to capture contention effects. To model the network contention in batch point-to-point communications, we propose a unified contention model built after the flow-fair end-to-end congestion control mechanism. The model uses packet-level simulation for accuracy, but can be approximated by a flow-level semi-analytical model when messages are large enough, which makes it fast. Furthermore, we propose a dynamic time-stepping technique which significantly speeds up the packet-level simulation with only minor accuracy loss. Experiments with typical communication patterns and application traces show that our model accurately predicts the communication time, with an average error of 9% (fixed time step), and that the dynamic time-stepping technique improves simulation performance by up to 131-fold with an average accuracy loss of 10.5% on real application traces.
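A minimal, hedged sketch of the dynamic time-stepping idea over a single shared link with flow-fair bandwidth sharing: the simulated clock jumps directly to the next flow completion instead of advancing in fixed packet-sized steps. The link bandwidth, flow sizes, and loop structure are illustrative assumptions, not the paper's packet-level model.

    /* Flow-level sketch of dynamic time-stepping under flow-fair sharing: all flows
     * crossing one shared link split its bandwidth equally, and time advances to the
     * next flow-completion event.  Capacities and message sizes are placeholders. */
    #include <stdio.h>

    #define NFLOWS  3
    #define LINK_BW 12.5e9                              /* bytes/s (~100 Gb/s link) */

    int main(void)
    {
        double remaining[NFLOWS] = { 1e9, 4e8, 2e8 };   /* bytes left per flow */
        int active = NFLOWS;
        double now = 0.0;

        while (active > 0) {
            double share = LINK_BW / active;            /* flow-fair bandwidth share */
            double step = 1e30;
            for (int i = 0; i < NFLOWS; i++)            /* time until next completion */
                if (remaining[i] > 0 && remaining[i] / share < step)
                    step = remaining[i] / share;

            now += step;                                /* dynamic time step */
            for (int i = 0; i < NFLOWS; i++) {
                if (remaining[i] <= 0) continue;
                remaining[i] -= share * step;
                if (remaining[i] <= 1e-9) { remaining[i] = 0; active--; }
            }
            printf("t = %.6f s, %d flows still active\n", now, active);
        }
        return 0;
    }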
Citations: 0
Communication-aware Job Scheduling using SLURM
P. Mishra, Tushar Agrawal, Preeti Malakar
Job schedulers play an important role in selecting optimal resources for the submitted jobs. However, most of the current job schedulers do not consider job-specific characteristics such as communication patterns during resource allocation. This often leads to sub-optimal node allocations. We propose three node allocation algorithms that consider the job’s communication behavior to improve the performance of communication-intensive jobs. We develop our algorithms for tree-based network topologies. The proposed algorithms aim at minimizing network contention by allocating nodes on the least contended switches. We also show that allocating nodes in powers of two leads to a decrease in inter-switch communication for MPI communications, which further improves performance. We implement and evaluate our algorithms using SLURM, a widely-used and well-known job scheduler. We show that the proposed algorithms can reduce the execution times of communication-intensive jobs by 9% (326 hours) on average. The average wait time of jobs is reduced by 31% across three supercomputer job logs.
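A minimal sketch of the least-contended-switch idea from the abstract: among switches that can host the whole job, pick the one with the lowest contention score. The struct sw type, the contention metric, and pick_switch are assumed names for illustration; the paper's SLURM integration also handles multi-switch allocations and power-of-two sizing.

    /* Hedged sketch: choose the least-contended switch with enough free nodes.
     * Returns the switch index, or -1 if no single switch can host the job. */
    #include <stddef.h>

    struct sw { int free_nodes; double contention; };

    int pick_switch(const struct sw *switches, size_t nsw, int nodes_needed)
    {
        int best = -1;
        for (size_t i = 0; i < nsw; i++) {
            if (switches[i].free_nodes < nodes_needed)
                continue;                              /* cannot host the whole job */
            if (best < 0 || switches[i].contention < switches[best].contention)
                best = (int)i;                         /* lower contention wins */
        }
        return best;
    }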
Citations: 2
Preference Aware Smart Hospital Selection System for Patients
Md. Solaiman Chowdhury, Jenifar Rahman, Md. Mahfuzur Rahman
With the rapid advancement of wireless and mobile technologies, context information about the user or environment can now easily be collected and analyzed to create useful services. Traditional healthcare facilities in most developing countries do not provide medical services of uniform quality, and patients face many difficulties in choosing the best-suited medical services or hospitals when they fall sick. To choose appropriate services, patients must weigh many criteria, which makes the decision complex. An efficient system is needed to help patients automatically gather the information necessary to make the correct medical-service selection. In this paper, we propose a preference-aware hospital selection model integrated into a cloud-computing-based context-aware system that helps patients select appropriate services. Through experimentation, we show that the developed system makes accurate decisions for patients.
Citations: 0
A GCC-based Compliance Checker for Single-translation-unit, Identifier-related MISRA-C Rules
Guan-Ren Wang, Peng-Sheng Chen
MISRA-C is a well-defined software specification for the C programming language that gives programmers criteria for developing reliable programs. This paper presents a MISRA-C compliance checker implemented on top of the GCC compiler infrastructure. It focuses on identifier-related rules that can be checked within a single translation unit. We describe and develop strategies for implementing the checking code, and we also discuss the rules that can already be detected by existing GCC options. For the tested benchmark programs, the modified GCC compiler correctly assesses compliance with the target MISRA-C rules.
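To make the class of rules concrete, the snippet below shows an identifier-related violation that is fully visible within a single translation unit: an inner declaration hides a file-scope identifier (MISRA-C:2012 Rule 5.3). GCC can already flag this particular pattern with -Wshadow, an example of the existing options the abstract refers to; the snippet is only an illustration, not the paper's benchmark code.

    /* Identifier-related, single-translation-unit violation: the inner 'count'
     * hides the file-scope 'count' (MISRA-C:2012 Rule 5.3).  Compile with
     * -Wshadow and GCC reports the shadowing. */
    static int count = 0;

    int increment(int step)
    {
        int count = step;   /* non-compliant: hides the file-scope 'count' */
        return count + 1;
    }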
Citations: 0
Assessing the Overhead of Offloading Compression Tasks
L. Promberger, R. Schwemmer, H. Fröning
Exploring compression is increasingly promising as a trade-off between computation and data movement. There are two main reasons: first, the gap between processing speed and I/O continues to grow, and technology trends indicate that it will keep growing; second, performance is determined by energy efficiency, and overall power consumption is dominated by data movement. For these reasons there is already a plethora of related work on compression from various domains. Most recently, a number of accelerators have been introduced to offload compression tasks from the main processor, for instance by AHA, Intel and Microsoft. Yet the overhead of compression when offloading tasks is not well understood. In particular, such offloading is most beneficial for overlapping with other tasks if the associated overhead on the main processor is negligible. This work evaluates the integration costs compared to a solely software-based solution, considering multiple compression algorithms. Among others, High Energy Physics data are used as a prime example of big data sources. The results imply that on average the zlib implementation on the accelerator achieves a compression ratio comparable to zlib level 2 on a CPU, while delivering up to 17 times the throughput and using over 80% less CPU resources. These results suggest that, given the right orchestration of compression and data-movement tasks, the overhead of offloading compression is limited but present. Considering that compression is only a single task of a larger data processing pipeline, this overhead cannot be neglected.
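A hedged sketch of the CPU-side software baseline referenced in the results: compress a buffer with zlib at level 2 and report the compression ratio. The buffer size and contents are placeholder assumptions; the paper's evaluation uses High Energy Physics data and also measures throughput and CPU utilization on the accelerator path.

    /* CPU baseline sketch: zlib level 2 compression of a placeholder buffer. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>

    int main(void)
    {
        const uLong src_len = 16 * 1024 * 1024;
        Bytef *src = malloc(src_len);
        memset(src, 'A', src_len);                 /* placeholder, highly compressible */

        uLongf dst_len = compressBound(src_len);   /* worst-case output size */
        Bytef *dst = malloc(dst_len);

        if (compress2(dst, &dst_len, src, src_len, 2) != Z_OK) {   /* zlib level 2 */
            fprintf(stderr, "compression failed\n");
            return 1;
        }
        printf("compression ratio: %.2f\n", (double)src_len / (double)dst_len);
        free(src);
        free(dst);
        return 0;
    }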
Citations: 3
Improving the Space-Time Efficiency of Matrix Multiplication Algorithms
Yuan Tang
Classic cache-oblivious parallel matrix multiplication algorithms achieve optimality either in time or in space, but not both, which has prompted much research on the best possible balance or trade-off between the two. We study modern processor-oblivious runtime systems and identify several ways to improve an algorithm's time complexity while still keeping its space and cache requirements asymptotically optimal. Based on this study, we present sub-linear-time algorithms with optimal work, space, and cache bounds for both general matrix multiplication on a semiring and Strassen-like fast algorithms on a ring. Our experiments show that these algorithms have empirical advantages over their classic counterparts. Our study provides new insights and research angles on how to optimize cache-oblivious parallel algorithms from both theoretical and empirical perspectives.
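For context, below is the classic cache-oblivious divide-and-conquer multiplication (C += A*B over the usual (+, *) semiring) that such work builds on; it is a baseline sketch, not the paper's sub-linear-time variants. The base-case size, the power-of-two assumption on n, and the ld leading-dimension parameter are illustrative assumptions.

    /* Classic cache-oblivious matrix multiply: recurse on quadrants until the
     * blocks fit comfortably in cache, then fall back to plain loops.  n is
     * assumed a power of two; ld is the leading dimension of the full matrices. */
    #define BASE 32

    static void matmul(const double *A, const double *B, double *C, int n, int ld)
    {
        if (n <= BASE) {                           /* base case: simple triple loop */
            for (int i = 0; i < n; i++)
                for (int k = 0; k < n; k++)
                    for (int j = 0; j < n; j++)
                        C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
            return;
        }
        int h = n / 2;
        const double *A11 = A, *A12 = A + h, *A21 = A + h * ld, *A22 = A + h * ld + h;
        const double *B11 = B, *B12 = B + h, *B21 = B + h * ld, *B22 = B + h * ld + h;
        double *C11 = C, *C12 = C + h, *C21 = C + h * ld, *C22 = C + h * ld + h;

        matmul(A11, B11, C11, h, ld); matmul(A12, B21, C11, h, ld);
        matmul(A11, B12, C12, h, ld); matmul(A12, B22, C12, h, ld);
        matmul(A21, B11, C21, h, ld); matmul(A22, B21, C21, h, ld);
        matmul(A21, B12, C22, h, ld); matmul(A22, B22, C22, h, ld);
    }

Recursing on quadrants shrinks the working set geometrically without any cache-size parameter, which is what makes the scheme cache-oblivious.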
Citations: 0
Network and Load-Aware Resource Manager for MPI Programs
Ashish Kumar Kumar, N. Jain, Preeti Malakar
We present a resource broker for MPI jobs in a shared cluster that considers the current compute load and the available network bandwidths. MPI programs are generally communication-intensive, so the current network availability between the compute nodes impacts performance. Many existing resource allocation techniques consider mostly static node attributes and only some dynamic resource attributes. This does not lead to a good allocation in the case of shared clusters, because network usage and system load vary. We developed a load- and network-aware heuristic for resource allocation that incorporates the current network state. It is able to reduce execution times by more than 38% on average compared to the default allocation.
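A minimal sketch of the kind of point-to-point measurement a network-aware broker could use to sample current bandwidth between two nodes: a simple MPI ping-pong between ranks 0 and 1. The message size and repetition count are assumptions, and this is not the paper's measurement tool.

    /* MPI ping-pong bandwidth probe between ranks 0 and 1 (run with >= 2 ranks). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int bytes = 4 * 1024 * 1024, reps = 50;   /* illustrative probe size */
        char *buf = malloc(bytes);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)                                  /* data crosses the link twice per rep */
            printf("bandwidth: %.2f MB/s\n", 2.0 * bytes * reps / (t1 - t0) / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }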
Citations: 0
BSRNG: A High Throughput Parallel BitSliced Approach for Random Number Generators
Saleh Khalaj Monfared, Omid Hajihassani, M. Kiarostami, S. M. Zanjani, Dara Rahmati, S. Gorgin
In this work, a high-throughput method for generating high-quality pseudo-random numbers using the bitslicing technique is proposed. In this technique, column-major data representation is employed instead of the conventional row-major representation, which allows the bitsliced implementation to take full advantage of the entire available datapath of the hardware platform. By employing this data representation as a building block, we showcase the capability and scalability of the proposed method across various PRNG methods in the category of block and stream ciphers. The LFSR-based (Linear Feedback Shift Register) nature of the PRNGs in our implementation perfectly suits the GPU's many-core, register-oriented architecture. In the proposed SIMD-vectorized GPU implementation, each GPU thread can generate 32 pseudo-random bits in each LFSR clock cycle. We then compare our implementation with some of the most significant PRNGs that display satisfactory throughput and randomness criteria. The proposed implementation successfully passes the NIST tests for statistical randomness and bit-wise correlation. In terms of performance and performance per cost, the technique is efficient compared with computer-based PRNGs and optical solutions, while maintaining an acceptable randomness measure. The highest performance among all of the CPRNGs implemented with the proposed method is achieved by the MICKEY 2.0 algorithm, which shows a 40% improvement over NVIDIA's state-of-the-art proprietary high-performance PRNG library, cuRAND, achieving 2.72 Tb/s of throughput on the affordable NVIDIA GTX 2080 Ti.
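A minimal sketch of the bitslicing idea for LFSR-based generators: bit i of the state of 32 independent LFSR instances is packed into word state[i], so a single XOR-and-shift pass clocks all 32 instances and yields 32 output bits, one per lane. The register length and tap positions are illustrative assumptions and do not correspond to MICKEY 2.0 or the other ciphers evaluated in the paper.

    /* Bitsliced Fibonacci LFSR: each of the 32 bit lanes of a uint32_t word is an
     * independent generator; state[i] holds bit i of all 32 instances. */
    #include <stdint.h>

    #define NBITS 31                                   /* LFSR length (illustrative) */

    /* One clock of 32 parallel LFSRs; returns one output bit per lane. */
    uint32_t lfsr_step(uint32_t state[NBITS])
    {
        uint32_t out = state[NBITS - 1];               /* output bit of every lane */
        uint32_t fb  = state[30] ^ state[27];          /* feedback taps (illustrative) */
        for (int i = NBITS - 1; i > 0; i--)            /* shift all lanes by one position */
            state[i] = state[i - 1];
        state[0] = fb;
        return out;
    }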
Citations: 3
Randomized Authentication using IBE for Opportunistic Networks
Kai Wang, Kazuya Sakai
Opportunistic networks (ONs) are widely used in many critical network applications, and security and privacy issues in ONs are critical for their wide adoption. In this paper, we propose a randomized authentication protocol consisting of node registration and authentication phases, built on identity-based encryption (IBE) and a trust framework. The key ideas of our authentication protocol are to generate public keys from publicly available node IDs and to allow not only the central registration server but also nodes with a high trust value to authenticate other nodes in the network. As a result, our protocol is lightweight and the authentication process is randomized in a distributed way. In addition, to mitigate the disadvantages of IBE, we introduce the idea of distributed KGCs (key generation centers) together with the trust framework. The protocol-level security of the proposed scheme is proven by indistinguishability-based provable security analysis using random oracles, and qualitative security analyses of various attacks are conducted.
Citations: 2