
Latest publications from the 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Rethinking Support for Region Conflict Exceptions
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00116
Swarnendu Biswas, Rui Zhang, Michael D. Bond, Brandon Lucia
Current shared-memory systems provide well-defined execution semantics only for data-race-free executions. A state-of-the-art technique called Conflict Exceptions (CE) extends M(O)ESI-based coherence to provide defined semantics to all program executions. However, CE incurs significant performance costs because of its need to frequently access metadata in memory. In this work, we explore designs for practical architecture support for region conflict exceptions. First, we propose an on-chip metadata cache called access information memory (AIM) to reduce memory accesses in CE. The extended design is called CE+. In spite of the AIM, CE+ stresses or saturates the on-chip interconnect and the off-chip memory network bandwidth because of its reliance on eager write-invalidation-based coherence. We explore whether detecting conflicts is potentially better suited to cache coherence based on release consistency and self-invalidation, rather than M(O)ESI-based coherence. We realize this insight in a novel architecture design called ARC. Our evaluation shows that CE+ improves the run-time performance and energy usage over CE for several applications across different core counts, but can suffer performance penalties from network saturation. ARC generally outperforms CE, and is competitive with CE+ on average while stressing the on-chip interconnect and off-chip memory network much less, showing that coherence based on release consistency and self-invalidation is well suited to detecting region conflicts.
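The mechanism the paper builds hardware support for can be sketched in software. The sketch below is illustrative only (the names and the per-location metadata map are not the paper's design): each shared location records which regions have read or written it, and a conflicting access from another region raises an exception instead of yielding undefined semantics.

```python
# Software analogue of region conflict detection: per-location access metadata,
# with an exception raised on a cross-region read-write or write-write conflict.

class RegionConflict(Exception):
    pass

class ConflictDetector:
    def __init__(self):
        # location -> (writer region or None, set of reader regions)
        self.meta = {}

    def on_access(self, region, loc, is_write):
        writer, readers = self.meta.get(loc, (None, set()))
        if is_write:
            # A write conflicts with any other region's prior read or write.
            if (writer is not None and writer != region) or (readers - {region}):
                raise RegionConflict(f"conflicting write to {loc} by {region}")
            writer = region
        else:
            # A read conflicts only with another region's prior write.
            if writer is not None and writer != region:
                raise RegionConflict(f"conflicting read of {loc} by {region}")
            readers = readers | {region}
        self.meta[loc] = (writer, readers)

    def region_end(self, region):
        # Clear this region's metadata when its synchronization-free region ends.
        for loc, (writer, readers) in list(self.meta.items()):
            if writer == region:
                writer = None
            readers.discard(region)
            self.meta[loc] = (writer, readers)
```

CE's cost comes from keeping exactly this kind of metadata in memory; the AIM cache and ARC's self-invalidation-based design are two ways of making the bookkeeping cheap.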
Citations: 3
MULTISKIPGRAPH: A Self-Stabilizing Overlay Network that Maintains Monotonic Searchability
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00093
Linghui Luo, C. Scheideler, Thim Strothmann
Self-stabilizing overlay networks have the advantage of being able to recover from illegal states and faults. However, the majority of these networks cannot give any guarantees on their functionality while the recovery process is going on. We are especially interested in searchability, i.e., the guarantee that search messages for a specific node are answered successfully if that node exists in the network. In this paper we investigate overlay networks that ensure the maintenance of monotonic searchability while the self-stabilization is going on. More precisely, once a search message from node u to another node v is successfully delivered, all future search messages from u to v succeed as well. We extend the existing research by focusing on skip graphs and present a solution for two scenarios: (i) the goal topology is a supergraph of the perfect skip graph and (ii) the goal topology is exactly the perfect skip graph.
Citations: 5
D3: Deterministic Data Distribution for Efficient Data Reconstruction in Erasure-Coded Distributed Storage Systems
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00064
Zhipeng Li, Min Lv, Yinlong Xu, Yongkun Li, Liangliang Xu
Due to individual unreliable commodity components, failures are common in large-scale distributed storage systems. Erasure codes are widely deployed in practical storage systems to provide fault tolerance with low storage overhead. However, the commonly used random data placement in storage systems based on erasure codes induces heavy cross-rack traffic, load imbalance, and random access, which slow down the recovery process upon failures. In this paper, with orthogonal arrays, we define a Deterministic Data Distribution (D^3) of blocks to nodes and racks, and propose an efficient failure recovery approach based on D^3. D^3 not only uniformly distributes data/parity blocks among storage servers, but also balances the repair traffic among racks and storage servers for failure recovery. Furthermore, D^3 also minimizes the cross-rack repair traffic for data layouts against a single rack failure and provides sequential access for failure recovery. We implement D^3 in Hadoop Distributed File System (HDFS) with a cluster of 28 machines. Our experiments show that D^3 significantly speeds up the failure recovery process compared with random data distribution, e.g., 2.21 times for (6, 3)-RS code in a system consisting of eight racks and three nodes in each rack.
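The balance properties D^3 targets can be illustrated with a much simpler deterministic layout. The rotation-based placement below is not the paper's orthogonal-array construction; it is a hedged sketch showing the two goals for the abstract's (6, 3)-RS setting of eight racks with three nodes each: uniform load across all nodes, and at most m = 3 blocks of any stripe in one rack so a single rack failure stays recoverable.

```python
# Simplified deterministic block placement for a (6, 3)-RS code on 8 racks x 3
# nodes. Each stripe's 9 blocks occupy 9 consecutive (rack, node) slots in a
# rotation over all 24 nodes, so load is uniform and no rack holds more than
# 2 blocks of any one stripe.

from collections import Counter

RACKS, NODES_PER_RACK = 8, 3
K, M = 6, 3  # (6, 3)-RS: 6 data blocks + 3 parity blocks per stripe

def place_stripe(stripe_id):
    """Return the (rack, node) location of each of the stripe's K+M blocks."""
    total_nodes = RACKS * NODES_PER_RACK
    placement = []
    for b in range(K + M):
        slot = (stripe_id + b) % total_nodes  # rotate start slot per stripe
        placement.append((slot % RACKS, slot // RACKS))
    return placement
```

Over any 24 consecutive stripes this layout gives every node exactly nine blocks, while a random placement only balances in expectation; the paper's construction additionally optimizes the repair traffic itself.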
Citations: 8
Load-Balanced Sparse MTTKRP on GPUs
Pub Date: 2019-04-06 DOI: 10.1109/IPDPS.2019.00023
Israt Nisa, Jiajia Li, Aravind Sukumaran-Rajam, R. Vuduc, P. Sadayappan
Sparse matricized tensor times Khatri-Rao product (MTTKRP) is one of the most computationally expensive kernels in sparse tensor computations. This work focuses on optimizing the MTTKRP operation on GPUs, addressing both performance and storage requirements. We begin by identifying the performance bottlenecks in directly extending the state-of-the-art CSF (compressed sparse fiber) format from CPUs to GPUs. A significant challenge with GPUs compared to multicore CPUs is that of utilizing the much greater degree of parallelism in a load-balanced fashion for irregular computations like sparse MTTKRP. To address this issue, we develop a new storage-efficient representation for tensors that enables high-performance, load-balanced execution of MTTKRP on GPUs. A GPU implementation of sparse MTTKRP using the new sparse tensor representation is shown to outperform all currently known parallel sparse CPU and GPU MTTKRP implementations.
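For readers unfamiliar with the kernel itself: mode-0 MTTKRP of a third-order sparse tensor X with factor matrices B and C computes M[i, :] += X[i, j, k] * (B[j, :] * C[k, :]) over all nonzeros. The NumPy sketch below only pins down these semantics on a COO-format tensor; the paper's contribution is the load-balanced GPU storage format and execution, not this reference loop.

```python
# Reference (unoptimized) sparse MTTKRP, mode 0, on a COO-format 3-way tensor.

import numpy as np

def mttkrp_coo(coords, vals, B, C, num_rows):
    """coords: (nnz, 3) int array of (i, j, k) indices; vals: (nnz,) nonzeros."""
    R = B.shape[1]                       # factorization rank
    M = np.zeros((num_rows, R))
    for (i, j, k), v in zip(coords, vals):
        M[i] += v * B[j] * C[k]          # elementwise (Khatri-Rao) row product
    return M
```

On a GPU, parallelizing this loop over nonzeros is where load imbalance arises: rows of M owned by "heavy" fibers receive far more updates than others, which is exactly what the paper's representation is designed to even out.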
Citations: 36
Improving Strong-Scaling of CNN Training by Exploiting Finer-Grained Parallelism
Pub Date: 2019-03-15 DOI: 10.1109/IPDPS.2019.00031
Nikoli Dryden, N. Maruyama, Tom Benson, Tim Moon, M. Snir, B. V. Essen
Scaling CNN training is necessary to keep up with growing datasets and reduce training time. We also see an emerging need to handle datasets with very large samples, where memory requirements for training are large. Existing training frameworks use a data-parallel approach that partitions samples within a mini-batch, but limits on scaling the mini-batch size and memory consumption make this untenable for large samples. We describe and implement new approaches to convolution, which parallelize using spatial decomposition or a combination of sample and spatial decomposition. This introduces many performance knobs for a network, so we develop a performance model for CNNs and present a method for using it to automatically determine efficient parallelization strategies. We evaluate our algorithms with microbenchmarks and image classification with ResNet-50. Our algorithms allow us to prototype a model for a mesh-tangling dataset, where sample sizes are very large. We show that our parallelization achieves excellent strong and weak scaling and enables training for previously unreachable datasets.
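The core of spatial decomposition is that a convolution over a large sample can be split into tiles, provided each worker receives a halo of neighboring elements from adjacent tiles. The 1-D toy below (my own illustration, not the paper's implementation) shows that stitching the tiles' valid outputs reproduces the full convolution exactly when interior halos are exchanged.

```python
# Toy spatial decomposition of a 1-D "valid" convolution: each worker's tile
# is padded with a halo of (kernel_size - 1) // 2 neighbor elements, so the
# concatenated per-tile outputs equal the undecomposed result.

import numpy as np

def conv1d_valid(x, w):
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

def spatial_conv1d(x, w, parts=2):
    halo = (len(w) - 1) // 2          # one-sided halo width for an odd kernel
    step = len(x) // parts
    outs = []
    for p in range(parts):
        lo = p * step
        hi = (p + 1) * step if p < parts - 1 else len(x)
        # "Halo exchange": extend the tile with neighbor data (clipped at the
        # global boundaries, where no neighbor exists).
        tile = x[max(lo - halo, 0):min(hi + halo, len(x))]
        outs.append(conv1d_valid(tile, w))
    return np.concatenate(outs)
```

In a real CNN the halo exchange becomes inter-worker communication per convolution layer, which is the overhead the paper's performance model trades against the memory savings.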
Citations: 43
A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning
Pub Date: 2019-01-29 DOI: 10.1109/IPDPS.2019.00018
Tal Ben-Nun, Maciej Besta, Simon Huber, A. Ziogas, D. Peter, T. Hoefler
We introduce Deep500: the first customizable benchmarking infrastructure that enables fair comparison of the plethora of deep learning frameworks, algorithms, libraries, and techniques. The key idea behind Deep500 is its modular design, where deep learning is factorized into four distinct levels: operators, network processing, training, and distributed training. Our evaluation illustrates that Deep500 is customizable (enables combining and benchmarking different deep learning codes) and fair (uses carefully selected metrics). Moreover, Deep500 is fast (incurs negligible overheads), verifiable (offers infrastructure to analyze correctness), and reproducible. Finally, as the first distributed and reproducible benchmarking system for deep learning, Deep500 provides software infrastructure to utilize the most powerful supercomputers for extreme-scale workloads.
Citations: 76
Robust Dynamic Resource Allocation via Probabilistic Task Pruning in Heterogeneous Computing Systems
Pub Date: 2019-01-27 DOI: 10.1109/IPDPS.2019.00047
James Gentry, Chavit Denninnart, M. Salehi
In heterogeneous distributed computing (HC) systems, diversity can exist in both computational resources and arriving tasks. In an inconsistently heterogeneous computing system, task types have different execution times on heterogeneous machines. A method is required to map arriving tasks to machines based on machine availability and performance, maximizing the number of tasks meeting deadlines (defined as robustness). For tasks with hard deadlines (e.g., those in live video streaming), tasks that miss their deadlines are dropped. The problem investigated in this research is maximizing the robustness of an oversubscribed HC system. A way to maximize this robustness is to prune (i.e., defer or drop) tasks with low probability of meeting their deadlines to increase the probability of other tasks meeting their deadlines. In this paper, we first provide a mathematical model to estimate a task's probability of meeting its deadline in the presence of task dropping. We then investigate methods for engaging probabilistic dropping and we find thresholds for dropping and deferring. Next, we develop a pruning-aware mapping heuristic and extend it to engender fairness across various task types. We show the cost benefit of using probabilistic pruning in an HC system. Simulation results, harnessing a selection of mapping heuristics, show efficacy of the pruning mechanism in improving robustness (on average by around 25%) and cost in an oversubscribed HC system by up to around 40%.
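The pruning decision can be made concrete with a small sketch. The Gaussian execution-time model and the function names below are illustrative assumptions, not the paper's exact formulation: estimate the probability that a task finishes before its hard deadline given its queue wait and the machine's execution-time distribution, then drop the task when that probability falls below a threshold.

```python
# Hedged sketch of probabilistic task pruning: a task is dropped when its
# estimated probability of meeting the deadline falls below a threshold,
# assuming execution time ~ N(exec_mean, exec_std) on the candidate machine.

import math

def p_meet_deadline(now, deadline, queue_wait, exec_mean, exec_std):
    """P(now + queue_wait + exec_time <= deadline) under the Gaussian model."""
    slack = deadline - now - queue_wait - exec_mean
    # Standard normal CDF expressed via the error function.
    return 0.5 * (1.0 + math.erf(slack / (exec_std * math.sqrt(2.0))))

def should_drop(now, deadline, queue_wait, exec_mean, exec_std, threshold=0.25):
    return p_meet_deadline(now, deadline, queue_wait, exec_mean, exec_std) < threshold
```

Dropping such a task early frees machine time for queued tasks that can still meet their deadlines, which is the robustness gain the paper quantifies.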
Citations: 15
SimFS: A Simulation Data Virtualizing File System Interface
Pub Date: 2019-01-24 DOI: 10.1109/IPDPS.2019.00071
S. D. Girolamo, Pirmin Schmid, T. Schulthess, T. Hoefler
Nowadays, simulations can produce petabytes of data to be stored in parallel filesystems or large-scale databases. This data is accessed over the course of decades often by thousands of analysts and scientists. However, storing these volumes of data for long periods of time is not cost effective and, in some cases, practically impossible. We propose to transparently virtualize the simulation data, relaxing the storage requirements by not storing the full output and re-simulating the missing data on demand. We develop SimFS, a file system interface that exposes a virtualized view of the simulation output to the analysis applications and manages the re-simulations. SimFS monitors the access patterns of the analysis applications in order to (1) decide the data to keep stored for faster accesses and (2) to employ prefetching strategies to reduce the access time of missing data. Virtualizing simulation data allows us to trade storage for computation: this paradigm becomes similar to traditional on-disk analysis (all data is stored) or in situ (no data is stored) according to the storage resources that are assigned to SimFS. Overall, by exploiting the growing computing power and relaxing the storage capacity requirements, SimFS offers a viable path towards exa-scale simulations.
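The storage-for-computation trade can be sketched in a few lines. The class below is an illustrative analogue, not SimFS's actual interface: reads go through a bounded cache of materialized simulation steps, and a miss triggers re-simulation of just the requested step.

```python
# Minimal sketch of simulation-output virtualization: an LRU cache of
# materialized steps backed by an on-demand re-simulation hook.

from collections import OrderedDict

class VirtualSimOutput:
    def __init__(self, simulate, capacity):
        self.simulate = simulate          # step -> data (the re-simulation hook)
        self.capacity = capacity          # storage budget, in cached steps
        self.cache = OrderedDict()        # LRU order: oldest entry first
        self.resimulations = 0

    def read(self, step):
        if step in self.cache:
            self.cache.move_to_end(step)  # mark as recently used
            return self.cache[step]
        data = self.simulate(step)        # miss: re-simulate on demand
        self.resimulations += 1
        self.cache[step] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used step
        return data
```

Shrinking `capacity` moves the system toward in situ analysis (everything recomputed); growing it moves toward traditional on-disk analysis (everything stored), matching the spectrum the abstract describes.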
Citations: 8
DLHub: Model and Data Serving for Science
Pub Date: 2018-11-27 DOI: 10.1109/IPDPS.2019.00038
Ryan Chard, Zhuozhao Li, K. Chard, Logan T. Ward, Y. Babuji, A. Woodard, S. Tuecke, B. Blaiszik, M. Franklin, Ian T Foster
While the Machine Learning (ML) landscape is evolving rapidly, there has been a relative lag in the development of the "learning systems" needed to enable broad adoption. Furthermore, few such systems are designed to support the specialized requirements of scientific ML. Here we present the Data and Learning Hub for science (DLHub), a multi-tenant system that provides both model repository and serving capabilities with a focus on science applications. DLHub addresses two significant shortcomings in current systems. First, its self-service model repository allows users to share, publish, verify, reproduce, and reuse models, and addresses concerns related to model reproducibility by packaging and distributing models and all constituent components. Second, it implements scalable and low-latency serving capabilities that can leverage parallel and distributed computing resources to democratize access to published models through a simple web interface. Unlike other model serving frameworks, DLHub can store and serve any Python 3-compatible model or processing function, plus multiple-function pipelines. We show that relative to other model serving systems including TensorFlow Serving, SageMaker, and Clipper, DLHub provides greater capabilities, comparable performance without memoization and batching, and significantly better performance when the latter two techniques can be employed. We also describe early uses of DLHub for scientific applications.
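The abstract's claim that DLHub serves "any Python 3-compatible model or processing function, plus multiple-function pipelines" can be illustrated with a toy composition helper. The registry-free API below is invented for illustration and is not DLHub's interface.

```python
# Toy multiple-function serving pipeline: compose ordinary Python processing
# functions into a single servable callable.

def make_pipeline(*stages):
    """Chain processing functions; each stage's output feeds the next stage."""
    def serve(payload):
        for stage in stages:
            payload = stage(payload)
        return payload
    return serve
```

A typical served pipeline would chain preprocessing, model inference, and postprocessing stages in exactly this shape, with the serving system handling packaging and scalable execution around it.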
Citations: 66
MD-GAN: Multi-Discriminator Generative Adversarial Networks for Distributed Datasets
Pub Date : 2018-11-09 DOI: 10.1109/IPDPS.2019.00095
Corentin Hardy, E. L. Merrer, B. Sericola
A recent technical breakthrough in machine learning is the discovery and wide application of Generative Adversarial Networks (GANs). These generative models are computationally demanding: a GAN is composed of two deep neural networks and trains on large datasets, and it is generally trained on a single server. In this paper, we address the problem of distributing GANs so that they can train over datasets that are spread across multiple workers. We present MD-GAN as the first solution to this problem: a novel learning procedure for GANs that fits this distributed setup. We then compare the performance of MD-GAN to a version of federated learning adapted to GANs, using the MNIST, CIFAR10, and CelebA datasets. MD-GAN reduces the learning complexity on each worker node by a factor of two, while matching or exceeding the performance of the federated-learning adaptation. We conclude by discussing the practical implications of distributing GANs.
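The division of labour described above, a single central generator with one discriminator per data shard whose feedback the server aggregates, can be sketched schematically in plain Python. The scalar "models" and all names here are illustrative assumptions, not the paper's implementation:

```python
def train_round(gen_param, shards, lr=0.1):
    """One MD-GAN-style round: every worker scores the central generator
    against its local data shard; the server averages the feedback."""
    feedbacks = []
    for shard in shards:
        local_mean = sum(shard) / len(shard)
        # Stand-in for a discriminator's gradient signal on local data.
        feedbacks.append(local_mean - gen_param)
    # Server-side aggregation and generator update.
    return gen_param + lr * sum(feedbacks) / len(feedbacks)

# Dataset spread over three workers; the generator never sees raw data,
# only the aggregated discriminator feedback.
shards = [[1.0, 2.0], [3.0, 5.0], [4.0, 6.0]]
g = 0.0
for _ in range(200):
    g = train_round(g, shards)
print(round(g, 3))  # converges to 3.5, the mean of the per-shard means
```

The key design point this mirrors is that raw data stays on the workers and only compact feedback crosses the network, which is also why each worker needs to host only a discriminator rather than a full GAN.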
Citations: 132
Journal: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)