
Latest Publications from the 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

NVIDIA Deep Learning Tutorial
J. Bernauer
Learn how hardware and software stacks enable not only quick prototyping, but also efficient large-scale production deployments. The tutorial will conclude with a discussion about hands-on deep learning training opportunities as well as free academic teaching materials and GPU cloud platforms for university faculty.
Citations: 1
Container-Based Cloud Platform for Mobile Computation Offloading
Song Wu, Chao Niu, J. Rao, Hai Jin, Xiaohai Dai
With the explosive growth of smartphones and cloud computing, the mobile cloud, which leverages cloud resources to boost the performance of mobile applications, becomes attractive. Many efforts have been made to improve the performance and reduce the energy consumption of mobile devices by offloading computational code to the cloud. However, the offloading cost caused by the cloud platform has been ignored for many years. In this paper, we propose Rattrap, a lightweight cloud platform which improves offloading performance from the cloud side. To achieve such goals, we analyze the characteristics of typical offloading workloads and design our platform solution accordingly. Rattrap develops a new runtime environment, Cloud Android Container, for mobile computation offloading, replacing heavyweight virtual machines (VMs). Our design exploits the idea of running operating systems with differential kernel features inside containers with driver extensions, which partially breaks the limitation of OS-level virtualization. With the proposed resource sharing and code cache mechanisms, Rattrap fundamentally improves offloading performance. Our evaluation shows that Rattrap not only reduces the startup time of runtime environments, with an average speedup of 16x, but also saves a large amount of system resources, such as 75% of the memory footprint and at least 79% of disk capacity. Moreover, Rattrap improves offloading response by as much as 63% over a VM-based cloud platform, thus saving battery life.
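The underlying decision, offload a computation only when remote execution plus data transfer beats local execution, can be summarized with a simple cost model. The sketch below is a generic illustration under assumed inputs (local/remote times, payload size, bandwidth, RTT), not Rattrap's actual offloading policy:

# Illustrative offloading decision model; the cost formula and the
# example numbers are assumptions for exposition, not Rattrap's policy.

def should_offload(local_time_s: float,
                   remote_time_s: float,
                   payload_bytes: int,
                   bandwidth_bps: float,
                   rtt_s: float) -> bool:
    """Offload when remote compute + network transfer beats local compute."""
    transfer_s = payload_bytes * 8 / bandwidth_bps + rtt_s
    return remote_time_s + transfer_s < local_time_s

# Example: a 2 s local task, 0.2 s in the cloud, 1 MB of state over 20 Mbps.
print(should_offload(2.0, 0.2, 1_000_000, 20e6, 0.05))  # True: offload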
Citations: 47
Efficient and Deterministic Scheduling for Parallel State Machine Replication
O. Mendizabal, R. S. D. Moura, F. Dotti, F. Pedone
Many services used in large-scale web applications should be able to tolerate faults without impacting their performance. State machine replication is a well-known approach to implementing fault-tolerant services, providing high availability and strong consistency. To boost the performance of state machine replication, recent proposals have introduced parallel execution of commands. In parallel state machine replication, incoming commands may or may not depend on other commands that are waiting for execution. Although dependent commands must be processed in the same relative order at every replica to avoid inconsistencies, independent commands can be executed in parallel and benefit from multi-core architectures. Since many application workloads are mostly composed of independent commands, these parallel models promise high throughput without sacrificing strong consistency. The efficient execution of commands in such environments, however, requires effective scheduling strategies. Existing approaches rely on dependency tracking based on pairwise comparison between commands, which introduces scheduling contention. In this paper, we propose a new and highly efficient scheduler for parallel state machine replication. Our scheduler considers batches of commands instead of commands individually. Moreover, each batch of commands is augmented with a compact data structure that encodes the command information needed for the dependency analysis. We show, by means of experimental evaluation, that our technique outperforms schedulers for parallel state machine replication by a fairly large margin.
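The abstract states that each batch carries a compact data structure encoding the information needed for dependency analysis, without specifying it here. One plausible realization, shown as a hedged sketch below, is a Bloom-filter-style bitmap over the keys each batch accesses: disjoint bitmaps prove two batches independent, while overlapping bitmaps conservatively force ordering. The hash construction and sizes are illustrative assumptions, not the paper's design.

# Sketch of batch-level dependency detection via a Bloom-filter-like
# bitmap over the keys each batch reads/writes (an assumed encoding).
import hashlib

BITS = 1024

def key_bits(keys):
    bits = 0
    for k in keys:
        for seed in (b"a", b"b", b"c"):  # three independent hash functions
            h = int.from_bytes(hashlib.blake2b(seed + k.encode()).digest()[:4], "big")
            bits |= 1 << (h % BITS)
    return bits

def may_conflict(batch_a_keys, batch_b_keys) -> bool:
    # Zero overlap in the bitmaps proves independence; overlap only *may*
    # indicate a conflict, since Bloom filters admit false positives.
    return key_bits(batch_a_keys) & key_bits(batch_b_keys) != 0

print(may_conflict({"x", "y"}, {"z"}))  # almost surely False: run in parallel
print(may_conflict({"x", "y"}, {"y"}))  # True: shared key "y" forces ordering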
Citations: 17
Large Scale Manycore-Aware PIC Simulation with Efficient Particle Binning
H. Nakashima, Yoshiki Summura, Keisuke Kikura, Y. Miyake
We are now developing a manycore-aware implementation of multiprocessed PIC (particle-in-cell) simulation code with automatic load balancing. A key issue of the implementation is how to exploit the wide SIMD mechanism of manycore processors such as Intel Xeon Phi. Our solution is "particle binning" to rank all particles in a cell (voxel) in a chunk of SOA (structure-of-arrays) type one-dimensional arrays so that "particle-push" and "current-scatter" operations on them are efficiently SIMD-vectorized by our compiler. In addition, our sophisticated binning mechanism performs sorting of particles according to their positions "on-the-fly", efficiently coping with occasional "bin overflow" in a fully multithreaded manner. Our performance evaluation with up to 64 nodes of Cray XC30 and XC40 supercomputers, equipped with Xeon Phi 5120D (Knights Corner) and 7250 (Knights Landing) respectively, not only exhibited good parallel performance, but also proved the effectiveness of our binning mechanism.
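The binning step described above can be pictured with a minimal NumPy sketch: compute a flattened voxel index per particle and stably sort the structure-of-arrays coordinate arrays by it, so each cell's particles become contiguous and amenable to SIMD loops. The grid size is an arbitrary example, and the paper's on-the-fly sort and bin-overflow handling are not reproduced:

# Minimal sketch of SoA particle binning by cell (voxel) with NumPy.
import numpy as np

rng = np.random.default_rng(0)
n, grid = 1_000_000, (64, 64, 64)

# Structure-of-arrays layout: one 1-D array per coordinate.
x, y, z = (rng.random(n) for _ in range(3))

ix = (x * grid[0]).astype(np.int32)
iy = (y * grid[1]).astype(np.int32)
iz = (z * grid[2]).astype(np.int32)
bin_id = (ix * grid[1] + iy) * grid[2] + iz   # flattened voxel index

order = np.argsort(bin_id, kind="stable")     # group particles by voxel
x, y, z = x[order], y[order], z[order]

# Per-voxel start offsets, so "push"/"scatter" loops run over contiguous runs.
starts = np.searchsorted(bin_id[order], np.arange(np.prod(grid)))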
Citations: 5
Elastic-Cache: GPU Cache Architecture for Efficient Fine- and Coarse-Grained Cache-Line Management
Bingchao Li, Ji-zhou Sun, M. Annavaram, N. Kim
GPUs provide high-bandwidth/low-latency on-chip shared memory and L1 cache to efficiently service a large number of concurrent memory requests (to contiguous memory space). To support warp-wide accesses to the L1 cache, GPU L1 cache lines are very wide. However, such an L1 cache architecture cannot always be efficiently utilized when applications generate many memory requests with irregular access patterns, especially due to branch and memory divergence. In this paper, we propose Elastic-Cache, which can efficiently support both fine- and coarse-grained L1 cache-line management for applications with both regular and irregular memory access patterns. Specifically, it can store 32- or 64-byte words from non-contiguous memory space in a single 128-byte cache line. Furthermore, it neither requires an extra tag storage structure nor reduces the capacity of the L1 cache, since it stores the auxiliary tags for fine-grained L1 cache-line management in shared-memory space that is not fully used in many applications. Our experiments show that Elastic-Cache improves the geo-mean performance of applications with irregular memory access patterns by 58% without degrading the performance of applications with regular memory access patterns.
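The fine-grained mode described above, where several small words from non-contiguous addresses share one wide physical line under auxiliary tags, can be modeled with a toy lookup structure. The field widths and slot-selection policy below are assumptions for exposition, not the hardware design:

# Toy model of a 128-byte line holding four independent 32-byte words,
# each guarded by an auxiliary tag (assumed layout, for illustration only).

WORD = 32           # bytes per fine-grained word
SLOTS = 4           # 4 * 32 B = one 128-byte line

class ElasticLine:
    def __init__(self):
        self.tags = [None] * SLOTS      # auxiliary per-word tags
        self.data = [None] * SLOTS

    def lookup(self, addr: int):
        tag, slot = divmod(addr // WORD, SLOTS)  # slot picked by address bits
        if self.tags[slot] == tag:
            return self.data[slot]               # fine-grained hit
        return None                              # miss: fetch 32 B, not 128 B

    def fill(self, addr: int, word_bytes):
        tag, slot = divmod(addr // WORD, SLOTS)
        self.tags[slot], self.data[slot] = tag, word_bytes

line = ElasticLine()
line.fill(0x1000, b"\x00" * WORD)        # cache one 32-byte word
line.fill(0x5020, b"\xff" * WORD)        # a non-contiguous word shares the line
print(line.lookup(0x1000) is not None)   # True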
Citations: 9
High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV Enabled InfiniBand Clusters
Jie Zhang, Xiaoyi Lu, D. Panda
High-speed interconnects (e.g., InfiniBand) have been widely deployed on modern HPC clusters. With the emergence of HPC in the cloud, high-speed interconnects have paved their way into the cloud with the recently introduced Single Root I/O Virtualization (SR-IOV) technology, which is able to provide efficient sharing of high-speed interconnect resources and achieve near-native I/O performance. However, recent studies have shown that SR-IOV-based virtual networks prevent virtual machine migration, which is an essential virtualization capability for high availability and resource provisioning. Although several initial solutions have been proposed in the literature to solve this problem, our investigations show that there are still many restrictions on these proposed approaches, such as depending on specific network adapters and/or hypervisors, which limit the usage scope of these solutions in HPC environments. In this paper, we propose a high-performance virtual machine migration framework for MPI applications on SR-IOV enabled InfiniBand clusters. Our proposed method does not need any modification to the hypervisor or InfiniBand drivers, and it can efficiently handle virtual machine (VM) migration with SR-IOV IB devices. Our evaluation results indicate that the proposed design is able not only to achieve fast VM migration but also to guarantee high performance for MPI applications during migration in the HPC cloud. At the application level, for the NPB LU benchmark running inside a VM, our proposed design is able to completely hide the migration overhead by overlapping computation with migration. Furthermore, our proposed design shows good scaling when migrating multiple VMs.
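The overhead hiding credited above comes from overlapping application computation with the bulk of the migration work. A generic sketch of that pattern follows; the function bodies and names are placeholders, not the framework's API:

# Generic computation/migration overlap pattern: migration proceeds on a
# worker thread while the application keeps computing, and only the brief
# final switch-over blocks. All names here are illustrative placeholders.
import threading, time

def migrate_vm():                 # placeholder: pre-copy of VM state, etc.
    time.sleep(2.0)               # stands in for bulk state transfer

def compute_step():               # placeholder application/MPI work
    time.sleep(0.1)

migration = threading.Thread(target=migrate_vm)
migration.start()                 # start migration in the background

while migration.is_alive():       # application keeps making progress
    compute_step()

migration.join()                  # short blocking switch-over at the end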
Citations: 16
On Optimizing Distributed Tucker Decomposition for Dense Tensors
Venkatesan T. Chakaravarthy, Jee W. Choi, Douglas J. Joseph, Xing Liu, Prakash Murali, Yogish Sabharwal, D. Sreedhar
The Tucker decomposition expresses a given tensor as the product of a small core tensor and a set of factor matrices. Our objective is to develop an efficient distributed implementation for the case of dense tensors. The implementation is based on the HOOI (Higher Order Orthogonal Iteration) procedure, wherein the tensor-times-matrix product forms the core routine. Prior work has proposed heuristics for reducing the computational load and communication volume incurred by the routine. We study the two metrics in a formal and systematic manner, and design strategies that are optimal under the two fundamental metrics. Our experimental evaluation on a large benchmark of tensors shows that the optimal strategies provide significant reductions in load and volume compared to prior heuristics, and provide up to 7x speed-up in overall running time.
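For reference, the HOOI structure the abstract refers to, tensor-times-matrix (TTM) products followed by an SVD per mode, looks as follows in a minimal single-node NumPy sketch. This is the textbook algorithm, not the paper's distributed implementation or its optimized strategies:

# Minimal dense HOOI sketch with NumPy; TTM is the core routine.
import numpy as np

def ttm(x, u, mode):
    """Mode-`mode` TTM: contract x's given axis with u's second axis."""
    return np.moveaxis(np.tensordot(u, x, axes=(1, mode)), 0, mode)

def hooi(x, ranks, iters=10):
    # Initialize factors from leading singular vectors of each unfolding (HOSVD).
    factors = [np.linalg.svd(np.moveaxis(x, n, 0).reshape(x.shape[n], -1))[0][:, :r]
               for n, r in enumerate(ranks)]
    for _ in range(iters):
        for n in range(x.ndim):
            y = x
            for m in range(x.ndim):
                if m != n:
                    y = ttm(y, factors[m].T, m)          # project other modes
            unfold = np.moveaxis(y, n, 0).reshape(y.shape[n], -1)
            factors[n] = np.linalg.svd(unfold)[0][:, :ranks[n]]
    core = x
    for n in range(x.ndim):
        core = ttm(core, factors[n].T, n)                # form the core tensor
    return core, factors

core, factors = hooi(np.random.default_rng(0).random((20, 30, 40)), (5, 5, 5))
print(core.shape)  # (5, 5, 5)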
Citations: 30
Mimir: Memory-Efficient and Scalable MapReduce for Large Supercomputing Systems
Tao Gao, Yanfei Guo, Boyu Zhang, Pietro Cicotti, Yutong Lu, P. Balaji, M. Taufer
In this paper we present Mimir, a new implementation of MapReduce over MPI. Mimir inherits the core principles of existing MapReduce frameworks, such as MR-MPI, while redesigning the execution model to incorporate a number of sophisticated optimization techniques that achieve similar or better performance with a significant reduction in the amount of memory used. Consequently, Mimir allows significantly larger problems to be executed in memory, achieving large performance gains. We evaluate Mimir with three benchmarks on two high-end platforms to demonstrate its superiority over other frameworks.
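The MapReduce-over-MPI programming model that Mimir implements can be illustrated with a small word-count sketch using mpi4py: map locally, partition intermediate pairs by a stable key hash, shuffle with an all-to-all exchange, and reduce locally. This shows the model only; Mimir's memory-efficient internals are not reproduced, and the input text is a stand-in:

# Word-count sketch of MapReduce over MPI (run: mpiexec -n 4 python wc.py).
import zlib
from collections import Counter
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

local_text = "the quick brown fox jumps over the lazy dog"  # stand-in input

# Map: emit (word, 1); partition by a stable hash so each key owns one rank.
buckets = [[] for _ in range(size)]
for word in local_text.split():
    buckets[zlib.crc32(word.encode()) % size].append((word, 1))

# Shuffle: all-to-all exchange of the per-destination buckets.
received = comm.alltoall(buckets)

# Reduce: sum counts for the keys owned by this rank.
counts = Counter()
for bucket in received:
    for word, one in bucket:
        counts[word] += one
print(comm.Get_rank(), dict(counts))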
Citations: 24
ATM: Approximate Task Memoization in the Runtime System
I. Brumar, Marc Casas, Miquel Moretó, M. Valero, G. Sohi
Redundant computations appear during the execution of real programs. Multiple factors contribute to these unnecessary computations, such as repetitive inputs and patterns, calling functions with the same parameters, or bad programming habits. Compilers minimize non-useful code with static analysis. However, redundant execution might be dynamic, and there are no current approaches to reduce these inefficiencies. Additionally, many algorithms can be computed with different levels of accuracy. Approximate computing exploits this fact to reduce execution time at the cost of slightly less accurate results. In this case, expert developers determine the desired tradeoff between performance and accuracy for each application. In this paper, we present Approximate Task Memoization (ATM), a novel approach in the runtime system that transparently exploits both dynamic redundancy and approximation at the task granularity of a parallel application. Memoization of previous task executions allows predicting the results of future tasks without having to execute them and without losing accuracy. To further increase the performance improvements, the runtime system can memoize similar tasks, which leads to approximate task computing. By defining how to measure task similarity and correctness, we present an adaptive algorithm in the runtime system that automatically decides whether task approximation is beneficial. When evaluated on a real 8-core processor with applications from different domains (financial analysis, stencil computation, machine learning, and linear algebra), ATM achieves a 1.4x average speedup when applying only memoization techniques. When adding task approximation, ATM achieves a 2.5x average speedup with an average 0.7% accuracy loss (maximum of 3.2%).
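Task-level approximate memoization can be sketched as follows: quantize a task's inputs to form a lookup key so that sufficiently similar invocations reuse a stored result. The quantization-based similarity metric and the tolerance are illustrative assumptions; ATM's runtime additionally decides adaptively whether approximation pays off:

# Sketch of approximate task memoization via input quantization
# (an assumed similarity metric, for illustration only).
import math

memo = {}

def approx_memoized(task, args, tol=1e-3):
    # Round each float input to the tolerance grid; nearby inputs collide.
    key = (task.__name__,) + tuple(round(a / tol) for a in args)
    if key not in memo:
        memo[key] = task(*args)   # miss: execute the task and remember it
    return memo[key]              # hit: skip execution entirely

def heavy_kernel(x, y):
    return math.sin(x) * math.cos(y)   # stand-in for an expensive task

print(approx_memoized(heavy_kernel, (0.10000, 0.2)))
print(approx_memoized(heavy_kernel, (0.10004, 0.2)))  # reuses the first result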
Citations: 13
Partitioning Low-Diameter Networks to Eliminate Inter-Job Interference
Nikhil Jain, A. Bhatele, Xiang Ni, T. Gamblin, L. Kalé
On most supercomputers, except some torus-network-based systems, resource managers allocate nodes to jobs without considering the sharing of network resources by different jobs. Such network-oblivious resource allocations result in link sharing among multiple jobs, which can cause significant performance variability and performance degradation for individual jobs. In this paper, we explore low-diameter networks and corresponding node allocation policies that can eliminate inter-job interference. We propose a variant of n-dimensional mesh networks called the express mesh. An express mesh is denser than the corresponding mesh network, has a low diameter independent of the number of routers, and is easily partitionable. We compare the structural properties and performance of the express mesh with other popular low-diameter networks. We present practical node allocation policies for express mesh and fat-tree networks that not only eliminate inter-job interference and performance variability, but also improve overall performance.
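The diameter claim can be checked with a small BFS experiment. Since the abstract does not spell out the express-mesh wiring here, the sketch below assumes full connectivity among routers sharing all but one coordinate (flattened-butterfly style), which yields a diameter equal to the number of dimensions regardless of routers per dimension:

# BFS diameter comparison: 8x8 mesh vs. assumed per-dimension express links.
from itertools import product
from collections import deque

def diameter(nodes, neighbors):
    best = 0
    for src in nodes:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in neighbors(u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

k, dims = 8, 2
nodes = list(product(range(k), repeat=dims))

def mesh_nbrs(u):      # plain mesh: +/-1 along each dimension
    for d in range(dims):
        for step in (-1, 1):
            if 0 <= u[d] + step < k:
                yield u[:d] + (u[d] + step,) + u[d+1:]

def express_nbrs(u):   # assumed express wiring: full row/column connectivity
    for d in range(dims):
        for c in range(k):
            if c != u[d]:
                yield u[:d] + (c,) + u[d+1:]

print(diameter(nodes, mesh_nbrs))     # 14: mesh diameter grows with k
print(diameter(nodes, express_nbrs))  # 2: one hop per dimension, independent of k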
Citations: 16