
Latest publications: 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS)

Fast GPU parallel N-Body tree traversal with Simulated Wide-Warp
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097874
Wagner M. Nunan Zola, L. C. E. Bona, Fabiano Silva
The Barnes-Hut algorithm is a widely used approximation method for the N-Body simulation problem. The irregular nature of this tree-walking code presents interesting challenges for its computation on parallel systems. Additional problems arise in effectively exploiting the processing capacity of GPU architectures. We propose and investigate the applicability of software Simulated Wide-Warps (SWW) in this context. To this end, we explicitly deal with dynamic irregular patterns in data accesses through data remapping and data transformation, by controlling the execution-flow divergence of threads. We present a new compact data structure for the tree layout, along with GPU parallel algorithms for tree transformation and parallel walking using SWW. The benefit of our techniques lies in transposing the tree algorithm to execute regular patterns that match the GPU model. Our experiments show significant performance improvement over the best known GPU solutions to this algorithm.
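The abstract does not give implementation details, so the following is only a minimal CPU-side sketch of the Barnes-Hut ingredients it refers to: the opening-angle acceptance test and an iterative, stack-based tree walk of the kind that GPU traversals typically regularize. The names (`Node`, `theta`) and the 2-D setting are illustrative assumptions, not the paper's SWW code.

```python
import math

# Minimal Barnes-Hut sketch (not the paper's SWW GPU implementation).
# The opening-angle test node_size / distance < theta is the standard
# Barnes-Hut approximation criterion.

class Node:
    def __init__(self, mass, cx, cy, size, children=None):
        self.mass = mass            # total mass of bodies under this node
        self.cx, self.cy = cx, cy   # centre of mass
        self.size = size            # side length of the node's cell
        self.children = children or []

def accumulate_force(root, px, py, theta=0.5, eps=1e-3):
    """Walk the tree iteratively with an explicit stack, which is closer to
    how a GPU traversal is structured than a recursive walk."""
    fx = fy = 0.0
    stack = [root]
    while stack:
        node = stack.pop()
        dx, dy = node.cx - px, node.cy - py
        dist = math.sqrt(dx * dx + dy * dy) + eps
        if not node.children or node.size / dist < theta:
            # Far enough away (or a leaf): treat the whole cell as one mass.
            f = node.mass / (dist * dist)
            fx += f * dx / dist
            fy += f * dy / dist
        else:
            # Too close: open the cell and descend into its children.
            stack.extend(node.children)
    return fx, fy

leaf1 = Node(1.0, 0.0, 0.0, 0.5)
leaf2 = Node(1.0, 1.0, 1.0, 0.5)
root = Node(2.0, 0.5, 0.5, 1.0, children=[leaf1, leaf2])
print(accumulate_force(root, 2.0, 2.0))
```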
Citations: 7
A distributed real-time operating system built with aspect-oriented programming for distributed embedded control systems
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097839
Nobuhiro Saito, Myungryun Yoo, T. Yokoyama
The paper presents a method for building a distributed real-time operating system for distributed embedded control systems using aspect-oriented programming. We define aspects that weave distributed computing mechanisms into an existing real-time operating system. By using these aspects, we can build a distributed operating system without modifying the original source code, which improves the maintainability of the source code of a real-time operating system family. We have applied the aspects to an OSEK OS and obtained a distributed operating system that provides location-transparent system calls for task management and inter-task synchronization. The evaluation results show that the overhead of aspect-oriented programming is small enough for practical use.
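The paper targets an OSEK OS (presumably woven with a C-level aspect weaver), so the Python sketch below is only a conceptual illustration of the weaving idea: a location-transparency concern is wrapped around unchanged system calls via a decorator. All names (`REMOTE_TASKS`, `activate_task`, `send_remote`) are hypothetical.

```python
import functools

# Conceptual illustration of aspect weaving: a distribution concern is added
# around unmodified "system calls" without touching their bodies. This mirrors
# the idea only; the paper works on an OSEK OS in C with an aspect weaver.

REMOTE_TASKS = {"sensor_task": "node-2"}        # task name -> remote node

def send_remote(node, call, *args):
    print(f"forwarding {call}{args} to {node}")  # stand-in for network transport

def distributed(syscall):
    """'Advice' that makes a local system call location-transparent."""
    @functools.wraps(syscall)
    def wrapper(task, *args):
        node = REMOTE_TASKS.get(task)
        if node is not None:
            return send_remote(node, syscall.__name__, task, *args)
        return syscall(task, *args)              # local case: original code untouched
    return wrapper

@distributed                                     # weaving point
def activate_task(task):
    print(f"activating {task} locally")

activate_task("control_task")                    # runs locally
activate_task("sensor_task")                     # transparently forwarded to node-2
```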
Citations: 4
M2M-enabled real-time Trip Planner
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097902
Eduardo Cerritos, F. Lin
Uncertainty is a key factor that prevents commuters from using public transportation systems. More and more transportation agencies are incorporating real-time Trip Planners to empower commuters with timely information. However, such systems require continuous status updates from the vehicles and incur expensive communication costs. In this paper we propose an architecture that takes advantage of Machine-to-Machine Communication concepts and provides a degree of intelligence to the vehicles, to alleviate unnecessary communication between the vehicles and the Trip Planner.
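The abstract does not spell out how the vehicle-side intelligence works; one plausible reading is event-triggered reporting, where a vehicle contacts the Trip Planner only when its schedule deviation changes by more than a threshold. The sketch below is an assumption along those lines, not the paper's architecture.

```python
# Hypothetical event-triggered reporting: a vehicle updates the Trip Planner
# only when its delay estimate moves by more than a threshold, instead of
# streaming its position continuously.

REPORT_THRESHOLD_S = 60        # only report when the delay shifts by a minute

class TripPlanner:
    def __init__(self):
        self.delays = {}

    def update_delay(self, vehicle_id, delay_s):
        self.delays[vehicle_id] = delay_s
        print(f"vehicle {vehicle_id}: delay now {delay_s} s")

class Vehicle:
    def __init__(self, vehicle_id):
        self.vehicle_id = vehicle_id
        self.last_reported_delay = 0

    def on_position_update(self, estimated_delay_s, trip_planner):
        # Report only significant changes, suppressing redundant M2M traffic.
        if abs(estimated_delay_s - self.last_reported_delay) >= REPORT_THRESHOLD_S:
            trip_planner.update_delay(self.vehicle_id, estimated_delay_s)
            self.last_reported_delay = estimated_delay_s

planner = TripPlanner()
bus = Vehicle("bus-42")
for delay in (5, 20, 70, 75, 140):   # only the 70 s and 140 s updates are sent
    bus.on_position_update(delay, planner)
```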
Citations: 1
Be a good neighbour: Characterizing performance interference of virtual machines under xen virtualization environments
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097816
Ruiqing Chi, Zhuzhong Qian, Sanglu Lu
With the rapid development of virtualization techniques, modern data centers have moved into a new era of cloud computing in recent years. Despite numerous advantages such as high resource utilization and rapid service scalability, current virtualization techniques do not guarantee perfect performance isolation among virtual machines sharing a physical machine, which may lead to unstable and unpredictable user-perceived application performance in clouds. Therefore, understanding and modeling performance interference among collocated applications is of utmost importance. However, the hypervisor and guest OSes usually run independent resource schedulers and are invisible to each other, making it non-trivial to accurately characterize performance interference. In this paper, we first present a comprehensive experimental study of the performance interference of different combinations of benchmarks, observing that virtual CPU floating overhead between multiple physical CPUs, and VMEXITs, i.e., the control transitions between the hypervisor and VMs, constitute the key sources of performance interference. In order to characterize the performance interference effects, we measure both application-level and VM-level characteristics from the collocated applications and then build a novel interference prediction framework based on kernel canonical correlation analysis (KCCA). Our evaluations first show the practicability of KCCA in finding reliable correlations, and further confirm the high accuracy and broad applicability of our interference model, with a low prediction error of no more than 7.9%.
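The framework correlates VM-level counters with application-level metrics via kernel CCA. As a rough stand-in, the sketch below uses scikit-learn's plain linear CCA (not the kernelized variant the paper uses) on synthetic data; the feature names and the generated data are invented purely for illustration.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Stand-in for the paper's KCCA step: relate VM-level metrics
# (e.g. VMEXIT rate, vCPU migrations) to application-level metrics
# (e.g. latency, throughput). Linear CCA is used here instead of kernel CCA,
# and all data is synthetic.

rng = np.random.default_rng(0)
n = 200
vm_metrics = rng.normal(size=(n, 3))   # [vmexit_rate, vcpu_migrations, steal_time]
latent = vm_metrics @ np.array([[0.8], [0.5], [0.1]])
app_metrics = np.hstack([latent + 0.1 * rng.normal(size=(n, 1)),     # latency
                         -latent + 0.1 * rng.normal(size=(n, 1))])   # throughput

cca = CCA(n_components=1)
cca.fit(vm_metrics, app_metrics)
vm_c, app_c = cca.transform(vm_metrics, app_metrics)

# A high first canonical correlation means the VM-level counters carry enough
# signal to predict application-level interference.
corr = np.corrcoef(vm_c[:, 0], app_c[:, 0])[0, 1]
print(f"first canonical correlation: {corr:.3f}")
```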
Citations: 11
pbitMCE: A bit-based approach for maximal clique enumeration on multicore processors
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097844
N. Dasari, D. Ranjan, M. Zubair
Maximal clique enumeration (MCE) is a fundamental problem in graph theory. It plays a vital role in many network analysis applications and in computational biology, and is an extensively studied problem. Recently, Eppstein et al. proposed a state-of-the-art sequential algorithm that uses degeneracy-based ordering of vertices to improve efficiency. In this paper, we propose a new parallel implementation of the algorithm of Eppstein et al. using a new bit-based data structure. The new data structure not only reduces the working set size significantly but also improves the performance of the algorithm by enabling the use of bit-parallelism. We illustrate the significance of degeneracy ordering in load balancing and experimentally evaluate the impact of scheduling on the performance of the algorithm. We present experimental results on several types of synthetic and real-world graphs with up to 50 million vertices and 100 million edges. We show that our approach outperforms Eppstein et al.'s approach by up to 4 times and also scales up to 29 times when run on a multicore machine with 32 cores.
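To make the bit-based idea concrete, the sketch below runs a Bron-Kerbosch-style enumeration with the candidate, exclusion, and clique sets held as bitmasks (Python integers as arbitrary-width bit vectors). It is single-threaded and omits the paper's degeneracy ordering and multicore scheduling; it only illustrates the representation.

```python
# Bron-Kerbosch maximal clique enumeration with the R/P/X sets as bitmasks.
# The paper's pbitMCE builds on a similar bit-based representation, adding
# degeneracy ordering and parallel scheduling, which are not shown here.

def maximal_cliques(adj):
    """adj[v] is a bitmask of the neighbours of vertex v."""
    n = len(adj)
    cliques = []

    def _bits(mask):
        while mask:
            low = mask & -mask
            yield low.bit_length() - 1
            mask ^= low

    def expand(r, p, x):
        if p == 0 and x == 0:
            cliques.append(r)          # r is a maximal clique (as a bitmask)
            return
        # Pivot on a vertex from p|x with the most neighbours in p.
        pivot = max(_bits(p | x), key=lambda u: bin(p & adj[u]).count("1"))
        for v in _bits(p & ~adj[pivot]):
            bit = 1 << v
            expand(r | bit, p & adj[v], x & adj[v])
            p &= ~bit
            x |= bit

    expand(0, (1 << n) - 1, 0)
    return cliques

# Triangle 0-1-2 plus a pendant edge 2-3: the cliques are {0,1,2} and {2,3}.
adj = [0b0110, 0b0101, 0b1011, 0b0100]
print([bin(c) for c in maximal_cliques(adj)])
```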
Citations: 11
Improving utilization through dynamic VM resource allocation in hybrid cloud environment
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097814
Yuda Wang, Renyu Yang, Tianyu Wo, Wenbo Jiang, Chunming Hu
Virtualization is one of the most fascinating techniques because it facilitates infrastructure management and provides isolated execution for running workloads. Despite the benefits gained from virtualization and resource sharing, improved resource utilization is still far from settled due to dynamic resource requirements and the widely used over-provisioning strategy for guaranteed QoS. Additionally, with the emerging demands of big data analytics, how to effectively manage hybrid workloads, such as traditional batch tasks and long-running virtual machine (VM) services, needs to be dealt with. In this paper, we propose a system that combines long-running VM services with typical batch workloads such as MapReduce. The objective is to improve holistic cluster utilization through a dynamic resource adjustment mechanism for VMs without violating other batch workload executions. Furthermore, VM migration is utilized to ensure high availability and avoid potential performance degradation. The experimental results reveal that the dynamically allocated memory is close to the real usage, with only a 10% estimation margin, and the performance impact on VM and MapReduce jobs is within 1% in both cases. Additionally, an increase in resource utilization of up to 50% can be achieved. We believe these findings point in the right direction for solving workload consolidation issues in hybrid computing environments.
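The abstract reports that dynamically allocated memory tracks real usage within a 10% margin. One simple way to realize such tracking is a periodic controller that sets each VM's memory target to its measured working set plus fixed headroom. The sketch below is a guess at that loop; `get_working_set_mb` and `set_memory_target_mb` are placeholders standing in for hypervisor calls (e.g. ballooning), not a real API.

```python
# Hypothetical dynamic-memory controller: keep each VM's allocation about 10%
# above its measured working set, clamped to fixed bounds.

HEADROOM = 0.10
MIN_MB, MAX_MB = 512, 8192

def next_target(working_set_mb):
    target = int(working_set_mb * (1 + HEADROOM))
    return max(MIN_MB, min(MAX_MB, target))

def adjust_all(vms, get_working_set_mb, set_memory_target_mb):
    for vm in vms:
        used = get_working_set_mb(vm)          # placeholder hypervisor query
        target = next_target(used)
        set_memory_target_mb(vm, target)       # placeholder balloon adjustment
        print(f"{vm}: working set {used} MB -> target {target} MB")

# Toy run with fake measurements in place of hypervisor queries.
measurements = {"vm-a": 1500, "vm-b": 300, "vm-c": 9000}
adjust_all(measurements, lambda vm: measurements[vm], lambda vm, mb: None)
```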
Citations: 8
Scaling and analyzing the stencil performance on multi-core and many-core architectures
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097797
L. Gan, H. Fu, Wei Xue, Yangtong Xu, Chao Yang, Xinliang Wang, Zihong Lv, Yang You, Guangwen Yang, Kaijian Ou
Stencils are among the most important and time-consuming kernels in many applications. While stencil optimization has been a well-studied topic on CPU platforms, achieving higher performance and efficiency for the evolving numerical stencils on the more recent multi-core and many-core architectures is still an important issue. In this paper, we explore a number of different stencils, ranging from a basic 7-point Jacobi stencil to more complex high-order stencils used in finer numerical simulations. By optimizing and analyzing those stencils on the latest multi-core and many-core architectures (the Intel Sandy Bridge processor, the Intel Xeon Phi coprocessor, and the NVIDIA Fermi C2070 and Kepler K20x GPUs), we investigate the algorithmic and architectural factors that determine the performance and efficiency of the resulting designs. While multi-threading, vectorization, and optimization for cache and other fast buffers are still the most important techniques for performance, we observe that the different memory hierarchies and the different mechanisms for issuing and executing parallel instructions lead to different performance behaviors on CPU, MIC and GPU. With vector-like processing units becoming the major provider of computing power on almost all architectures, the compiler's inability to align all the computing and memory operations becomes the major bottleneck to achieving high efficiency on current and future platforms. Our specific optimization of the complex WNAD stencil on GPU provides a good example of what the compiler could do to help.
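For reference, here is the simplest case mentioned, a 7-point 3-D Jacobi stencil, written in plain NumPy. The coefficients are illustrative, and none of the architecture-specific optimizations discussed in the paper (blocking, vectorization, GPU/MIC mappings) are applied.

```python
import numpy as np

# Basic 7-point 3-D Jacobi stencil: each interior point is updated from itself
# and its six face neighbours. Coefficients are illustrative (c0 + 6*c1 = 1).

def jacobi_7pt(u, c0=0.4, c1=0.1, iterations=10):
    v = u.copy()
    for _ in range(iterations):
        v[1:-1, 1:-1, 1:-1] = (
            c0 * u[1:-1, 1:-1, 1:-1]
            + c1 * (u[2:, 1:-1, 1:-1] + u[:-2, 1:-1, 1:-1]
                    + u[1:-1, 2:, 1:-1] + u[1:-1, :-2, 1:-1]
                    + u[1:-1, 1:-1, 2:] + u[1:-1, 1:-1, :-2])
        )
        u, v = v, u                    # swap buffers instead of copying
    return u

grid = np.random.rand(64, 64, 64)
print(jacobi_7pt(grid).shape)
```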
Citations: 13
Optimizing Seam Carving on multi-GPU systems for real-time image resizing
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097861
I. Kim, Jidong Zhai, Yan Li, Wenguang Chen
Image resizing is increasingly important for picture sharing and exchange between various personal electronic devices. Seam Carving is a state-of-the-art approach for effective image resizing because of its content-aware characteristic. However, complex computation and memory access patterns make it time-consuming and prevent its wide usage in real-time image processing. To address these problems, we propose a novel algorithm, called Non-Cumulative Seam Carving (NCSC), which removes the main computation bottleneck. Furthermore, we also propose an adaptive multi-seam algorithm for better parallelism on GPU platforms. Finally, we implement our algorithm on a multi-GPU platform. Results show that our approach achieves a maximum 140× speedup on a two-GPU system over the sequential version. It takes only 0.11 seconds to resize a 1024×640 image by half in width, compared to 15.5 seconds with traditional seam carving.
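For context, this is the classic cumulative-energy formulation of vertical seam carving (energy map, dynamic program, backtrack, remove one seam) that the paper's NCSC approach is designed to avoid; the non-cumulative and multi-seam GPU techniques themselves are not reproduced here.

```python
import numpy as np

# Textbook vertical seam carving for a single seam on a grayscale image:
# build an energy map, accumulate it with dynamic programming, backtrack the
# cheapest seam, and remove one pixel per row.

def energy(gray):
    gy, gx = np.gradient(gray.astype(float))
    return np.abs(gx) + np.abs(gy)

def remove_one_seam(gray):
    h, w = gray.shape
    cum = energy(gray)
    for i in range(1, h):
        left = np.roll(cum[i - 1], 1)
        left[0] = np.inf
        right = np.roll(cum[i - 1], -1)
        right[-1] = np.inf
        cum[i] += np.minimum(np.minimum(left, cum[i - 1]), right)
    # Backtrack the minimum-energy seam from bottom to top.
    seam = np.empty(h, dtype=int)
    seam[-1] = int(np.argmin(cum[-1]))
    for i in range(h - 2, -1, -1):
        j = seam[i + 1]
        lo, hi = max(0, j - 1), min(w, j + 2)
        seam[i] = lo + int(np.argmin(cum[i, lo:hi]))
    mask = np.ones((h, w), dtype=bool)
    mask[np.arange(h), seam] = False
    return gray[mask].reshape(h, w - 1)

img = np.random.rand(480, 640)
print(remove_one_seam(img).shape)      # (480, 639)
```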
Citations: 1
GlobLease: A globally consistent and elastic storage system using leases
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097872
Y. Liu, Xiaxi Li, Vladimir Vlassov
Nowadays, more and more IT companies are expanding their businesses and services to a global scale, serving users in several countries. Globally distributed storage systems are employed to reduce data access latency for clients all over the world. We present GlobLease, an elastic, globally distributed and consistent key-value store. It is organised as multiple distributed hash tables (DHTs) storing replicated data and namespace. Across DHTs, data lookups and accesses are processed with respect to the locality of DHT deployments. We explore the use of leases in GlobLease to maintain data consistency across DHTs. The leases enable GlobLease to provide fast and consistent read access at a global scale with reduced global communication. Write accesses are optimized by migrating the master copy to the locations where most of the writes take place. The elasticity of GlobLease is provided in a fine-grained manner in order to precisely and efficiently handle spiky and skewed read workloads. In our evaluation, GlobLease demonstrates optimized global performance in comparison with Cassandra, with read and write latency of less than 10 ms in most cases. Furthermore, our evaluation shows that GlobLease is able to bring the request latency back down within 20 seconds under an instant 4.5-fold workload increase with a skewed key distribution (a Zipfian distribution with an exponent factor of 4).
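The abstract does not detail the lease protocol, but the general idea of lease-based reads is that a replica may serve reads locally while it holds an unexpired lease, and a write must wait for (or revoke) outstanding leases. The sketch below illustrates only that idea; the durations, names, and single-master simplification are invented, and GlobLease's actual cross-DHT protocol is more involved.

```python
import time

# Minimal illustration of lease-based consistent reads: a replica answers reads
# locally while its lease is valid; the master waits out outstanding leases
# before applying a write and re-granting leases with the new value.

LEASE_SECONDS = 2.0

class Replica:
    def __init__(self):
        self.value = None
        self.lease_expiry = 0.0

    def grant(self, value):
        self.value = value
        self.lease_expiry = time.time() + LEASE_SECONDS

    def read(self):
        if time.time() < self.lease_expiry:
            return self.value          # fast local read under a valid lease
        raise TimeoutError("lease expired: must refresh from the master")

class Master:
    def __init__(self, replicas):
        self.value = 0
        self.replicas = replicas

    def write(self, value):
        # Simplification: wait for the longest outstanding lease to expire.
        wait = max((r.lease_expiry for r in self.replicas), default=0.0) - time.time()
        if wait > 0:
            time.sleep(wait)
        self.value = value
        for r in self.replicas:
            r.grant(value)

replicas = [Replica(), Replica()]
master = Master(replicas)
master.write(42)
print(replicas[0].read())              # served locally, no cross-site round trip
```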
Citations: 5
Wireless transmission modeling for Vehicular Ad-hoc Networks
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097834
S. Rehman, M. A. Khan, T. Zia
Modeling wireless transmission in stringent networks such as VANETs is a challenging task. It requires mathematically incorporating all the environmental effects present in such a dynamic environment. The key attributes for modeling the wireless channel are the physical constraints inherent to such networks, such as the lack of permanent infrastructure, limited knowledge of vehicle positions, and interference that affects the received signal strength at each vehicle position. The selection of an appropriate transmission model plays a key role in routing decisions for VANETs. This paper investigates such wireless transmission models for vehicular communication and identifies the situations in which a particular model can be beneficial. The paper also provides insight into the use of practical parameters in theoretical transmission models. An analysis of the proposed transmission model is presented, along with the performance of different transmission models in terms of received signal strength (RSS). These results help to select the transmission model that best suits a particular VANET communication scenario.
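The abstract does not name the specific models it compares; two textbook candidates for vehicular links are the free-space (Friis) and two-ray ground-reflection models, sketched below for RSS prediction. The parameter values (5.9 GHz carrier, antenna heights, transmit power) are illustrative assumptions only.

```python
import math

# Two standard transmission models often considered for VANET links:
# free-space (Friis) path loss and the two-ray ground-reflection model.
# All parameter values are made up for illustration.

def friis_rx_dbm(pt_dbm, d_m, freq_hz=5.9e9, gt_db=0.0, gr_db=0.0):
    lam = 3e8 / freq_hz
    path_loss_db = 20 * math.log10(4 * math.pi * d_m / lam)
    return pt_dbm + gt_db + gr_db - path_loss_db

def two_ray_rx_dbm(pt_dbm, d_m, ht_m=1.5, hr_m=1.5, gt_db=0.0, gr_db=0.0):
    # Valid in the far region where d >> ht*hr; path loss grows with d^4.
    path_loss_db = 40 * math.log10(d_m) - 20 * math.log10(ht_m * hr_m)
    return pt_dbm + gt_db + gr_db - path_loss_db

for d in (50, 200, 800):
    print(f"d={d:4d} m  friis={friis_rx_dbm(20, d):7.1f} dBm  "
          f"two-ray={two_ray_rx_dbm(20, d):7.1f} dBm")
```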
Citations: 5