2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS): Latest Papers

POPS: A popularity-aware live streaming service
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097873
Karine Pires, Sébastien Monnet, Pierre Sens
Live streaming has become very popular, and many systems, such as justin.tv, have emerged. They collect users' live streams and serve them to viewers through broadcasting servers. However, the huge variation in the total number of viewers and the great heterogeneity in stream popularity generally imply over-provisioning, which wastes significant resources. In this paper, we show that there is a trade-off between the number of servers involved in broadcasting the streams and the bandwidth usage among the servers. We also stress the importance of predicting stream popularity in order to place streams efficiently on the servers. We propose POPS: a live streaming service that uses popularity predictions to map live streams onto servers.
Citations: 4
Fast GPU parallel N-Body tree traversal with Simulated Wide-Warp
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097874
Wagner M. Nunan Zola, L. C. E. Bona, Fabiano Silva
The Barnes-Hut algorithm is a widely used approximation method for the N-Body simulation problem. The irregular nature of this tree-walking code presents interesting challenges for its computation on parallel systems. Additional problems arise in effectively exploiting the processing capacity of GPU architectures. We propose and investigate the applicability of software Simulated Wide-Warps (SWW) in this context. To this end, we explicitly deal with dynamic irregular patterns in data accesses through data remapping and data transformation, and by controlling the execution flow divergence of threads. We present a new compact data structure for the tree layout, and GPU parallel algorithms for tree transformation and parallel walking using SWW. The benefit of our techniques lies in transforming the tree algorithm so that it executes regular patterns that match the GPU model. Our experiments show significant performance improvement over the best-known GPU solutions to this algorithm.
Citations: 7
Improving utilization through dynamic VM resource allocation in hybrid cloud environment
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097814
Yuda Wang, Renyu Yang, Tianyu Wo, Wenbo Jiang, Chunming Hu
Virtualization is one of the most fascinating techniques because it facilitates infrastructure management and provides isolated execution for running workloads. Despite the benefits gained from virtualization and resource sharing, the problem of improving resource utilization is still far from settled, owing to dynamic resource requirements and the widely used over-provisioning strategy for guaranteeing QoS. Additionally, with the emerging demand for big data analytics, hybrid workloads such as traditional batch tasks and long-running virtual machine (VM) services need to be managed effectively. In this paper, we propose a system that combines long-running VM services with typical batch workloads such as MapReduce. The objective is to improve overall cluster utilization through a dynamic resource adjustment mechanism for VMs without disrupting the execution of other batch workloads. Furthermore, VM migration is used to ensure high availability and avoid potential performance degradation. The experimental results reveal that the dynamically allocated memory is close to the real usage with only a 10% estimation margin, and that the performance impact on both VM and MapReduce jobs is within 1%. Additionally, resource utilization can be increased by up to 50%. We believe that these findings are a step in the right direction toward solving workload consolidation issues in hybrid computing environments.
Citations: 8
pbitMCE: A bit-based approach for maximal clique enumeration on multicore processors
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097844
N. Dasari, D. Ranjan, M. Zubair
Maximal clique enumeration (MCE) is a fundamental and extensively studied problem in graph theory. It plays a vital role in many network analysis applications and in computational biology. Recently, Eppstein et al. proposed a state-of-the-art sequential algorithm that uses degeneracy-based ordering of vertices to improve efficiency. In this paper, we propose a new parallel implementation of the algorithm of Eppstein et al. using a new bit-based data structure. The new data structure not only reduces the working set size significantly but also improves the performance of the algorithm by enabling the use of bit-parallelism. We illustrate the significance of degeneracy ordering in load balancing and experimentally evaluate the impact of scheduling on the performance of the algorithm. We present experimental results on several types of synthetic and real-world graphs with up to 50 million vertices and 100 million edges. We show that our approach outperforms Eppstein et al.'s approach by up to 4 times and also scales up to 29 times when run on a multicore machine with 32 cores.
Citations: 11
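To make the idea of a bit-based clique enumeration in the pbitMCE abstract more concrete, the following is a minimal sequential sketch of Bron-Kerbosch with pivoting in which every vertex set is a Python integer used as a bitmask, so set intersection becomes a single `&`. It is not the pbitMCE implementation: it omits the degeneracy ordering and the parallel scheduling the abstract mentions, and the names `build_adjacency`, `bits`, and `maximal_cliques` are hypothetical.

```python
def build_adjacency(n, edges):
    """Return one integer bitmask per vertex; bit u of adj[v] is set iff u and v are adjacent."""
    adj = [0] * n
    for u, v in edges:
        adj[u] |= 1 << v
        adj[v] |= 1 << u
    return adj


def bits(mask):
    """List the indices of the set bits of mask in ascending order."""
    out = []
    while mask:
        low = mask & -mask              # isolate the lowest set bit
        out.append(low.bit_length() - 1)
        mask ^= low
    return out


def maximal_cliques(adj):
    """Enumerate all maximal cliques (Bron-Kerbosch with pivoting, bitmask sets)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidate vertices, x: excluded vertices (all bitmasks)
        if p == 0 and x == 0:
            cliques.append(bits(r))
            return
        # Pivot: the vertex of p | x with the most neighbours among the candidates.
        pivot = max(bits(p | x), key=lambda u: bin(p & adj[u]).count("1"))
        # Only candidates that are NOT neighbours of the pivot need to be tried.
        for v in bits(p & ~adj[pivot]):
            low = 1 << v
            expand(r | low, p & adj[v], x & adj[v])
            p &= ~low                   # v is done: move it from p ...
            x |= low                    # ... to x

    expand(0, (1 << len(adj)) - 1, 0)
    return cliques
```

For example, `maximal_cliques(build_adjacency(5, [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]))` yields the triangle on vertices 0, 1, 2 and the two edges 2-3 and 3-4. A compact bitmask representation keeps each candidate set in a few machine words, which is one plausible reading of the reduced working set the abstract reports.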
A distributed real-time operating system built with aspect-oriented programming for distributed embedded control systems
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097839
Nobuhiro Saito, Myungryun Yoo, T. Yokoyama
The paper presents a method for building a distributed real-time operating system for distributed embedded control systems using aspect-oriented programming. We define aspects that weave distributed computing mechanisms into an existing real-time operating system. By using the aspects, we can build a distributed operating system without modifying the original source code. This improves the maintainability of the source code of a real-time operating system family. We have applied the aspects to an OSEK OS and obtained a distributed operating system that provides location-transparent system calls for task management and inter-task synchronization. The evaluation results show that the overhead of aspect-oriented programming is small enough in practice.
Citations: 4
Optimizing Seam Carving on multi-GPU systems for real-time image resizing
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097861
I. Kim, Jidong Zhai, Yan Li, Wenguang Chen
Image resizing is increasingly important for sharing and exchanging pictures between various personal electronic devices. Seam Carving is a state-of-the-art approach for effective image resizing because of its content-aware characteristic. However, its complex computation and memory access patterns make it time-consuming and prevent its wide usage in real-time image processing. To address these problems, we propose a novel algorithm, called Non-Cumulative Seam Carving (NCSC), which removes the main computation bottleneck. Furthermore, we also propose an adaptive multi-seam algorithm for better parallelism on GPU platforms. Finally, we implement our algorithm on a multi-GPU platform. Results show that our approach achieves a maximum 140× speedup on a two-GPU system over the sequential version. It takes only 0.11 seconds to resize a 1024×640 image by half in width, compared to 15.5 seconds with traditional seam carving.
Citations: 1
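For context on the bottleneck named in the Seam Carving abstract, here is a minimal sketch of the traditional, cumulative formulation: a dynamic program that accumulates energy row by row and then backtracks the cheapest vertical seam. This is the classic baseline, not the paper's NCSC algorithm, whose details the abstract does not give. The function names and the plain list-of-lists energy map are assumptions made for illustration.

```python
def find_vertical_seam(energy):
    """Return, for each row, the column of the minimum-energy vertical seam."""
    h, w = len(energy), len(energy[0])
    cost = [row[:] for row in energy]          # cumulative energy table
    for y in range(1, h):
        for x in range(w):
            lo, hi = max(x - 1, 0), min(x + 1, w - 1)
            # Each cell depends on up to three cells of the previous row,
            # so rows must be processed one after another.
            cost[y][x] += min(cost[y - 1][lo:hi + 1])
    # Backtrack from the cheapest cell in the last row.
    x = min(range(w), key=lambda c: cost[h - 1][c])
    seam = [x]
    for y in range(h - 1, 0, -1):
        lo, hi = max(x - 1, 0), min(x + 1, w - 1)
        x = min(range(lo, hi + 1), key=lambda c: cost[y - 1][c])
        seam.append(x)
    seam.reverse()
    return seam


def remove_seam(image, seam):
    """Drop one pixel per row, narrowing the image by one column."""
    return [row[:x] + row[x + 1:] for row, x in zip(image, seam)]
```

Halving the width of an image repeats this search once per removed column, and the row-to-row dependency in the cumulative table limits parallelism within each search, which is consistent with the abstract's description of the traditional method as time-consuming.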
GlobLease: A globally consistent and elastic storage system using leases
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097872
Y. Liu, Xiaxi Li, Vladimir Vlassov
Nowadays, more and more IT companies are expanding their businesses and services to a global scale, serving users in several countries. Globally distributed storage systems are employed to reduce data access latency for clients all over the world. We present GlobLease, an elastic, globally distributed and consistent key-value store. It is organised as multiple distributed hash tables (DHTs) storing replicated data and namespace. Across DHTs, data lookups and accesses are processed with respect to the locality of DHT deployments. We explore the use of leases in GlobLease to maintain data consistency across DHTs. The leases enable GlobLease to provide fast and consistent read access on a global scale with reduced global communication. Write accesses are optimized by migrating the master copy to the locations where most of the writes take place. The elasticity of GlobLease is provided in a fine-grained manner in order to handle spiky and skewed read workloads precisely and efficiently. In our evaluation, GlobLease has demonstrated its optimized global performance in comparison with Cassandra, with read and write latency of less than 10 ms in most cases. Furthermore, our evaluation shows that GlobLease is able to bring down the request latency in less than 20 seconds under an instantaneous 4.5-fold workload increase with a skewed key distribution (a Zipfian distribution with an exponent factor of 4).
Citations: 5
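The GlobLease abstract says that leases allow fast, consistent reads with reduced global communication. The sketch below illustrates the general lease idea only: a replica may answer reads locally until its lease expires, and the master delays a write while any lease on that key is outstanding. Delaying writes until expiry is one classic lease design and may differ from GlobLease's actual protocol. The class names `Master` and `Replica`, the `lease_seconds` parameter, and the explicit `now` argument (used instead of a real clock) are all hypothetical.

```python
class Master:
    """Holds the master copy and grants time-bounded read leases."""

    def __init__(self, lease_seconds):
        self.data = {}
        self.lease_seconds = lease_seconds
        self.lease_expiry = {}          # key -> latest expiry time granted

    def read_with_lease(self, key, now):
        expiry = now + self.lease_seconds
        self.lease_expiry[key] = max(expiry, self.lease_expiry.get(key, 0.0))
        return self.data.get(key), expiry

    def write(self, key, value, now):
        # Refuse the write while a replica may still serve the old value.
        if now < self.lease_expiry.get(key, 0.0):
            return False                # caller retries after the lease expires
        self.data[key] = value
        return True


class Replica:
    """Serves reads from a local cache while the lease is valid."""

    def __init__(self, master):
        self.master = master
        self.cache = {}                 # key -> (value, lease expiry)
        self.remote_reads = 0

    def read(self, key, now):
        entry = self.cache.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]             # local read, no global communication
        self.remote_reads += 1
        value, expiry = self.master.read_with_lease(key, now)
        self.cache[key] = (value, expiry)
        return value
```

With a 10-second lease, a replica that reads a key at time 1 can answer further reads locally until time 11, and a write arriving at time 6 is held back until the lease has run out. The trade-off is visible directly: longer leases mean fewer remote reads but longer write delays.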
Effective multi-GPU communication using multiple CUDA streams and threads
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097919
Mohammed Sourouri, T. Gillberg, S. Baden, Xing Cai
In the context of multiple GPUs that share the same PCIe bus, we propose a new communication scheme that leads to a more effective overlap of communication and computation. Multiple CUDA streams and OpenMP threads are adopted so that data can simultaneously be sent and received. A representative 3D stencil example is used to demonstrate the effectiveness of our scheme. We compare the performance of our new scheme with an MPI-based state-of-the-art scheme. Results show that our approach outperforms the state-of-the-art scheme, being up to 1.85× faster. However, our performance results also indicate that the current underlying PCIe bus architecture needs improvements to handle the future scenario of many GPUs per node.
Citations: 26
Scaling and analyzing the stencil performance on multi-core and many-core architectures
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097797
L. Gan, H. Fu, Wei Xue, Yangtong Xu, Chao Yang, Xinliang Wang, Zihong Lv, Yang You, Guangwen Yang, Kaijian Ou
Stencils are among the most important and time-consuming kernels in many applications. While stencil optimization has been a well-studied topic on CPU platforms, achieving higher performance and efficiency for the evolving numerical stencils on the more recent multi-core and many-core architectures is still an important issue. In this paper, we explore a number of different stencils, ranging from a basic 7-point Jacobi stencil to more complex high-order stencils used in finer numerical simulations. By optimizing and analyzing those stencils on the latest multi-core and many-core architectures (the Intel Sandy Bridge processor, the Intel Xeon Phi coprocessor, and the NVIDIA Fermi C2070 and Kepler K20x GPUs), we investigate the algorithmic and architectural factors that determine the performance and efficiency of the resulting designs. While multi-threading, vectorization, and optimization for cache and other fast buffers are still the most important techniques for performance, we observe that differences in the memory hierarchy and in the mechanisms for issuing and executing parallel instructions lead to different performance behaviors on CPU, MIC and GPU. With vector-like processing units becoming the major provider of computing power on almost all architectures, the compiler's inability to align all the computing and memory operations would become the major bottleneck to achieving high efficiency on current and future platforms. Our specific optimization of the complex WNAD stencil on GPU provides a good example of what the compiler could do to help.
Citations: 13
A quorum-based channel hopping scheme for jamming resilience
Pub Date : 2014-12-01 DOI: 10.1109/PADSW.2014.7097904
Jen-Feng Huang, Guey-Yun Chang, Guo-Xun Hung
Jamming attacks have become a wireless security threat that disrupts reliable RF communication. Frequency Hopping Spread Spectrum (FHSS) is a widely used technique for anti-jamming wireless communication. However, FHSS relies on a pre-shared secret key between the communicating pair, and secure key exchange is itself challenging. Recently, Uncoordinated Frequency Hopping (UFH) schemes have been studied to achieve resilient key establishment in the presence of a jammer. Existing UFH schemes, however, suffer from problems such as unguaranteed rendezvous. In this paper, we introduce a Jamming-Resilient Asynchronous-Symmetric UFH scheme, AJCH, that guarantees rendezvous.
Citations: 2