
Latest publications from the 2015 44th International Conference on Parallel Processing

Slowing Little Quickens More: Improving DCTCP for Massive Concurrent Flows
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.78
Mao Miao, Peng Cheng, Fengyuan Ren, Ran Shu
DCTCP is a potential TCP replacement designed to satisfy the requirements of data center networks, and it has received wide attention in both academia and industry. However, DCTCP only supports tens of concurrent flows well, and it suffers timeouts and throughput collapse when facing numerous concurrent flows. This falls far short of what data center networks require: data centers employing the partition/aggregation pattern usually involve hundreds of concurrent flows. In this paper, after tracing DCTCP's dynamic behavior through experiments, we identify two root causes of DCTCP's failure under the high fan-in traffic pattern: (1) the regulation mechanism of the sending window becomes ineffective once cwnd has been decreased to its minimum size, and (2) the bursts induced by synchronized flows with small cwnd cause fatal packet loss, leading to severe timeouts. We enhance DCTCP to support massive concurrent flows by regulating the sending time interval and desynchronizing the sending times under these particular conditions. The new protocol, called DCTCP+, outperforms DCTCP when the number of concurrent flows grows to several hundred. DCTCP+ effectively supports the short concurrent query responses in a benchmark drawn from real production clusters, and keeps the same good performance when mixed with background traffic.
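The abstract does not spell out DCTCP+'s exact pacing rule, but the core idea — pacing packets over an RTT instead of sending them back to back once cwnd is pinned at its minimum, and jittering the interval so that synchronized flows drift apart — can be sketched as follows (all names and constants below are illustrative, not the authors' code):

#include <algorithm>
#include <chrono>
#include <random>

// Illustrative only: once cwnd is pinned at its minimum (e.g. 2 MSS), pace the
// few remaining packets across one RTT instead of sending them back to back,
// and jitter the interval so that synchronized flows drift apart over time.
std::chrono::microseconds next_send_delay(double cwnd_pkts,
                                          double min_cwnd_pkts,
                                          std::chrono::microseconds rtt,
                                          std::mt19937 &rng) {
    if (cwnd_pkts > min_cwnd_pkts)
        return std::chrono::microseconds{0};   // normal window-based sending
    long n = std::max<long>(1, static_cast<long>(min_cwnd_pkts));
    auto interval = rtt / n;                   // spread cwnd's packets over one RTT
    std::uniform_real_distribution<double> jitter(0.75, 1.25);   // desynchronize
    return std::chrono::microseconds{
        static_cast<long long>(interval.count() * jitter(rng))};
}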
Citations: 6
On Maximizing Reliability of Lifetime Constrained Data Aggregation Tree in Wireless Sensor Networks
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.17
M. Shan, Guihai Chen, Fan Wu, Xiaobing Wu, Xiaofeng Gao, Pan Wu, Haipeng Dai
Tree-based routing structures are widely used to gather data in wireless sensor networks. On top of these tree structures, in-network aggregation is adopted to reduce transmissions, save energy, and prolong the network lifetime. Most existing work on lifetime optimization for data aggregation does not take link quality into consideration. In this paper, we study the problem of Maximizing Reliability of Lifetime Constrained data aggregation trees (MRLC) in WSNs. Given the NP-completeness of the MRLC problem, we propose the Iterative Relaxation Algorithm (IRA), which iteratively relaxes the optimization program and finds an aggregation tree that meets the lifetime bound at a sub-optimal cost. To suit the distributed nature of WSNs in practice, we further propose a Prufer-code-based distributed updating protocol. Through extensive simulations, we demonstrate that IRA outperforms the best known related work in terms of reliability.
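The distributed updating protocol mentioned above encodes candidate trees as Prufer sequences. Independently of the paper's protocol details, the standard decoding of a Prufer sequence back into tree edges — the primitive such a protocol relies on — looks like this (a textbook sketch, not the authors' implementation):

#include <iterator>
#include <set>
#include <utility>
#include <vector>

// Decode a Prufer sequence (length n-2, labels 0..n-1) into the n-1 edges of
// the labelled tree it represents.
std::vector<std::pair<int,int>> prufer_to_edges(const std::vector<int> &seq) {
    int n = static_cast<int>(seq.size()) + 2;
    std::vector<int> degree(n, 1);
    for (int v : seq) degree[v]++;          // each appearance adds one child

    std::set<int> leaves;                   // nodes whose current degree is 1
    for (int v = 0; v < n; ++v)
        if (degree[v] == 1) leaves.insert(v);

    std::vector<std::pair<int,int>> edges;
    for (int v : seq) {
        int leaf = *leaves.begin();         // smallest remaining leaf
        leaves.erase(leaves.begin());
        edges.emplace_back(leaf, v);
        if (--degree[v] == 1) leaves.insert(v);
    }
    // Exactly two nodes remain; they form the last edge.
    int a = *leaves.begin();
    int b = *std::next(leaves.begin());
    edges.emplace_back(a, b);
    return edges;
}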
Citations: 4
Accelerating Spectral Calculation through Hybrid GPU-Based Computing
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.13
Jian Xiao, Xingyu Xu, Ce Yu, Jiawan Zhang, Shuinai Zhang, Li Ji, Ji-zhou Sun
Spectral calculation and analysis have important practical applications in astrophysics. The main portion of a spectral calculation is solving a large number of one-dimensional numerical integrations at each point of a large three-dimensional parameter space. However, existing widely used solutions still rely on process-level parallelism, which cannot handle the numerous compute-intensive small integration tasks well. This paper presents a GPU-optimized approach to accelerate the numerical integration in massive spectral calculations. We also propose a load-balancing strategy for a hybrid architecture of multiple CPUs and GPUs via shared memory to maximize performance. The approach was prototyped and tested on the Astrophysical Plasma Emission Code (APEC), a commonly used spectral toolset. Compared with the original serial version and a 24-core CPU (2.5 GHz) parallel version, our implementation on 3 Tesla C2075 GPUs achieves speed-ups of up to 300x and 22x, respectively.
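The computational core described here — an independent one-dimensional quadrature at every point of a parameter grid — is embarrassingly parallel. As a rough illustration only (hypothetical integrand, plain composite trapezoid rule, and an OpenMP CPU loop standing in for the paper's GPU kernels):

#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical integrand: emissivity as a function of wavelength for one
// parameter-space point (temperature T, density n_e, abundance Z).
double emissivity(double lambda, double T, double n_e, double Z) {
    return std::exp(-lambda / T) * n_e * Z;   // placeholder physics
}

// Composite trapezoid rule over [a, b] with m intervals.
double integrate(double a, double b, std::size_t m,
                 double T, double n_e, double Z) {
    double h = (b - a) / static_cast<double>(m);
    double sum = 0.5 * (emissivity(a, T, n_e, Z) + emissivity(b, T, n_e, Z));
    for (std::size_t i = 1; i < m; ++i)
        sum += emissivity(a + h * static_cast<double>(i), T, n_e, Z);
    return sum * h;
}

// One independent integral per parameter point; each iteration could be a GPU
// thread (or, as here, an OpenMP thread) because there is no coupling at all.
std::vector<double> spectrum(const std::vector<double> &T,
                             const std::vector<double> &n_e,
                             const std::vector<double> &Z) {
    std::vector<double> out(T.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(out.size()); ++i)
        out[i] = integrate(1.0, 100.0, 1024, T[i], n_e[i], Z[i]);
    return out;
}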
Citations: 0
Optimizing Image Sharpening Algorithm on GPU
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.32
Mengran Fan, Haipeng Jia, Yunquan Zhang, Xiaojing An, Ting Cao
Sharpness is an algorithm used to sharpen images. With growing image sizes and resolutions and the demand for real-time processing, its performance needs to improve greatly. Its independent per-pixel computation makes it a good candidate for large acceleration on a GPU. However, one challenge in porting it to the GPU is that sharpness executes in several stages, each with its own characteristics and with or without data dependencies on other stages. Based on those characteristics, this paper proposes a complete solution for implementing and optimizing sharpness on the GPU. Our solution includes five major and effective techniques: Data Transfer Optimization, Kernel Fusion, Vectorization for Data Locality, Border Optimization, and Reduction Optimization. Experiments show that, compared to a well-optimized CPU version, our GPU solution reaches a 10.7x to 69.3x speedup for different image sizes on an AMD FirePro W8000 GPU.
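The abstract does not reproduce the sharpening pipeline itself; as a point of reference for what the per-pixel stage computes, a generic unsharp-mask style sharpening filter (a CPU sketch, not the authors' GPU code) looks like this:

#include <cstddef>
#include <vector>

// Reference (CPU) version of a 3x3 unsharp-mask style sharpening filter on a
// grayscale image: out = in + amount * (in - blur(in)). Every output pixel is
// independent, which is what makes the per-pixel stages map well to GPU threads.
std::vector<float> sharpen(const std::vector<float> &img,
                           std::size_t w, std::size_t h, float amount) {
    std::vector<float> out(img.size(), 0.0f);
    auto at = [&](std::size_t x, std::size_t y) { return img[y * w + x]; };
    for (std::size_t y = 1; y + 1 < h; ++y) {
        for (std::size_t x = 1; x + 1 < w; ++x) {
            float blur = (at(x-1,y-1) + at(x,y-1) + at(x+1,y-1) +
                          at(x-1,y  ) + at(x,y  ) + at(x+1,y  ) +
                          at(x-1,y+1) + at(x,y+1) + at(x+1,y+1)) / 9.0f;
            out[y * w + x] = at(x, y) + amount * (at(x, y) - blur);
        }
    }
    return out;   // border pixels are left at 0 here; real code handles them
}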
Citations: 0
PDTL: Parallel and Distributed Triangle Listing for Massive Graphs
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.46
Ilias Giechaskiel, G. Panagopoulos, Eiko Yoneki
This paper presents the first distributed triangle listing algorithm with provable CPU, I/O, memory, and network bounds. Finding all triangles (3-cliques) in a graph has numerous applications for density and connectivity metrics, but the majority of existing algorithms for massive graphs are sequential, while distributed versions of algorithms do not guarantee their CPU, I/O, memory, or network requirements. Our Parallel and Distributed Triangle Listing (PDTL) framework focuses on efficient external-memory access in distributed environments instead of fitting subgraphs into memory. It works by performing efficient orientation and load-balancing steps, and by replicating graphs across machines using an extended version of Hu et al.'s Massive Graph Triangulation algorithm. PDTL suits a variety of computational environments, from single-core machines to high-end clusters, and computes the exact triangle count on graphs of over 6B edges and 1B vertices (e.g. Yahoo graphs), outperforming the state-of-the-art systems PowerGraph, OPT, and PATRIC by 2x to 4x while using fewer resources. Our approach thus highlights the importance of I/O in a distributed environment.
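PDTL's external-memory and distribution machinery goes beyond what the abstract describes, but the in-memory orientation step it builds on — directing each edge from the lower-degree endpoint to the higher-degree one and then intersecting out-neighbour lists — is standard and can be sketched as follows (this version only counts triangles; listing would collect the intersection instead of taking its size):

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// In-memory sketch of degree-based orientation plus neighbour intersection.
// adj must contain each undirected edge {u,v} in both adjacency lists.
std::uint64_t count_triangles(const std::vector<std::vector<int>> &adj) {
    int n = static_cast<int>(adj.size());
    // Rank vertices by (degree, id); orient every edge towards the higher rank.
    auto lower_rank = [&](int u, int v) {
        return adj[u].size() != adj[v].size() ? adj[u].size() < adj[v].size()
                                              : u < v;
    };
    std::vector<std::vector<int>> out(n);
    for (int u = 0; u < n; ++u)
        for (int v : adj[u])
            if (lower_rank(u, v)) out[u].push_back(v);
    for (auto &nbrs : out) std::sort(nbrs.begin(), nbrs.end());

    std::uint64_t triangles = 0;            // each triangle is found exactly once
    for (int u = 0; u < n; ++u)
        for (int v : out[u]) {
            std::vector<int> common;
            std::set_intersection(out[u].begin(), out[u].end(),
                                  out[v].begin(), out[v].end(),
                                  std::back_inserter(common));
            triangles += common.size();
        }
    return triangles;
}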
Citations: 30
SZTS: A Novel Big Data Transportation System Benchmark Suite
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.91
Wen Xiong, Zhibin Yu, L. Eeckhout, Zhengdong Bei, Fan Zhang, Chengzhong Xu
Data analytics is at the core of the supply chain for both products and services in modern economies and societies. Big data workloads, however, are placing unprecedented demands on computing technologies, calling for a deep understanding and characterization of these emerging workloads. In this paper, we propose the Shen Zhen Transportation System (SZTS) suite, a novel big data Hadoop benchmark suite comprising real-life transportation analysis applications with real-life input data sets from Shenzhen, China. SZTS uniquely focuses on a specific, real-life application domain, whereas other existing Hadoop benchmark suites, such as HiBench and CloudRank-D, consist of generic algorithms with synthetic inputs. We perform a cross-layer workload characterization at both the job and microarchitecture levels, revealing unique characteristics of SZTS compared to existing Hadoop benchmarks as well as the general-purpose multi-core PARSEC benchmarks. We also study the sensitivity of workload behavior with respect to input data size, and propose a methodology for identifying representative input data sets.
Citations: 10
Parallel (Probable) Lock-Free Hash Sieve: A Practical Sieving Algorithm for the SVP
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.68
Artur Mariano, C. Bischof, Thijs Laarhoven
In this paper, we assess the practicability of Hash Sieve, a recently proposed sieving algorithm for the Shortest Vector Problem (SVP) on lattices, on multi-core shared-memory systems. To this end, we devised a parallel implementation that scales well and is based on a probable lock-free system to handle concurrency. This system, built on spin-locks that are in turn implemented with CAS operations, behaves in practice like a lock-free mechanism, since threads block only when strictly required, and chances are that they are not required to block. With our implementation, we were able to solve the SVP on an arbitrary lattice in dimension 96 in less than 17.5 hours, using 16 physical cores. A least-squares fit of the execution times of our implementation, in seconds, lies between 2^(0.32n - 15) and 2^(0.33n - 16). These results are of paramount importance for the selection of parameters in lattice-based cryptography, as they indicate that sieving algorithms are far more practical for solving the SVP than previously believed.
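A minimal version of the kind of CAS-based spin lock the abstract describes (a generic sketch, not the authors' code) is:

#include <atomic>

// Minimal test-and-test-and-set spin lock built on compare_exchange (CAS).
// A thread spins only while another thread actually holds the lock, which is
// rare when contention on any single protected bucket is low -- the situation
// the abstract describes as "probable lock-free" behavior.
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        for (;;) {
            bool expected = false;
            if (locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire))
                return;                       // we won the CAS, lock acquired
            while (locked.load(std::memory_order_relaxed))
                ;                             // spin until it looks free again
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};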
Citations: 36
Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.107
Da Li, Hancheng Wu, M. Becchi
The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In particular, we focus on recursive algorithms operating on trees and graphs. We propose different parallelization templates aimed at increasing the GPU utilization of these codes. Specifically, we investigate mechanisms to effectively distribute irregular work to streaming multiprocessors and GPU cores. Some of our parallelization templates rely on dynamic parallelism, a feature recently introduced by Nvidia in their Kepler GPUs and announced as part of the OpenCL 2.0 standard. We propose mechanisms to maximize the work performed by nested kernels and minimize the overhead due to their invocation. Our results show that the use of our parallelization templates on applications with irregular nested loops can lead to a 2-6x speedup over baseline GPU codes that do not include load-balancing mechanisms. The use of nested parallelism-based parallelization templates on recursive tree traversal algorithms can lead to substantial speedups (up to 15-24x) over optimized CPU implementations. However, the benefits of nested parallelism are still unclear for recursive applications operating on graphs, especially when recursive code variants require expensive synchronization. In these cases, a flat parallelization of iterative versions of the considered algorithms may be preferable.
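The dynamic-parallelism templates themselves are CUDA-specific, but the flat-parallelization alternative mentioned at the end — turning an irregular nested loop into a single flat index range via a prefix sum over per-row work counts, so that iterations can be distributed evenly regardless of per-row imbalance — can be sketched in plain C++ (names are illustrative, not the paper's templates):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Flatten an irregular nested loop "for each row i: for each j < work[i]"
// into one flat range [0, total) so iterations can be split evenly across
// threads or GPU lanes.
struct FlatIndex { std::size_t row, col; };

std::vector<std::size_t> make_offsets(const std::vector<std::size_t> &work) {
    // Exclusive prefix sum of work[], with the grand total as the last entry.
    std::vector<std::size_t> offsets(work.size() + 1, 0);
    std::partial_sum(work.begin(), work.end(), offsets.begin() + 1);
    return offsets;
}

FlatIndex locate(const std::vector<std::size_t> &offsets, std::size_t flat) {
    // Binary search for the row whose slice [offsets[r], offsets[r+1]) holds flat.
    std::size_t row = std::upper_bound(offsets.begin(), offsets.end(), flat)
                      - offsets.begin() - 1;
    return {row, flat - offsets[row]};
}

A parallel-for over the flat range [0, offsets.back()) then calls locate() to recover the (row, col) pair of each iteration, which removes the per-row imbalance of the naive nested loop.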
Citations: 16
Optimization of Resource Allocation and Energy Efficiency in Heterogeneous Cloud Data Centers
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.9
Amer Qouneh, Ming Liu, Tao Li
Performance and energy efficiency are major concerns in cloud computing data centers. They often carry conflicting requirements, making optimization a challenge. Further complications arise when heterogeneous hardware and data center management technologies are combined. For example, heterogeneous hardware such as General Purpose Graphics Processing Units (GPGPUs) improves performance at the cost of greater power consumption, while virtualization technologies improve resource management and utilization at the cost of degraded performance. In this paper, we focus on exploiting the heterogeneity introduced by GPUs to reduce power budget requirements for servers while maintaining performance. To maintain or improve overall server performance at a reduced power budget, we propose two enhancements: (a) we borrow power from co-located multithreaded virtual machines (VMs) and reallocate it to GPU VMs; (b) to compensate the multi-threaded VMs and re-boost their performance, we borrow virtual computing resources from GPU VMs and reallocate them to CPU VMs. Combining the two techniques minimizes the server power budget while maintaining overall server performance. Our results show that the server power budget can be reduced by almost 18% at an average cost of 13% performance degradation per virtual machine. In addition, reallocating virtual resources improves the performance of multi-threaded applications by 30% without affecting GPU applications. Combining both techniques reduces server energy consumption by 47% with minimal performance degradation.
Citations: 6
Joint Media Streaming Optimization of Energy and Rebuffering Time in Cellular Networks
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.49
Zeqi Lai, Yong Cui, Yayun Bao, Jiangchuan Liu, Yingchao Zhao, Xiao Ma
Streaming services are gaining popularity and contribute a tremendous fraction of today's cellular network traffic. Both playback fluency and battery endurance are significant performance metrics for mobile streaming services. However, because of unpredictable network conditions and the loose coupling between upper-layer streaming protocols and underlying network configurations, jointly optimizing rebuffering time and energy consumption for mobile streaming services remains a significant challenge. In this paper, we propose a novel framework that effectively addresses the above limitations and optimizes video transmission in cellular networks. Within this framework we design two complementary algorithms, the Rebuffering Time Minimization Algorithm (RTMA) and the Energy Minimization Algorithm (EMA), to achieve smooth playback and energy efficiency on demand in multi-user scenarios. Our algorithms integrate cross-layer parameters to schedule video delivery. Specifically, RTMA aims to achieve the minimum rebuffering time with limited energy, and EMA tries to obtain the minimum energy consumption while meeting a rebuffering time constraint. Extensive simulation demonstrates that RTMA is able to reduce rebuffering time by at least 68%, and EMA can achieve more than 27% energy reduction, compared with other state-of-the-art solutions.
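Schematically (the abstract does not give the exact system model), the two algorithms solve dual constrained problems over a delivery schedule $s$, where $E(s)$ is the energy consumed and $T_{\mathrm{rebuf}}(s)$ the total rebuffering time:

\begin{align*}
\text{RTMA:}\quad & \min_{s}\; T_{\mathrm{rebuf}}(s) \quad \text{s.t.}\quad E(s) \le E_{\max},\\
\text{EMA:}\quad  & \min_{s}\; E(s) \quad \text{s.t.}\quad T_{\mathrm{rebuf}}(s) \le T_{\max}.
\end{align*}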
Citations: 1