
Latest publications: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

QoS-Driven Coordinated Management of Resources to Save Energy in Multi-core Systems
Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00040
M. Nejat, M. Pericàs, P. Stenström
Applications that are run on multicore systems without performance targets can waste significant energy. This paper considers, for the first time, a QoS-driven coordinated resource management algorithm (RMA) that dynamically adjusts the size of the per-core last-level cache partitions and the per-core voltage-frequency settings to save energy while respecting the QoS requirements of individual applications in multi-programmed workloads run on multi-core systems. It does so through configuration-space exploration across the spectrum of LLC partition sizes and DVFS settings at runtime, at negligible overhead. We show that our proposed coordinated RMA saves, on average, 20% energy, compared to 15% for DVFS alone and 7% for cache partitioning alone, when the performance target is set to 70% of the baseline system performance.
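As a rough illustration of the coordinated search the abstract describes, the following Python sketch enumerates hypothetical (frequency, LLC-ways) configurations and picks the minimum-energy one that still meets a QoS target. The `explore` function, the configuration values, and both models are invented stand-ins, not the paper's algorithm:

```python
# Illustrative sketch only: brute-force search over per-core
# (frequency, LLC-ways) configurations, returning the minimum-energy
# configuration that still meets a QoS (performance) target.
# The performance and energy models are made-up stand-ins.

def explore(configs, perf_model, energy_model, qos_target):
    """Return the minimum-energy configuration meeting the QoS target."""
    feasible = [c for c in configs if perf_model(c) >= qos_target]
    return min(feasible, key=energy_model) if feasible else None

freqs = [1.0, 1.5, 2.0, 2.5]                 # GHz (hypothetical DVFS points)
ways = [2, 4, 8]                             # LLC ways granted to the core
configs = [(f, w) for f in freqs for w in ways]

perf_model = lambda c: 0.3 * c[0] + 0.05 * c[1]    # toy IPC-like score
energy_model = lambda c: c[0] ** 3 + 0.1 * c[1]    # cubic-ish DVFS cost

print(explore(configs, perf_model, energy_model, qos_target=0.7))
```

With these toy models the search settles on the lowest frequency paired with a large cache allocation, which mirrors the paper's point that coordinating both knobs can beat tuning either one alone.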
{"title":"QoS-Driven Coordinated Management of Resources to Save Energy in Multi-core Systems","authors":"M. Nejat, M. Pericàs, P. Stenström","doi":"10.1109/IPDPS.2019.00040","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00040","url":null,"abstract":"Applications that are run on multicore systems without performance targets can waste significant energy. This paper considers, for the first time, a QoS-driven coordinated resource management algorithm (RMA) that dynamically adjusts the size of the per-core last-level cache partitions and the per-core voltage-frequency settings to save energy while respecting QoS requirements of individual applications in multi-programmed workloads run on multi-core systems. It does so by doing configuration-space exploration across the spectrum of LLC partition sizes and DVFS settings at runtime at negligible overhead. Compared to DVFS and cache partitioning alone, we show that our proposed coordinated RMA is capable of saving, on average, 20% energy as compared to 15% for DVFS alone and 7% for cache partitioning alone, when the performance target is set to 70% of the baseline system performance.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122184473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Network Size Estimation in Small-World Networks Under Byzantine Faults
Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00094
Soumyottam Chatterjee, Gopal Pandurangan, Peter Robinson
We study the fundamental problem of counting the number of nodes in a sparse network (of unknown size) in the presence of a large number of Byzantine nodes. We assume the full information model, where the Byzantine nodes have complete knowledge about the entire state of the network at every round (including random choices made by all the nodes), have unbounded computational power, and can deviate arbitrarily from the protocol. Essentially all known algorithms for fundamental Byzantine problems (e.g., agreement, leader election, sampling) studied in the literature assume knowledge (or at least an estimate) of the size of the network. It is non-trivial to design algorithms for Byzantine problems that work without knowledge of the network size, especially in bounded-degree (expander) networks where the local views of all nodes are (essentially) the same and limited, and Byzantine nodes can quite easily fake the presence/absence of non-existing nodes. To design truly local algorithms that do not rely on any global knowledge (including network size), estimating the size of the network under Byzantine nodes is an important first step. Our main contribution is a randomized distributed algorithm that estimates the size of a network in the presence of a large number of Byzantine nodes. In particular, our algorithm estimates the size of a sparse, "small-world", expander network with up to O(n^(1-Δ)) Byzantine nodes, where n is the (unknown) network size and Δ > 0 can be any arbitrarily small (but fixed) constant. Our algorithm outputs a (fixed) constant-factor estimate of log(n) with high probability; the correct estimate of the network size will be known to a large (1 - ε)-fraction of the honest nodes, for any fixed positive constant ε.
Our algorithm is fully distributed, lightweight, and simple to implement, runs in O(log^3 n) rounds, and requires nodes to send and receive only small-sized messages per round; any node's local computation cost per round is also small.
{"title":"Network Size Estimation in Small-World Networks Under Byzantine Faults","authors":"Soumyottam Chatterjee, Gopal Pandurangan, Peter Robinson","doi":"10.1109/IPDPS.2019.00094","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00094","url":null,"abstract":"We study the fundamental problem of counting the number of nodes in a sparse network (of unknown size) under the presence of a large number of Byzantine nodes. We assume the full information model where the Byzantine nodes have complete knowledge about the entire state of the network at every round (including random choices made by all the nodes), have unbounded computational power, and can deviate arbitrarily from the protocol. Essentially all known algorithms for fundamental Byzantine problems (e.g., agreement, leader election, sampling) studied in the literature assume the knowledge (or at least an estimate) of the size of the network. It is non-trivial to design algorithms for Byzantine problems that work without knowledge of the network size, especially in bounded-degree (expander) networks where the local views of all nodes are (essentially) the same and limited, and Byzantine nodes can quite easily fake the presence/absence of non-existing nodes. To design truly local algorithms that do not rely on any global knowledge (including network size), estimating the size of the network under Byzantine nodes is an important first step. Our main contribution is a randomized distributed algorithm that estimates the size of a network under the presence of a large number of Byzantine nodes. In particular, our algorithm estimates the size of a sparse, \"small-world\", expander network with up to O(n^1-Δ) Byzantine nodes, where n is the (unknown) network size and Δ > 0 can be be any arbitrarily small (but fixed) constant. 
Our algorithm outputs a (fixed) constant factor estimate of log(n) with high probability; the correct estimate of the network size will be known to a large fraction (1 - ε)-fraction, for any fixed positive constant ε) of the honest nodes. Our algorithm is fully distributed, lightweight, and simple to implement, runs in O(log^3 n) rounds, and requires nodes to send and receive messages of only small-sized messages per round; any node's local computation cost per round is also small.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114633849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Message from the Program Chair and Vice Chairs
Pub Date : 2019-05-01 DOI: 10.1109/PERCOM.2005.25
Christian Becker, A. Dey, F. Lau, Gergely Záruba Vice-Chairs
We are pleased to announce an excellent technical program for the 6th International Conference on Pervasive Computing and Communications. The program covers a broad cross-section of topics in pervasive computing and communications. This year, 160 papers were submitted to the program committee for consideration. The selection process was therefore highly competitive, yielding a program of high-quality papers.
{"title":"Message from the Program Chair and Vice Chairs","authors":"Christian Becker, A. Dey, F. Lau, Gergely Záruba Vice-Chairs","doi":"10.1109/PERCOM.2005.25","DOIUrl":"https://doi.org/10.1109/PERCOM.2005.25","url":null,"abstract":"We are pleased to announce an excellent technical program for the 6th International Conference on Pervasive Computing and Communications. The program covers a broad cross section of topics in pervasive computing and communications. This year, 160 papers were submitted for consideration to the program committee. As a result, the selection process was highly competitive, and the result is a program of high-quality papers.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115752648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Tight & Simple Load Balancing
Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00080
P. Berenbrink, Tom Friedetzky, Dominik Kaaser, Peter Kling
We consider the following load balancing process for m tokens distributed arbitrarily among n nodes connected by a complete graph. In each time step a pair of nodes is selected uniformly at random. Let ℓ_1 and ℓ_2 be their respective numbers of tokens. The two nodes exchange tokens such that they have ⌈(ℓ_1 + ℓ_2)/2⌉ and ⌊(ℓ_1 + ℓ_2)/2⌋ tokens, respectively. We provide a simple analysis showing that this process reaches almost perfect balance within O(n log n + n log Δ) steps with high probability, where Δ is the maximal initial load difference between any two nodes. This bound is asymptotically tight.
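The process above is fully specified, so it can be simulated directly. A minimal Python sketch; the node count, token count, step budget, and random seed are illustrative choices, not from the paper:

```python
import random

def balance(tokens, steps, rng):
    """The pairwise averaging process from the abstract: pick a random
    pair, and re-split their combined load into ceiling and floor halves."""
    n = len(tokens)
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)        # a uniformly random distinct pair
        total = tokens[i] + tokens[j]
        tokens[i] = (total + 1) // 2          # ceil(total / 2)
        tokens[j] = total // 2                # floor(total / 2)
    return tokens

n, m = 64, 64 * 1000
tokens = [0] * n
tokens[0] = m                                 # maximal initial imbalance
balance(tokens, steps=50_000, rng=random.Random(42))
print(max(tokens) - min(tokens))              # small, per the O(n log n + n log Δ) bound
```

The step budget here is far above the theoretical bound for these parameters, so the final discrepancy is expected to be a small constant.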
{"title":"Tight & Simple Load Balancing","authors":"P. Berenbrink, Tom Friedetzky, Dominik Kaaser, Peter Kling","doi":"10.1109/IPDPS.2019.00080","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00080","url":null,"abstract":"We consider the following load balancing process for m tokens distributed arbitrarily among n nodes connected by a complete graph. In each time step a pair of nodes is selected uniformly at random. Let ℓ_1 and ℓ_2 be their respective number of tokens. The two nodes exchange tokens such that they have ⌈(ℓ_1 + ℓ_2)/2⌉ and ⌈(ℓ_1 + ℓ_2)/2⌉ tokens, respectively. We provide a simple analysis showing that this process reaches almost perfect balance within O(n log n + n log Δ) steps with high probability, where Δ is the maximal initial load difference between any two nodes. This bound is asymptotically tight.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126936332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
A Scalable Clustering-Based Task Scheduler for Homogeneous Processors Using DAG Partitioning
Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00026
M. Özkaya, A. Benoit, B. Uçar, J. Herrmann, Ümit V. Çatalyürek
When scheduling a directed acyclic graph (DAG) of tasks with communication costs on computational platforms, a good trade-off between load balance and data locality is necessary. List-based scheduling techniques are commonly used greedy approaches for this problem. The downside of list-scheduling heuristics is that they are incapable of making short-term sacrifices for the global efficiency of the schedule. In this work, we describe new list-based scheduling heuristics based on clustering for homogeneous platforms, under the realistic duplex single-port communication model. Our approach uses an acyclic DAG partitioner for clustering. The clustering enhances the data locality of the scheduler with a global view of the graph. Furthermore, since the partition is acyclic, we can schedule each part completely once its input tasks are ready to be executed. We present an extensive experimental evaluation showing the trade-offs between the granularity of clustering and the parallelism, and how this affects the scheduling. Furthermore, we compare our heuristics to the best state-of-the-art list-scheduling and clustering heuristics, and obtain more than three times better makespan in cases with many communications.
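For contrast with the clustering approach, a minimal greedy list-scheduling baseline of the kind the abstract criticizes can be sketched as follows. The priority rule, the example DAG, and the omission of communication costs are simplifying assumptions for illustration, not the paper's heuristics:

```python
def list_schedule(durations, deps, num_procs):
    """Greedy list scheduling on identical processors: repeatedly start a
    ready task on the processor where it can begin earliest. Communication
    costs are ignored and no clustering decisions are made."""
    n = len(durations)
    indeg = [0] * n
    children = [[] for _ in range(n)]
    for task, preds in deps.items():
        for p in preds:
            children[p].append(task)
            indeg[task] += 1
    finish = [0.0] * n
    ready = [t for t in range(n) if indeg[t] == 0]
    free = [0.0] * num_procs                 # time each processor becomes idle
    while ready:
        t = min(ready, key=lambda x: durations[x])   # a simple priority rule
        ready.remove(t)
        data_ready = max((finish[p] for p in deps.get(t, [])), default=0.0)
        k = min(range(num_procs), key=lambda i: max(free[i], data_ready))
        start = max(free[k], data_ready)
        finish[t] = start + durations[t]
        free[k] = finish[t]
        for c in children[t]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return max(finish)                       # makespan

# Diamond DAG: task 0 feeds tasks 1 and 2, which both feed task 3.
durations = [1.0, 2.0, 2.0, 1.0]
deps = {1: [0], 2: [0], 3: [1, 2]}
print(list_schedule(durations, deps, num_procs=2))
```

Because each decision is purely local, such a scheduler cannot accept a short-term loss (e.g., idling a processor) for a globally better placement, which is exactly the weakness the paper's partition-then-schedule approach targets.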
{"title":"A Scalable Clustering-Based Task Scheduler for Homogeneous Processors Using DAG Partitioning","authors":"M. Özkaya, A. Benoit, B. Uçar, J. Herrmann, Ümit V. Çatalyürek","doi":"10.1109/IPDPS.2019.00026","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00026","url":null,"abstract":"When scheduling a directed acyclic graph (DAG) of tasks with communication costs on computational platforms, a good trade-off between load balance and data locality is necessary. List-based scheduling techniques are commonly-used greedy approaches for this problem. The downside of list-scheduling heuristics is that they are incapable of making short-term sacrifices for the global efficiency of the schedule. In this work, we describe new list-based scheduling heuristics based on clustering for homogeneous platforms, under the realistic duplex single-port communication model. Our approach uses an acyclic partitioner for DAGs for clustering. The clustering enhances the data locality of the scheduler with a global view of the graph. Furthermore, since the partition is acyclic, we can schedule each part completely once its input tasks are ready to be executed. We present an extensive experimental evaluation showing the trade-offs between the granularity of clustering and the parallelism, and how this affects the scheduling. 
Furthermore, we compare our heuristics to the best state-of-the-art list-scheduling and clustering heuristics, and obtain more than three times better makespan in cases with many communications.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"44 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125680807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
Dynamic Memory Management for GPU-Based Training of Deep Neural Networks
Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00030
B. ShriramS, Anshuj Garg, Purushottam Kulkarni
Deep learning has been widely adopted for different applications of artificial intelligence: speech recognition, natural language processing, computer vision, etc. The growing size of Deep Neural Networks (DNNs) has compelled researchers to design memory-efficient and performance-optimal algorithms. Apart from algorithmic improvements, specialized hardware like Graphics Processing Units (GPUs) is widely employed to accelerate the training and inference phases of deep networks. However, limited GPU memory capacity bounds the size of the networks that can be offloaded to and trained using GPUs. vDNN addresses the GPU memory bottleneck and provides a solution that enables training of deep networks larger than GPU memory. In our work, we characterize and identify multiple bottlenecks of vDNN, such as delayed computation start, high pinned-memory requirements, and GPU memory fragmentation. We present vDNN++, which extends vDNN and resolves the identified issues. Our results show that the performance of vDNN++ is comparable to or better than vDNN (up to 60% relative improvement). We propose different heuristics and orderings for memory allocation, and empirically evaluate the extent of memory fragmentation under them. We are also able to reduce the pinned memory requirement by up to 60%.
{"title":"Dynamic Memory Management for GPU-Based Training of Deep Neural Networks","authors":"B. ShriramS, Anshuj Garg, Purushottam Kulkarni","doi":"10.1109/IPDPS.2019.00030","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00030","url":null,"abstract":"Deep learning has been widely adopted for different applications of artificial intelligence - speech recognition, natural language processing, computer vision etc. The growing size of Deep Neural Networks (DNNs) has compelled the researchers to design memory efficient and performance optimal algorithms. Apart from algorithmic improvements, specialized hardware like Graphics Processing Units (GPUs) are being widely employed to accelerate the training and inference phases of deep networks. However, the limited GPU memory capacity limits the upper bound on the size of networks that can be offloaded to and trained using GPUs. vDNN addresses the GPU memory bottleneck issue and provides a solution which enables training of deep networks that are larger than GPU memory. In our work, we characterize and identify multiple bottlenecks with vDNN like delayed computation start, high pinned memory requirements and GPU memory fragmentation. We present vDNN++ which extends vDNN and resolves the identified issues. Our results show that the performance of vDNN++ is comparable or better (up to 60% relative improvement) than vDNN. We propose different heuristics and order for memory allocation, and empirically evaluate the extent of memory fragmentation with them. 
We are also able to reduce the pinned memory requirement by up to 60%.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122060192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 32
LLC-Guided Data Migration in Hybrid Memory Systems
Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00101
E. Vasilakis, Vassilis D. Papaefstathiou, P. Trancoso, I. Sourdis
Although 3D-stacked DRAM offers substantially higher bandwidth than commodity DDR DIMMs, it cannot yet provide the necessary capacity to replace the bulk of the memory. A promising alternative is to use flat address space, hybrid memory systems of two or more levels, each exhibiting different performance characteristics. One such existing approach employs a near, high-bandwidth 3D-stacked memory, placed on top of the processor die, combined with a far, commodity DDR memory, placed off-chip. Migrating data from the far to the near memory has significant performance potential, but also entails overheads, which may diminish migration benefits or even lead to performance degradation. This paper describes a new data migration scheme for hybrid memory systems that takes these overheads into account and improves migration efficiency and effectiveness. It is based on the observation that migrating memory segments that are (partly) present in the Last-Level Cache (LLC) introduces lower migration traffic. Our approach relies on the state of the LLC cachelines to predict future reuse and to select memory segments for migration. Segments are thereby migrated while they are (at least partly) present in the LLC, incurring lower cost. Our experiments confirm that our approach outperforms current state-of-the-art migration designs, improving system performance by 12.1% and reducing memory system dynamic energy by 13.2%.
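A toy sketch of the selection idea, assuming a hypothetical `pick_migration_candidates` policy that ranks far-memory segments by the fraction of their cachelines present in the LLC (illustrative only, not the paper's actual mechanism):

```python
def pick_migration_candidates(segments, llc_lines, threshold=0.25):
    """Hypothetical selection rule capturing the paper's observation:
    rank far-memory segments by the fraction of their cachelines
    currently resident in the LLC. Cached lines need not be transferred,
    and LLC presence hints at future reuse, so partly-cached segments
    are both cheaper and more useful to migrate."""
    scored = []
    for seg in segments:
        cached = sum(1 for line in seg["lines"] if line in llc_lines)
        fraction = cached / len(seg["lines"])
        if fraction >= threshold:
            scored.append((fraction, seg["id"]))
    return [seg_id for _, seg_id in sorted(scored, reverse=True)]

segments = [
    {"id": "A", "lines": [1, 2, 3, 4]},   # two lines cached -> fraction 0.5
    {"id": "B", "lines": [5, 6, 7, 8]},   # one line cached  -> fraction 0.25
    {"id": "C", "lines": [9, 10]},        # nothing cached   -> fraction 0.0
]
print(pick_migration_candidates(segments, llc_lines={1, 2, 5}))
```

Here segment C is skipped entirely: migrating it would pay full traffic cost with no reuse signal, which is the trade-off the scheme is designed to avoid.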
{"title":"LLC-Guided Data Migration in Hybrid Memory Systems","authors":"E. Vasilakis, Vassilis D. Papaefstathiou, P. Trancoso, I. Sourdis","doi":"10.1109/IPDPS.2019.00101","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00101","url":null,"abstract":"Although 3D-stacked DRAM offers substantially higher bandwidth than commodity DDR DIMMs, it cannot yet provide the necessary capacity to replace the bulk of the memory. A promising alternative is to use flat address space, hybrid memory systems of two or more levels, each exhibiting different performance characteristics. One such existing approach employs a near, high bandwidth 3D-stacked memory, placed on top of the processor die, combined with a far, commodity DDR memory, placed off-chip. Migrating data from the far to the near memory has significant performance potential, but also entails overheads, which may diminish migration benefits or even lead to performance degradation. This paper describes a new data migration scheme for hybrid memory systems that takes into account the above overheads and improves migration efficiency and effectiveness. It is based on the observation that migrating memory segments, which are (partly) present in the Last-Level Cache (LLC) introduces lower migration traffic. Our approach relies on the state of the LLC cachelines to predict future reuse and select memory segments for migration. Thereby, the segments are migrated when present (at least partly) in the LLC incurring lower cost. 
Our experiments confirm that our approach outperforms current state-of-the art migration designs improving system performance by 12.1% and reducing memory system dynamic energy by 13.2%.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130977101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
BigSpa: An Efficient Interprocedural Static Analysis Engine in the Cloud
Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00086
Zhiqiang Zuo, Rong Gu, Xi Jiang, Zhaokang Wang, Yihua Huang, Linzhang Wang, Xuandong Li
Static program analysis is widely used in various application areas to solve many practical problems. Although researchers have made significant achievements in static analysis, it remains challenging to perform sophisticated interprocedural analysis on large-scale modern software. The underlying reason is that interprocedural analysis for large-scale modern software is highly computation- and memory-intensive, leading to poor scalability. We aim to tackle the scalability problem by proposing a novel big-data solution for sophisticated static analysis. Specifically, we propose a data-parallel algorithm and a join-process-filter computation model for CFL-reachability-based interprocedural analysis, and develop an efficient distributed static analysis engine in the cloud, called BigSpa. Our experiments validated that BigSpa running on a cluster scales well to perform precise interprocedural analyses on millions of lines of code, and runs an order of magnitude or more faster than existing state-of-the-art analysis tools.
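The join-process-filter shape can be illustrated on plain edge reachability, a degenerate case of CFL-reachability, with a semi-naive fixed point. This sketch is a deliberate simplification, not BigSpa's distributed implementation:

```python
def transitive_closure(edges):
    """Semi-naive fixed point illustrating the join-process-filter shape:
    join newly derived facts with base edges, process them into candidate
    facts, and filter out those already derived, until nothing new appears.
    (Plain reachability, not the grammar-driven CFL-reachability BigSpa
    actually computes.)"""
    reach = set(edges)
    frontier = set(edges)
    while frontier:
        # join: connect frontier facts (a, b) with base edges (b, d)
        joined = {(a, d) for (a, b) in frontier for (c, d) in edges if b == c}
        # filter: keep only genuinely new facts
        frontier = joined - reach
        reach |= frontier
    return reach

print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))
```

Restricting each iteration's join to the frontier, rather than re-joining all known facts, is the standard semi-naive optimization, and the same join/filter decomposition is what makes the computation amenable to data-parallel execution.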
{"title":"BigSpa: An Efficient Interprocedural Static Analysis Engine in the Cloud","authors":"Zhiqiang Zuo, Rong Gu, Xi Jiang, Zhaokang Wang, Yihua Huang, Linzhang Wang, Xuandong Li","doi":"10.1109/IPDPS.2019.00086","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00086","url":null,"abstract":"Static program analysis is widely used in various application areas to solve many practical problems. Although researchers have made significant achievements in static analysis, it is still too challenging to perform sophisticated interprocedural analysis on large-scale modern software. The underlying reason is that interprocedural analysis for large-scale modern software is highly computation-and memory-intensive, leading to poor scalability. We aim to tackle the scalability problem by proposing a novel big data solution for sophisticated static analysis. Specifically, we propose a data-parallel algorithm and a join-process-filter computation model for the CFL-reachability based interprocedural analysis and develop an efficient distributed static analysis engine in the cloud, called BigSpa. Our experiments validated that BigSpa running on a cluster scales greatly to perform precise interprocedural analyses on millions of lines of code, and runs an order of magnitude or more faster than the existing state-of-the-art analysis tools.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114999472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
SAC Goes Cluster: Fully Implicit Distributed Computing
Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00107
Thomas Macht, C. Grelck
SAC (Single Assignment C) is a purely functional, data-parallel array programming language that predominantly targets compute-intensive applications. Thus, clusters of workstations, or distributed memory architectures in general, form highly relevant compilation targets. Nevertheless, SAC as of today only supports shared-memory architectures, graphics accelerators and heterogeneous combinations thereof. In our current work we aim at closing this gap. At the same time, we are determined to uphold SAC's promise of entirely compiler-directed exploitation of concurrency, no matter what the target architecture is. Distributed memory architectures are going to make this promise a particular challenge. Despite SAC's functional semantics, it is generally far from straightforward to infer exact communication patterns from architecture-agnostic code. Therefore, we intend to capitalise on recent advances in network technology, namely the closing of the gap between memory bandwidth and network bandwidth. We aim at a solution based on a custom-designed software distributed shared memory (S-DSM) and large per-node software-managed cache memories. To this effect the functional nature of SAC, with its write-once/read-only arrays, provides a strategic advantage that we thoroughly exploit. Throughout the paper we further motivate our approach, sketch out our implementation strategy, show preliminary results, and discuss the pros and cons of our approach.
{"title":"SAC Goes Cluster: Fully Implicit Distributed Computing","authors":"Thomas Macht, C. Grelck","doi":"10.1109/IPDPS.2019.00107","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00107","url":null,"abstract":"SAC (Single Assignment C) is a purely functional, data-parallel array programming language that predominantly targets compute-intensive applications. Thus, clusters of workstations, or distributed memory architectures in general, form highly relevant compilation targets. Notwithstanding, SAC as of today only supports shared-memory architectures, graphics accelerators and heterogeneous combinations thereof. In our current work we aim at closing this gap. At the same time, we are determined to uphold SAC's promise of entirely compiler-directed exploitation of concurrency, no matter what the target architecture is. Distributed memory architectures are going to make this promise a particular challenge. Despite SAC's functional semantics, it is generally far from straightforward to infer exact communication patterns from architecture-agnostic code. Therefore, we intend to capitalise on recent advances in network technology, namely the closing of the gap between memory bandwidth and network bandwidth. We aim at a solution based on a custom-designed software distributed shared memory (S-DSM) and large per-node software-managed cache memories. To this effect the functional nature of SAC with its write-once/read-only arrays provides a strategic advantage that we thoroughly exploit. 
Throughout the paper we further motivate our approach, sketch out our implementation strategy, show preliminary results and discuss the pros and cons of our approach.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125173176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
The Path to Delivering Programable Exascale Systems
Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00081
L. DeRose
The trends in hardware architecture are paving the road towards Exascale. However, these trends are also increasing the complexity of design and development of the software developer environment that is deployed on modern supercomputers. Moreover, the scale and complexity of high-end systems creates a new set of challenges for application developers. Computational scientists are facing system characteristics that will significantly impact the programmability and scalability of applications. In order to address these issues, software architects need to take a holistic view of the entire system and deliver a high-level programming environment that can help maximize programmability, while not losing sight of performance portability. In this talk, I will discuss the current trends in computer architecture and their implications in application development and will present Cray’s high level parallel programming environment for performance and programmability on current and future supercomputers. I will also discuss some of the challenges and open research problems that need to be addressed in order to build a software developer environment for extreme-scale systems that helps users solve multi-disciplinary and multi-scale problems with high levels of performance, programmability, and scalability.
{"title":"The Path to Delivering Programable Exascale Systems","authors":"L. DeRose","doi":"10.1109/IPDPS.2019.00081","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00081","url":null,"abstract":"The trends in hardware architecture are paving the road towards Exascale. However, these trends are also increasing the complexity of design and development of the software developer environment that is deployed on modern supercomputers. Moreover, the scale and complexity of high-end systems creates a new set of challenges for application developers. Computational scientists are facing system characteristics that will significantly impact the programmability and scalability of applications. In order to address these issues, software architects need to take a holistic view of the entire system and deliver a high-level programming environment that can help maximize programmability, while not losing sight of performance portability. In this talk, I will discuss the current trends in computer architecture and their implications in application development and will present Cray’s high level parallel programming environment for performance and programmability on current and future supercomputers. 
I will also discuss some of the challenges and open research problems that need to be addressed in order to build a software developer environment for extreme-scale systems that helps users solve multi-disciplinary and multi-scale problems with high levels of performance, programmability, and scalability.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122391308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Journal
2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)