
Latest publications: 2015 44th International Conference on Parallel Processing

Code 5-6: An Efficient MDS Array Coding Scheme to Accelerate Online RAID Level Migration
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.54
Chentao Wu, Xubin He, Jie Li, M. Guo
With the rapid growth of data storage, the demand for high reliability becomes critical in large data centers, where RAID-5 is widely used. However, the disk failure rate increases sharply after some usage, so concurrent disk failures are not rare, and RAID-5 is insufficient to provide high reliability. A solution is to convert an existing RAID-5 to a RAID-6 (a type of "RAID level migration") to tolerate more concurrent disk failures via erasure codes, but existing approaches involve a complex conversion process and high transformation cost. To address these challenges, we propose a novel MDS code, called "Code 5-6", which combines a new dedicated parity column with the original RAID-5 layout. Code 5-6 not only accelerates online conversion from RAID-5 to RAID-6, but also demonstrates several optimal properties of MDS codes. Our mathematical analysis shows that, compared to existing MDS codes, Code 5-6 reduces the number of new parities by up to 80%, decreases total I/O operations by up to 48.5%, and speeds up the conversion process by up to 3.38×.
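The general idea behind such a migration, keeping RAID-5's existing row parity and adding one new parity column, can be sketched as below. This is an illustrative toy, not Code 5-6's actual layout: we use simple XOR row parity (P) plus an XOR-of-diagonals second parity (Q), and all function names are assumptions for illustration.

```python
# Toy sketch of RAID-5 -> RAID-6 migration: the row parity P already
# exists in the RAID-5 array; only the new second parity column Q must
# be built. Q here is a simple diagonal XOR, NOT Code 5-6's construction.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def raid5_parity(data_columns):
    """Row parity P, as kept by an existing RAID-5 array."""
    return [xor_blocks([col[r] for col in data_columns])
            for r in range(len(data_columns[0]))]

def second_parity(data_columns):
    """An illustrative diagonal-style second parity Q."""
    n = len(data_columns)
    rows = len(data_columns[0])
    return [xor_blocks([data_columns[c][(r + c) % rows] for c in range(n)])
            for r in range(rows)]

# A stripe of 4 data columns, 4 rows, 4-byte blocks.
data = [[bytes([c * 16 + r] * 4) for r in range(4)] for c in range(4)]
p = raid5_parity(data)    # already present before migration
q = second_parity(data)   # the new column a RAID-6 migration must build
```

A scheme like Code 5-6 is designed so that building the new column touches as little existing data and parity as possible, which is where the reported I/O savings come from.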
Citations: 3
Crowdsourcing Sensing Workloads of Heterogeneous Tasks: A Distributed Fairness-Aware Approach
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.67
Wei Sun, Yanmin Zhu, L. Ni, Bo Li
Crowdsourced sensing over smartphones presents a new paradigm for collecting sensing data over a vast area for real-time monitoring applications. A monitoring application may require different types of sensing data while operating under a budget constraint. This paper explores the crucial problem of maximizing the aggregate data utility of heterogeneous sensing tasks while maintaining utility-centric fairness across different tasks under a budget constraint. In particular, we take the redundancy of sensing data into account. The problem is highly challenging given its unique characteristics, including the intrinsic trade-off between aggregate data utility and fairness and the large number of smartphones involved. We propose a fairness-aware distributed approach to solving this problem. To overcome the intractability of the problem, we decompose it into two subproblems: recruiting smartphones under a budget constraint and allocating the workloads of sensing tasks. For the first subproblem, we propose an efficient greedy algorithm with a constant approximation ratio of two. For the second, we apply dual-based decomposition, on which we build a distributed algorithm for determining the workloads of different tasks on each recruited smartphone. We have implemented our distributed algorithm on a Windows-based server and Android-based smartphones. Extensive simulations demonstrate that our approach achieves high aggregate data utility while maintaining good utility-centric fairness across sensing tasks.
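A minimal sketch of the "greedy with a constant approximation ratio of two" idea for the recruitment subproblem, under the standard construction for budgeted selection: take candidates in utility-per-cost order while the budget allows, then compare against the single best affordable candidate. The names (`Phone`, `recruit`) and the additive utility model are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: budget-constrained greedy recruitment. Taking the better
# of (a) the utility-per-cost greedy set and (b) the best single
# affordable item is the classic trick behind constant-factor guarantees.

from collections import namedtuple

Phone = namedtuple("Phone", ["id", "utility", "cost"])

def recruit(phones, budget):
    ranked = sorted(phones, key=lambda p: p.utility / p.cost, reverse=True)
    chosen, spent = [], 0.0
    for p in ranked:
        if spent + p.cost <= budget:
            chosen.append(p)
            spent += p.cost
    greedy_u = sum(p.utility for p in chosen)
    best_single = max((p for p in phones if p.cost <= budget),
                      key=lambda p: p.utility, default=None)
    # Return whichever of the two candidate solutions has higher utility.
    if best_single and best_single.utility > greedy_u:
        return [best_single]
    return chosen

phones = [Phone("a", 10, 5), Phone("b", 9, 3), Phone("c", 4, 4)]
team = recruit(phones, budget=8)   # picks "b" then "a" (ratio order)
```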
Citations: 10
Study on Partitioning Real-World Directed Graphs of Skewed Degree Distribution
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.37
Jie Yan, Guangming Tan, Ninghui Sun
Distributed computation on directed graphs has become increasingly important in emerging big data analytics. However, partitioning huge real-world graphs, such as social and web networks, is known to be challenging because of their skewed (or power-law) degree distributions. In this paper, by investigating two representative k-way balanced edge-cut methods (the LDG streaming heuristic and METIS) on 12 real social and web graphs, we empirically find that both LDG and METIS can partition page-level web graphs with extremely high quality, but fail to generate low-cut balanced partitions for social networks and host-level web graphs. Our analysis identifies the global star-motif structures around high-degree vertices as the main obstacle to high-quality partitioning. Based on this empirical study, we further propose a new distributed graph model, namely Agent-Graph, and the Agent+ framework, which partitions power-law graphs in the Agent-Graph model. An Agent-Graph is a vertex-cut variant in the context of message passing, where any high-degree vertex is factored into arbitrary computational agents in remote partitions for message combining and scattering. The Agent+ framework filters out the high-degree vertices to form a residual graph, which is then partitioned with high quality by existing edge-cut methods, and finally refills the high-degree vertices as agents to construct an agent-graph. Experiments show that the Agent+ approach consistently generates high-quality partitions for all tested real-world skewed graphs. In particular, for 64-way partitioning of social networks and host-level web graphs, the Agent+ approach reduces the edge cut equivalently by 27%~79% for LDG and 23%~82% for METIS.
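The filter-partition-refill pipeline described above can be sketched as follows. This is a hedged toy: the degree threshold, the function names, and the round-robin stand-in for the real edge-cut partitioner (LDG/METIS) are all assumptions for illustration.

```python
# Sketch of the Agent+ pipeline: peel off high-degree "hub" vertices,
# partition the residual graph, then refill each hub as an agent in
# every partition that holds one of its neighbors.

def agent_plus_partition(edges, threshold, k):
    # Degree count over the undirected view of the edge list.
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    hubs = {v for v, d in deg.items() if d >= threshold}

    # Residual graph: only edges with no hub endpoint.
    residual = [(u, v) for u, v in edges if u not in hubs and v not in hubs]
    vertices = sorted({x for e in residual for x in e})
    part = {v: i % k for i, v in enumerate(vertices)}  # stand-in partitioner

    # Refill: a hub becomes an agent wherever it has a partitioned neighbor.
    agents = {h: set() for h in hubs}
    for u, v in edges:
        if u in hubs and v in part:
            agents[u].add(part[v])
        if v in hubs and u in part:
            agents[v].add(part[u])
    return part, agents

part, agents = agent_plus_partition(
    [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")], threshold=3, k=2)
# "hub" is peeled off; the residual edge (a, b) is partitioned normally,
# and "hub" reappears as agents in the partitions of its neighbors.
```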
Citations: 2
Region-Based May-Happen-in-Parallel Analysis for C Programs
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.98
Peng Di, Yulei Sui, Ding Ye, Jingling Xue
The C programming language continues to play an essential role in the development of system software. May-Happen-in-Parallel (MHP) analysis is the basis of many other analyses and optimisations for concurrent programs. Existing MHP analyses that work well for programming languages such as X10 are often not effective for C (with Pthreads). This paper presents a new MHP algorithm for C that operates at the granularity of code regions rather than individual statements in a program. A flow-sensitive Happens-Before (HB) analysis is performed to account for the fork-join semantics of Pthreads on an interprocedural, thread-sensitive control-flow-graph representation of a program, enabling the HB relations among its statements to be discovered. All statements that share the same HB properties are then grouped into one region. As a result, computing the MHP information for all pairs of statements in a program reduces to inferring the HB relations among its regions. We have implemented our algorithm in LLVM-3.5.0 and evaluated it using 14 programs from the SPLASH2 and PARSEC benchmark suites. Our preliminary results show that our approach is more precise than two existing MHP analyses yet computationally comparable with the fastest of them.
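The region-based reduction above can be sketched in miniature: statements with identical happens-before summaries collapse into one region, and an MHP query is answered once per region pair. The HB summaries below are handed in as precomputed frozensets; deriving them from the thread-sensitive control flow graph is the paper's actual analysis and is not shown.

```python
# Toy sketch: group statements by HB summary, then answer MHP queries
# at region granularity. Summary values and names are illustrative.

def build_regions(hb_summary):
    """hb_summary: statement -> hashable HB summary (e.g., a frozenset)."""
    regions = {}
    for stmt, sig in hb_summary.items():
        regions.setdefault(sig, set()).add(stmt)
    return regions  # one region per distinct HB summary

def may_happen_in_parallel(sig_a, sig_b, happens_before):
    """Two regions may run in parallel iff neither happens before the other."""
    return ((sig_a, sig_b) not in happens_before
            and (sig_b, sig_a) not in happens_before)

# s1 and s2 share an HB summary, so one region (and one query) covers both.
hb_summary = {"s1": frozenset({"after_fork"}),
              "s2": frozenset({"after_fork"}),
              "s3": frozenset({"after_join"})}
regions = build_regions(hb_summary)
```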
Citations: 9
Automatic OpenCL Code Generation for Multi-device Heterogeneous Architectures
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.105
Pei Li, E. Brunet, François Trahay, C. Parrot, Gaël Thomas, R. Namyst
Using multiple accelerators, such as GPUs or Xeon Phis, is attractive for improving the performance of large data-parallel applications and for increasing the size of their workloads. However, writing an application for multiple accelerators remains challenging today, because going from a single accelerator to multiple ones requires dealing with potentially non-uniform domain decomposition, inter-accelerator data movement, and dynamic load balancing. Writing such code manually is time-consuming and error-prone. In this paper, we propose a new programming tool called STEPOCL, along with a new domain-specific language designed to simplify the development of applications for multiple accelerators. We evaluate both the performance and the usefulness of STEPOCL with three applications and show that: (i) the performance of an application written with STEPOCL scales linearly with the number of accelerators, (ii) the performance of an application written using STEPOCL competes with that of a handwritten version, (iii) larger workloads can run on multiple devices even when they do not fit in the memory of a single device, and (iv) thanks to STEPOCL, the number of lines of code required to write an application for multiple accelerators is reduced roughly tenfold.
Citations: 13
LPM: Concurrency-Driven Layered Performance Matching
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.97
Yuhang Liu, Xian-He Sun
Data access has become the preeminent performance bottleneck of computing. In this study, a Layered Performance Matching (LPM) model and its associated algorithm are proposed to match the request and reply speeds of each layer of a memory hierarchy in order to improve memory performance. The rationale of LPM is that the performance of each layer of a memory hierarchy should, and can, be optimized to closely match the requests of the layer directly above it. The LPM model considers data access concurrency and locality simultaneously. It reveals that increasing the effective overlap between the hits and misses of a higher layer alleviates the performance impact of the lower layer. The terms pure miss and pure miss penalty are introduced to measure the effectiveness of such hit-miss overlapping. By distinguishing between (general) misses and pure misses, we make LPM optimization practical and feasible. Our evaluation shows that data stall time can be reduced significantly with an optimized hardware configuration. We also achieve noticeable performance improvements by simply adopting smart LPM scheduling, without changing the underlying hardware configuration. Analysis and experimental results show that LPM is feasible and effective. It provides a novel and efficient way to cope with the ever-widening memory-wall problem and to optimize the design of the vital memory system.
Citations: 17
Software-Based Lightweight Multithreading to Overlap Memory-Access Latencies of Commodity Processors
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.71
Cihang Jiang, Youhui Zhang, Weimin Zheng
Emerging service applications operate on vast datasets that are kept in DRAM to minimize latency and improve throughput. A considerable portion of them exhibit irregular memory references, which cause serious locality issues. This paper presents a Software-based LIght weight Multithreading framework, SLIM, to conquer this problem on commodity hardware while keeping the simple style of multithreaded programming. The principle is fairly straightforward: when issuing an irregular memory reference, the current fine-granularity thread uses a primitive for asynchronous memory access and then switches itself out so that others can execute, overlapping long memory latencies. Meanwhile, SLIM tries to keep most of the thread contexts in the on-chip cache to reduce cache misses. The main challenge therefore lies in improving cache behavior despite the extra instructions involved in context switches and the smaller cache space left for applications. We propose a corresponding performance model to guide the design, which is also verified by tests. Moreover, an optimized synchronization mechanism has been designed. For some classic irregular applications, extensive tests have been carried out to explore the effects of system configurations on performance, including the aggressiveness of data prefetching, the distribution of tasks among cores/CPUs, etc. Results show that SLIM achieves higher performance than its counterpart using traditional threads under different data scales. Even compared to some tricky, manually optimized codes, its performance is comparable, while it preserves the simple programming manner of high-concurrency applications.
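The switch-on-memory-access control flow described above can be mimicked with Python generators standing in for SLIM's fine-granularity threads: each "thread" yields the address it wants, and a scheduler runs another thread while the (simulated) asynchronous access is in flight. This illustrates the control flow only; `worker`, `scheduler`, and the dict-as-memory are assumptions, not SLIM's API.

```python
# Toy model of latency overlapping via user-level thread switching:
# a worker yields on every memory request; the scheduler resumes it
# with the fetched value only after giving other workers a turn.

from collections import deque

def worker(name, addresses, results):
    total = 0
    for addr in addresses:
        value = yield addr      # "issue" the access and switch out
        total += value
    results[name] = total

def scheduler(workers, memory):
    ready = deque()
    for w in workers:
        ready.append((w, next(w)))       # run until first memory request
    while ready:
        w, addr = ready.popleft()
        try:
            nxt = w.send(memory[addr])   # reply arrives, resume the thread
            ready.append((w, nxt))
        except StopIteration:
            pass                         # this thread has finished

memory = {i: i * 10 for i in range(8)}
results = {}
scheduler([worker("t0", [1, 2], results), worker("t1", [3], results)], memory)
# results == {"t1": 30, "t0": 30}
```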
Citations: 0
Characterizing Loop-Level Communication Patterns in Shared Memory
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.85
Arya Mazaheri, A. Jannesari, Abdolreza Mirzaei, F. Wolf
Communication patterns extracted from parallel programs can provide a valuable source of information for parallel pattern detection, application auto-tuning, and runtime workload scheduling on heterogeneous systems. Once identified, such patterns can help find the most promising optimizations. Communication patterns can be detected using different methods, including sandbox simulation, memory profiling, and hardware counter analysis. However, these analyses usually suffer from high runtime and memory overhead, necessitating a trade-off between accuracy and resource consumption. More importantly, none of the existing methods exploits fine-grained communication patterns at the level of individual code regions. In this paper, we present an efficient tool based on the DiscoPoP profiler that characterizes the communication pattern of every hotspot in a shared-memory application. With the aid of static and dynamic code analysis, it produces a nested structure of communication patterns based on the program's loops. By employing asymmetric signature memory, the runtime overhead is around 225×, while the required amount of memory remains fixed. In comparison with other profilers, the proposed method is efficient enough to be used with real-world applications.
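In its simplest form, loop-level communication characterization can be pictured as replaying a memory trace and attributing each cross-thread write-then-read pair to its enclosing loop. The event format and attribution scheme below are illustrative assumptions, not DiscoPoP's actual instrumentation.

```python
# Toy sketch: track the last writer of each address; a read by a
# different thread counts as one unit of communication, attributed to
# the loop enclosing the read.

def communication_matrix(events):
    """events: (thread, op, address, loop_id) tuples, op in {'R', 'W'}."""
    last_writer = {}
    comm = {}   # (producer_thread, consumer_thread, loop_id) -> count
    for thread, op, addr, loop in events:
        if op == "W":
            last_writer[addr] = thread
        elif addr in last_writer and last_writer[addr] != thread:
            key = (last_writer[addr], thread, loop)
            comm[key] = comm.get(key, 0) + 1
    return comm

trace = [
    (0, "W", 0x10, "loop1"),
    (1, "R", 0x10, "loop1"),   # thread 0 -> thread 1, inside loop1
    (1, "W", 0x20, "loop2"),
    (0, "R", 0x20, "loop2"),   # thread 1 -> thread 0, inside loop2
    (0, "R", 0x10, "loop1"),   # same-thread reuse: no communication
]
comm = communication_matrix(trace)
```

Nesting these per-loop counts along the loop tree then yields the kind of hierarchical communication structure the paper describes.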
{"title":"Characterizing Loop-Level Communication Patterns in Shared Memory","authors":"Arya Mazaheri, A. Jannesari, Abdolreza Mirzaei, F. Wolf","doi":"10.1109/ICPP.2015.85","DOIUrl":"https://doi.org/10.1109/ICPP.2015.85","url":null,"abstract":"Communication patterns extracted from parallel programs can provide a valuable source of information for parallel pattern detection, application auto-tuning, and runtime workload scheduling on heterogeneous systems. Once identified, such patterns can help find the most promising optimizations. Communication patterns can be detected using different methods, including sandbox simulation, memory profiling, and hardware counter analysis. However, these analyses usually suffer from high runtime and memory overhead, necessitating a trade off between accuracy and resource consumption. More importantly, none of the existing methods exploit fine-grained communication patterns on the level of individual code regions. In this paper, we present an efficient tool based on Disco PoP profiler that characterizes the communication pattern of every hotspot in a shared-memory application. With the aid of static and dynamic code analysis, it produces a nested structure of communication patterns based on program's loops. By employing asymmetric signature memory, the runtime overhead is around 225× while the required amount of memory remains fixed. 
In comparison with other profilers, the proposed method is efficient enough to be used with real world applications.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130081756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Optimizing MapReduce Based on Locality of K-V Pairs and Overlap between Shuffle and Local Reduce 基于K-V对局部性和Shuffle与Local Reduce重叠的MapReduce优化
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.103
Jianjiang Li, Jie Wu, Xiaolei Yang, Shiqi Zhong
At present, MapReduce is the most popular programming model for Big Data processing. As a typical open-source implementation of MapReduce, Hadoop divides execution into map, shuffle, and reduce. In the map phase, following the principle of moving computation toward data, the load is basically balanced and network traffic is relatively small. However, shuffle is likely to cause a burst of network communication. At the same time, reducing without considering data skew leads to an imbalanced load and thus to performance degradation. This paper proposes a Locality-Enhanced Load Balance (LELB) algorithm, extends the execution flow of MapReduce to Map, Local reduce, Shuffle, and final Reduce (MLSR), and proposes a corresponding MLSR algorithm. The novel algorithms share the computation of reduce with shuffle and overlap the two phases in order to take full advantage of CPU and I/O resources.
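The MLSR flow (Map, Local reduce, Shuffle, final Reduce) can be sketched with a word-count example. This is a hedged illustration of the idea of reducing locally before and during shuffle to cut shuffle traffic, not the paper's Hadoop implementation; all function names are invented.

```python
from collections import Counter, defaultdict

def map_phase(chunk):
    # Map plus local reduce: emit per-chunk word counts instead of raw K-V pairs,
    # so far fewer pairs travel through shuffle.
    return Counter(chunk.split())

def shuffle(partials, n_reducers):
    # Partition keys across reducers; counts keep merging (local reduce)
    # as partial results arrive, overlapping with the shuffle itself.
    parts = [defaultdict(int) for _ in range(n_reducers)]
    for partial in partials:
        for key, count in partial.items():
            parts[hash(key) % n_reducers][key] += count
    return parts

def final_reduce(parts):
    # Partitions are key-disjoint, so a plain merge finishes the job.
    result = {}
    for part in parts:
        result.update(part)
    return result

chunks = ["a b a", "b c c"]
partials = [map_phase(c) for c in chunks]
print(final_reduce(shuffle(partials, 2)))
```

Because each mapper ships pre-aggregated counts and the shuffle stage keeps aggregating, the final reduce only merges disjoint partitions, which is the load-balancing effect the MLSR flow aims at.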
{"title":"Optimizing MapReduce Based on Locality of K-V Pairs and Overlap between Shuffle and Local Reduce","authors":"Jianjiang Li, Jie Wu, Xiaolei Yang, Shiqi Zhong","doi":"10.1109/ICPP.2015.103","DOIUrl":"https://doi.org/10.1109/ICPP.2015.103","url":null,"abstract":"At present, MapReduce is the most popular programming model for Big Data processing. As a typical open source implementation of MapReduce, Hadoop is divided into map, shuffle, and reduce. In the mapping phase, according to the principle moving computation towards data, the load is basically balanced and network traffic is relatively small. However, shuffle is likely to result in the outburst of network communication. At the same time, reduce without considering data skew will lead to an imbalanced load, and then performance degradation. This paper proposes a Locality-Enhanced Load Balance (LELB) algorithm, and then extends the execution flow of MapReduce to Map, Local reduce, Shuffle and final Reduce (MLSR), and proposes a corresponding MLSR algorithm. Use of the novel algorithms can share the computation of reduce and overlap with shuffle in order to take full advantage of CPU and I/O resources. 
The actual test results demonstrate that the execution performance using the LELB algorithm and the MLSR algorithm outperforms the execution performance using hadoop by up to 9.2% (for Merge Sort) and 14.4% (for Word Count).","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"189 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132707033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
SLA-Based Resource Scheduling for Big Data Analytics as a Service in Cloud Computing Environments 云计算环境下基于sla的大数据分析即服务资源调度
Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.60
Yali Zhao, R. Calheiros, G. Gange, K. Ramamohanarao, R. Buyya
Data analytics plays a significant role in gaining insight into big data, which can benefit decision making and problem solving in various application domains such as science, engineering, and commerce. Cloud computing is a suitable platform for Big Data Analytic Applications (BDAAs): it can greatly reduce application cost by elastically provisioning resources based on user requirements in a pay-as-you-go model. BDAAs are typically tailored to specific domains and are usually expensive. Moreover, it is difficult to provision resources for BDAAs with fluctuating resource requirements while keeping the resource cost low. As a result, BDAAs are mostly used by large enterprises. Therefore, it is necessary to have a general Analytics as a Service (AaaS) platform that can provision BDAAs to users in various domains as consumable services, in an easy-to-use way and at a lower price. To support the AaaS platform, our research focuses on efficiently scheduling Cloud resources for BDAAs to satisfy the Quality of Service (QoS) requirements of budget and deadline for data analytic requests and to maximize profit for the AaaS platform. We propose an admission control and resource scheduling algorithm, which not only satisfies the QoS requirements of requests as guaranteed in Service Level Agreements (SLAs), but also increases the profit of AaaS providers by offering a cost-effective resource scheduling solution. We propose the architecture and models for the AaaS platform and conduct experiments to evaluate the proposed algorithm.
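A deadline-and-budget admission check of the kind described can be sketched as follows. The VM types, prices, and the `admit` function are hypothetical illustrations of SLA-based admission control, not the paper's actual algorithm or models.

```python
def admit(request, vm_types):
    """Admit a data-analytics request only if some VM type can finish it
    within both the SLA deadline and the budget; return the cheapest
    feasible (cost, vm_name) choice, or None to reject the request."""
    feasible = []
    for vm in vm_types:
        runtime = request["work"] / vm["speed"]      # estimated hours
        cost = runtime * vm["price_per_hour"]
        if runtime <= request["deadline"] and cost <= request["budget"]:
            feasible.append((cost, vm["name"]))
    return min(feasible) if feasible else None       # cheapest keeps profit margin

# Hypothetical VM catalogue and request.
vm_types = [
    {"name": "small", "speed": 1.0, "price_per_hour": 0.10},
    {"name": "large", "speed": 4.0, "price_per_hour": 0.50},
]
req = {"work": 8.0, "deadline": 4.0, "budget": 1.50}
print(admit(req, vm_types))   # large finishes in 2 h at $1.00; small misses the 4 h deadline
```

Rejecting requests that cannot meet their SLA up front avoids penalty costs, while choosing the cheapest feasible VM is one simple way to trade QoS guarantees against provider profit.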
{"title":"SLA-Based Resource Scheduling for Big Data Analytics as a Service in Cloud Computing Environments","authors":"Yali Zhao, R. Calheiros, G. Gange, K. Ramamohanarao, R. Buyya","doi":"10.1109/ICPP.2015.60","DOIUrl":"https://doi.org/10.1109/ICPP.2015.60","url":null,"abstract":"Data analytics plays a significant role in gaining insight of big data that can benefit in decision making and problem solving for various application domains such as science, engineering, and commerce. Cloud computing is a suitable platform for Big Data Analytic Applications (BDAAs) that can greatly reduce application cost by elastically provisioning resources based on user requirements and in a pay as you go model. BDAAs are typically catered for specific domains and are usually expensive. Moreover, it is difficult to provision resources for BDAAs with fluctuating resource requirements and reduce the resource cost. As a result, BDAAs are mostly used by large enterprises. Therefore, it is necessary to have a general Analytics as a Service (AaaS) platform that can provision BDAAs to users in various domains as consumable services in an easy to use way and at lower price. To support the AaaS platform, our research focuses on efficiently scheduling Cloud resources for BDAAs to satisfy Quality of Service (QoS) requirements of budget and deadline for data analytic requests and maximize profit for the AaaS platform. We propose an admission control and resource scheduling algorithm, which not only satisfies QoS requirements of requests as guaranteed in Service Level Agreements (SLAs), but also increases the profit for AaaS providers by offering a cost-effective resource scheduling solution. We propose the architecture and models for the AaaS platform and conduct experiments to evaluate the proposed algorithm. 
Results show the efficiency of the algorithm in SLA guarantee, profit enhancement, and cost saving.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116505726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 36