2015 44th International Conference on Parallel Processing: Latest Papers, Page 8

Accelerating I/O Performance of Big Data Analytics on HPC Clusters through RDMA-Based Key-Value Store

2015 44th International Conference on Parallel Processing

Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.79

Nusrat S. Islam, D. Shankar, Xiaoyi Lu, Md. Wasi-ur-Rahman, D. Panda

Hadoop Distributed File System (HDFS) is the underlying storage engine of many Big Data processing frameworks such as Hadoop MapReduce, HBase, Hive, and Spark. Even though HDFS is well known for its scalability and reliability, its requirement for a large amount of local storage space makes HDFS deployment challenging on HPC clusters. Moreover, HPC clusters usually have large installations of parallel file systems such as Lustre. In this study, we propose a novel design to integrate HDFS with Lustre through a high-performance key-value store. We design a burst buffer system using RDMA-based Memcached and present three schemes to integrate HDFS with Lustre through this buffer layer, considering different aspects of I/O, data locality, and fault tolerance. Our proposed schemes can ensure performance improvement for Big Data applications on HPC clusters while reducing the local storage requirement. Performance evaluations show that our design can improve the write performance of TestDFSIO by up to 2.6x over HDFS and 1.5x over Lustre. The gain in read throughput is up to 8x. Sort execution time is reduced by up to 28% over Lustre and 19% over HDFS. Our design can also significantly benefit I/O-intensive workloads compared to both HDFS and Lustre.
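
The buffer-layer idea can be illustrated with a minimal sketch. This is a toy under stated assumptions, not any of the three schemes from the paper: the class `BurstBuffer` is hypothetical, an in-memory ordered dictionary stands in for the RDMA-based Memcached layer, a plain dictionary stands in for Lustre, and a write-back policy with oldest-first eviction is assumed.

```python
from collections import OrderedDict


class BurstBuffer:
    """Toy burst buffer: a small, fast key-value layer in front of a slow store."""

    def __init__(self, capacity_blocks):
        self.capacity_blocks = capacity_blocks
        self.buffer = OrderedDict()  # stand-in for the key-value store
        self.dirty = set()           # buffered keys not yet persisted
        self.backing = {}            # stand-in for the parallel file system

    def write(self, key, block):
        # Absorb the write in the fast layer and mark it as not yet persisted.
        self.buffer[key] = block
        self.buffer.move_to_end(key)
        self.dirty.add(key)
        # When the buffer overflows, persist the oldest block before dropping it.
        while len(self.buffer) > self.capacity_blocks:
            old_key, old_block = self.buffer.popitem(last=False)
            if old_key in self.dirty:
                self.backing[old_key] = old_block
                self.dirty.discard(old_key)

    def flush(self):
        # Persist every dirty block that is still buffered.
        for key in list(self.dirty):
            self.backing[key] = self.buffer[key]
        self.dirty.clear()

    def read(self, key):
        # Serve from the buffer when possible, otherwise from the backing store.
        if key in self.buffer:
            return self.buffer[key], "buffer"
        return self.backing[key], "backing"
```

In this sketch, recently written blocks are read back from the fast layer, while evicted blocks remain available from the backing store, so no data is lost when the buffer is smaller than the data set.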

Citations: 27

TAPS: Software Defined Task-Level Deadline-Aware Preemptive Flow Scheduling in Data Centers

2015 44th International Conference on Parallel Processing

Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.75

Lili Liu, Dan Li, Jianping Wu

Many data center applications have deadline requirements, which call for deadline awareness in network transport. A flow is useful only if it completes within its deadline. Transport protocols in current data centers try to share network resources fairly and are deadline-agnostic. Several recent works try to address the problem by making as many flows as possible meet their deadlines. However, for many data center applications, a task cannot be completed until its last flow finishes, which means that the bandwidth consumed by completed flows is wasted if some flows in the task cannot meet their deadlines. In this paper we design a task-level deadline-aware preemptive flow scheduling scheme (TAPS), which aims to make more tasks meet their deadlines. We leverage software defined networking (SDN) technology and generalize SDN from flow-level awareness to task-level awareness. The scheduling algorithm runs on the SDN controller, which decides whether a flow should be accepted or discarded, pre-allocates the transmission time slices, and computes the routing paths for accepted flows. Extensive flow-level simulations demonstrate that TAPS outperforms Varys, Baraat, PDQ (Preemptive Distributed Quick flow scheduling), D3 (Deadline-Driven Delivery control protocol), and Fair Sharing transport protocols in deadline-sensitive data center environments. A simple implementation on real systems also demonstrates that TAPS makes highly effective use of the network bandwidth in data centers.
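
The difference between flow-level and task-level admission can be sketched as follows. This is a simplified illustration, not the TAPS algorithm itself: it assumes a single bottleneck link, assumes all flows are released at time 0, uses an earliest-deadline-first feasibility check, and ignores routing and time-slice allocation. The function names `edf_feasible` and `admit_tasks` are hypothetical.

```python
def edf_feasible(flows, capacity):
    """Check whether every flow meets its deadline on one link under EDF.

    flows: list of (size, deadline) pairs, all released at time 0.
    capacity: link capacity in size units per time unit.
    """
    finish = 0.0
    for size, deadline in sorted(flows, key=lambda f: f[1]):
        finish += size / capacity
        if finish > deadline:
            return False
    return True


def admit_tasks(tasks, capacity):
    """Admit a task only if ALL of its flows fit next to already admitted flows.

    tasks: dict mapping task name to a list of (size, deadline) flows,
    considered in insertion order. Returns the names of admitted tasks.
    """
    admitted_flows = []
    admitted = []
    for name, flows in tasks.items():
        candidate = admitted_flows + flows
        if edf_feasible(candidate, capacity):
            admitted_flows = candidate
            admitted.append(name)
        # Otherwise the whole task is rejected, so no bandwidth is spent
        # on flows whose task could not finish on time anyway.
    return admitted
```

With this check, a task whose last flow would miss its deadline is rejected as a whole, which frees capacity for a later task that can finish completely.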

Citations: 2

Evaluating Latency-Sensitive Applications: Performance Degradation in Datacenters with Restricted Power Budget

2015 44th International Conference on Parallel Processing

Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.73

Song Wu, Chuxiong Yan, Haibao Chen, Hai Jin, Wenting Guo, Zhen Wang, Deqing Zou

For data centers with limited power supply, restricting the servers' power budget (i.e., the maximal power provided to servers) is an efficient approach to increasing the server density (the number of servers per rack), which can effectively improve the cost-effectiveness of the data centers. However, this approach may also affect the performance of applications running on the servers. Hence, the prerequisite for adopting the approach in data centers is to precisely evaluate the application performance degradation caused by restricting the servers' power budget. Unfortunately, existing evaluation methods are inaccurate because they are either improper or coarse-grained, especially for the latency-sensitive applications widely deployed in data centers. In this paper, we analyze why state-of-the-art methods are not appropriate for evaluating the performance degradation of latency-sensitive applications under power restriction, and we propose a new evaluation method that provides a fine-grained way to precisely describe and evaluate such degradation. We verify our proposed method with a real-world application and traces from Tencent's data center with 25328 servers. The experimental results show that our method is much more accurate than the state of the art, and that we can significantly increase datacenter efficiency by saving servers' power budget while keeping the applications' performance degradation within a controllable and acceptable range.

Citations: 2

PLP: Protecting Location Privacy Against Correlation-Analysis Attack in Crowdsensing

2015 44th International Conference on Parallel Processing

Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.20

Shanfeng Zhang, Q. Ma, Tong Zhu, Kebin Liu, Lan Zhang, Wenbo He, Yunhao Liu

Crowdsensing applications require individuals to share local and personal sensing data with others to produce valuable knowledge and services. Meanwhile, this has raised concerns, especially for location privacy. Users may wish to prevent privacy leaks and publish as many non-sensitive contexts as possible. Simply suppressing sensitive contexts is vulnerable to adversaries exploiting spatio-temporal correlations in users' behavior. In this work, we present PLP, a crowdsensing scheme which preserves privacy while maximizing the amount of data collected by filtering a user's context stream. PLP leverages a conditional random field to model the spatio-temporal correlations among the contexts, and proposes a speed-up algorithm to learn the weaknesses in the correlations. Even if the adversaries are strong enough to know the filtering system and the weaknesses, PLP can still provably preserve privacy, with little computational cost for online operations. PLP is evaluated and validated over two real-world smartphone context traces of 34 users. The experimental results show that PLP efficiently protects privacy without sacrificing much utility.

Citations: 7

Zebra: An East-West Control Framework for SDN Controllers

2015 44th International Conference on Parallel Processing

Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.70

Haisheng Yu, Keqiu Li, Heng Qi, Wenxin Li, Xiaoyi Tao

Traditional networks are surprisingly fragile and difficult to manage. Software Defined Networking (SDN) has gained significant attention from both academia and industry, as it simplifies network management through centralized configuration. Existing work primarily focuses on networks of limited scope, such as data centers and enterprises, which hinders the development of SDN in large-scale network environments. One way of enabling communication between data centers, enterprises, and ISPs in a large-scale network is to establish a standard communication mechanism between these entities. In this paper, we propose Zebra, a framework for enabling communication between different SDN domains. Zebra has two modules: the Heterogeneous Controller Management (HCM) module and the Domain Relationships Management (DRM) module. HCM collects network information from a group of controllers with no interconnection and generates a domain-wide network view. DRM collects network information from other domains to generate a global network view. Moreover, HCM supports different SDN controllers, such as Floodlight and Maestro. To test this framework, we develop a prototype system and give some experimental results.

Citations: 21

Dual-centric Data Center Network Architectures

2015 44th International Conference on Parallel Processing

Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.77

Dawei Li, Jie Wu, Zhiyong Liu, Fa Zhang

Existing Data Center Network (DCN) architectures are classified into two categories: switch-centric and server-centric architectures. In switch-centric DCNs, routing intelligence is placed on switches, and each server usually uses only one port of the Network Interface Card (NIC) to connect to the network. In server-centric DCNs, switches are only used as crossbars, and routing intelligence is placed on servers, where multiple NIC ports may be used. In this paper, we formally introduce a new category of DCN architectures: the dual-centric DCN architectures, where routing intelligence can be placed on both switches and servers. We propose two typical dual-centric DCN architectures, FSquare and Rectangle, both of which are based on the folded Clos topology. FSquare is a high-performance DCN architecture in which the diameter is small and the bisection bandwidth is large; however, the DCN power consumption per server in FSquare is high. Rectangle significantly reduces the DCN power consumption per server compared to FSquare, at the cost of some performance: Rectangle has a larger diameter and a smaller bisection bandwidth. By investigating FSquare and Rectangle, and by comparing them with existing architectures, we demonstrate that these two novel dual-centric architectures enjoy the advantages of both switch-centric and server-centric designs, have various desirable properties for practical data centers, and provide flexible choices in designing DCN architectures.

Citations: 9

Design and Implementation of a Highly Efficient DGEMM for 64-Bit ARMv8 Multi-core Processors

2015 44th International Conference on Parallel Processing

Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.29

Feng Wang, Hao Jiang, Ke Zuo, Xing Su, Jingling Xue, Canqun Yang

This paper presents the design and implementation of a highly efficient Double-precision General Matrix Multiplication (DGEMM) based on OpenBLAS for 64-bit ARMv8 eight-core processors. We adopt a theory-guided approach by first developing a performance model for this architecture and then using it to guide our exploration. The key enabler for a highly efficient DGEMM is a highly optimized inner kernel, GEBP, developed in assembly language. We have obtained GEBP by (1) maximizing its compute-to-memory access ratios across all levels of the memory hierarchy in the ARMv8 architecture, with its performance-critical block sizes being determined analytically, and (2) optimizing its computations by exploiting loop unrolling, instruction scheduling, and software-implemented register rotation, and by taking advantage of A64 instructions to support efficient FMA operations, data transfers, and prefetching. We have compared our DGEMM implemented in OpenBLAS with another implemented in ATLAS (also built on a highly optimized GEBP in assembly). Our implementation outperforms the one in ATLAS by improving the peak performance (efficiency) of DGEMM from 3.88 Gflops (80.9%) to 4.19 Gflops (87.2%) on one core and from 30.4 Gflops (79.2%) to 32.7 Gflops (85.3%) on eight cores. These results translate into substantial performance (efficiency) improvements of 7.79% on one core and 7.70% on eight cores. In addition, the efficiency of our implementation on one core is very close to the theoretical upper bound of 91.5% obtained from micro-benchmarking. Our parallel implementation achieves good performance and scalability under varying thread counts across the range of matrix sizes evaluated.
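
The reported improvement percentages can be reproduced from the efficiency figures in the abstract. The short calculation below is only a worked check of that arithmetic; the helper name `relative_gain_percent` is hypothetical, and the implied per-core peak is derived from the quoted numbers rather than stated in the abstract.

```python
def relative_gain_percent(old, new):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Efficiency figures quoted in the abstract (percent of peak).
one_core_gain = relative_gain_percent(80.9, 87.2)
eight_core_gain = relative_gain_percent(79.2, 85.3)

# Peak per core implied by 4.19 Gflops at 87.2% efficiency (derived, not quoted).
implied_peak_gflops = 4.19 / 0.872

print(f"one core:    {one_core_gain:.2f}%")
print(f"eight cores: {eight_core_gain:.2f}%")
print(f"implied peak per core: about {implied_peak_gflops:.1f} Gflops")
```

The efficiency ratios give roughly 7.79% and 7.70%, matching the abstract, and both quoted data points on one core imply a peak of about 4.8 Gflops per core.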

Citations: 11

Reducing Synchronization Cost in Distributed Multi-resource Allocation Problem

2015 44th International Conference on Parallel Processing

Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.63

Jonathan Lejeune, L. Arantes, Julien Sopena, Pierre Sens

Generalized distributed mutual exclusion algorithms allow processes to concurrently access a set of shared resources. However, they must ensure exclusive access to each resource. In order to avoid deadlocks, many of them rely on the strong assumption of prior knowledge about conflicts between processes' requests. Other approaches, which do not require such knowledge, exploit broadcast mechanisms or a global lock, degrading message complexity and synchronization cost. In this paper we propose a new solution for shared resource allocation that reduces the communication between non-conflicting processes without prior knowledge of process conflicts. Performance evaluation results show that our solution improves the resource use rate by a factor of up to 20 compared to a global-lock-based algorithm.
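
The notion of conflicting requests can be made concrete with a small sketch. This is a centralized toy for illustration only and does not reproduce the distributed algorithm from the paper: two requests are taken to conflict when their resource sets intersect, and the hypothetical function `grant_concurrently` greedily grants, in arrival order, every request that is disjoint from the resources already held.

```python
def conflicts(resources_a, resources_b):
    """Two requests conflict if they share at least one resource."""
    return not set(resources_a).isdisjoint(resources_b)


def grant_concurrently(requests):
    """Greedily grant non-conflicting requests in arrival order.

    requests: list of (process, resources) pairs.
    Returns the processes that can hold their resources at the same time.
    """
    held = set()
    granted = []
    for process, resources in requests:
        needed = set(resources)
        if held.isdisjoint(needed):
            held |= needed            # each resource stays exclusively owned
            granted.append(process)
    return granted
```

A global lock would serialize all requests, whereas in this sketch every request that touches disjoint resources proceeds at once, which is the source of the higher resource use rate.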

Citations: 5

SZTS: A Novel Big Data Transportation System Benchmark Suite

2015 44th International Conference on Parallel Processing

Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.91

Wen Xiong, Zhibin Yu, L. Eeckhout, Zhengdong Bei, Fan Zhang, Chengzhong Xu

Data analytics is at the core of the supply chain for both products and services in modern economies and societies. Big data workloads, however, are placing unprecedented demands on computing technologies, calling for a deep understanding and characterization of these emerging workloads. In this paper, we propose the Shenzhen Transportation System (SZTS), a novel big data Hadoop benchmark suite comprising real-life transportation analysis applications with real-life input data sets from Shenzhen, China. SZTS uniquely focuses on a specific, real-life application domain, whereas other existing Hadoop benchmark suites, such as HiBench and CloudRank-D, consist of generic algorithms with synthetic inputs. We perform a cross-layer workload characterization at both the job and microarchitecture levels, revealing unique characteristics of SZTS compared to existing Hadoop benchmarks as well as the general-purpose multi-core PARSEC benchmarks. We also study the sensitivity of workload behavior with respect to input data size, and propose a methodology for identifying representative input data sets.

Citations: 10

Optimizing Image Sharpening Algorithm on GPU

2015 44th International Conference on Parallel Processing

Pub Date : 2015-09-01 DOI: 10.1109/ICPP.2015.32

Mengran Fan, Haipeng Jia, Yunquan Zhang, Xiaojing An, Ting Cao

Sharpness is an algorithm used to sharpen images. As image sizes, resolutions, and real-time processing requirements increase, the performance of sharpness needs to improve greatly. Because sharpness computes each pixel independently, it offers a good opportunity to use a GPU to accelerate it substantially. However, one challenge in porting it to the GPU is that sharpness executes in several stages. Each stage has its own characteristics and may or may not have data dependencies on other stages. Based on those characteristics, this paper proposes a complete solution to implement and optimize sharpness on GPU. Our solution includes five major and effective techniques: Data Transfer Optimization, Kernel Fusion, Vectorization for Data Locality, Border Optimization, and Reduction Optimization. Experiments show that, compared to a well-optimized CPU version, our GPU solution achieves speedups of 10.7 to 69.3 times for different image sizes on an AMD FirePro W8000 GPU.
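
The per-pixel independence that makes sharpening a good fit for GPUs can be seen in a minimal CPU sketch. This is a generic 3x3 sharpening convolution, not the multi-stage sharpness algorithm optimized in the paper; the kernel values, the choice to copy border pixels unchanged, and the function name `sharpen` are assumptions made for illustration.

```python
# A common 3x3 sharpening kernel (assumed here; not taken from the paper).
KERNEL = (
    (0, -1, 0),
    (-1, 5, -1),
    (0, -1, 0),
)


def sharpen(image):
    """Sharpen a grayscale image given as a list of rows of 0..255 integers."""
    height, width = len(image), len(image[0])
    # Border pixels are copied unchanged; only interior pixels are filtered.
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += KERNEL[ky][kx] * image[y + ky - 1][x + kx - 1]
            # Each output pixel depends only on the input image, so every
            # (y, x) iteration could run as an independent GPU thread.
            out[y][x] = max(0, min(255, acc))
    return out
```

Because no output pixel reads another output pixel, the two outer loops map directly onto a grid of GPU threads; the special handling of border pixels is the kind of case a border optimization has to address.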

Citations: 0