
Proceedings of the 48th International Conference on Parallel Processing: Latest Publications

Building Scalable NVM-based B+tree with HTM
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337827
Mengxing Liu, Jiankai Xing, Kang Chen, Yongwei Wu
Emerging non-volatile memory (NVM) opens an opportunity to build durable data structures. However, building a highly efficient complex data structure such as a B+tree on NVM is not easy. We investigate the essential performance bottleneck of NVM-based B+trees. Even with a single-core CPU, performance is limited by the atomic-write size, which plays an essential role in the trade-off between persistence overhead and keeping leaf-node entries sorted. In the multi-core setting, overlapping concurrency and persistency is key to system scalability. Based on this analysis, we propose RNTree, a durable NVM-based B+tree that uses hardware transactional memory (HTM). Our way of using HTM addresses both problems simultaneously. (1) HTM can use cache-line granularity to provide a larger atomic-write size. Based on this, we propose a new slot-array approach that tracks the order of entries in leaf nodes while still reducing the number of persistent instructions. (2) With careful design, RNTree moves slow persistent instructions out of critical sections and introduces a dual-slot-array design to extract more concurrency. For a single thread, RNTree achieves 1.44×/4.2× higher throughput for single-key operations and range queries, respectively. For multiple threads, the throughput of RNTree is 2.3× higher than that of state-of-the-art works.
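As a concrete illustration of the slot-array idea described in the abstract, here is a minimal sketch: leaf entries are appended out of place, and a compact slot array records their sorted order, so only the small slot array would need to be persisted atomically to keep entries logically sorted. All names and sizes are illustrative assumptions, not RNTree's actual layout.

```python
# Sketch of a slot-array leaf node: entries are stored unsorted; the slot
# array holds indices into the entry storage in sorted-key order. In an NVM
# design, only the slot array (small enough for an atomic write) would be
# flushed on insert, instead of shifting and persisting the entries themselves.

class LeafNode:
    def __init__(self, capacity=16):
        self.keys = [None] * capacity    # unsorted entry storage
        self.values = [None] * capacity
        self.slots = []                  # sorted order, as indices into keys

    def insert(self, key, value):
        pos = len(self.slots)            # append entry out of place
        self.keys[pos] = key
        self.values[pos] = value
        # update the slot array to reflect sorted key order
        rank = 0
        while rank < len(self.slots) and self.keys[self.slots[rank]] < key:
            rank += 1
        self.slots.insert(rank, pos)
        # a real NVM implementation would flush only the slot array here

    def sorted_items(self):
        return [(self.keys[i], self.values[i]) for i in self.slots]
```

The sketch omits node splits, deletion, and the actual flush/fence instructions; it only shows how a slot array decouples physical entry order from logical sorted order.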
Citations: 15
Tessellating Star Stencils
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337835
Liang Yuan, Shan Huang, Yunquan Zhang, Hang Cao
Stencil computations represent a very common class of nested loops in scientific and engineering applications. Tiling, which has been studied exhaustively, is one of the most powerful transformation techniques for exploiting data locality and parallelism. Existing work often handles different stencil shapes uniformly. This paper first presents a concept called the natural block to identify the difference between star and box stencils. We then propose a new two-level tessellation scheme for star stencils, in which the natural block, as well as its successors, tessellates the spatial domain, while their extensions along the time dimension form a tessellation of the iteration space. Furthermore, a novel implementation technique called double updating is developed specifically for star stencils to improve the in-core data reuse pattern. Evaluation results demonstrate the effectiveness of the approach.
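For readers unfamiliar with the stencil shape the paper targets, a 5-point "star" stencil in 2D updates each point from its north/south/east/west neighbours. This plain, untiled sketch only illustrates the access pattern, not the paper's tessellation scheme.

```python
# One time step of a 5-point star stencil on a square grid: each interior
# point is averaged with its four axis-aligned neighbours. Boundary values
# are copied unchanged. The 0.2 weight is illustrative.

def star_stencil_step(grid):
    n = len(grid)
    new = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.2 * (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                               + grid[i][j - 1] + grid[i][j + 1])
    return new
```

A box stencil would additionally read the four diagonal neighbours; the paper's natural-block concept formalizes exactly this difference.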
Citations: 8
N-Code: An Optimal RAID-6 MDS Array Code for Load Balancing and High I/O Performance
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337829
Ping Xie, Zhu Yuan, Jianzhong Huang, X. Qin
Existing RAID-6 codes are developed to optimize either reads or writes for storage systems. To improve both read and write operations, this paper proposes a novel RAID-6 MDS array code called N-Code. N-Code exhibits three salient features: (i) read performance. N-Code distributes both horizontal parity chains and horizontal parities across disks, without generating a dedicated parity disk. Such a parity layout not only lets all disks serve normal reads, but also allows contiguous data elements to share the same horizontal chain to optimize degraded reads; (ii) write performance. Diagonal parities are distributed across disks in a decentralized manner to optimize partial-stripe writes, and horizontal parity chains enable N-Code to reduce the I/O cost of partial-stripe writes by merging I/O operations; and (iii) balancing performance. Decentralized horizontal/diagonal parities support I/O-balancing optimization for single writes. A theoretical analysis indicates that, in addition to optimal storage efficiency, N-Code achieves optimal complexity for both encoding/decoding computations and update operations. Empirical results show that N-Code delivers higher normal-read, degraded-read, and partial-stripe-write performance than seven popular baseline RAID-6 codes. In particular, N-Code accelerates partial-stripe writes by 32%-66% relative to horizontal codes, and improves degraded reads by 32%-53% compared to vertical codes. Furthermore, compared to the baseline codes, N-Code enhances load balancing by a factor between 1.19 and 9.09 for single-write workloads, and between 1.3 and 6.92 for read-write mixed workloads.
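The horizontal parity chains mentioned above work like ordinary RAID parity: a parity block is the XOR of the data blocks in its chain, so any single lost block can be rebuilt by XOR-ing the survivors. The sketch below shows this generic XOR mechanics only; it is not N-Code's specific decentralized layout.

```python
# Generic XOR parity, the building block of horizontal parity chains:
# parity = d0 ^ d1 ^ ... ^ dk, so any one missing block equals the XOR
# of the parity with the remaining data blocks.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [b"\x01\x02", b"\x0f\x00", b"\x10\x20"]  # toy data blocks
parity = xor_blocks(data)
# rebuild block 1 from the parity and the surviving data blocks
recovered = xor_blocks([data[0], data[2], parity])
```

RAID-6 tolerates two failures by adding a second, independently computed (diagonal) parity per stripe; reconstruction then solves for the two missing blocks using both parity equations.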
Citations: 8
Network Congestion Avoidance through Packet-chaining Reservation
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337874
Ketong Wu, Dezun Dong, Cunlu Li, Shan Huang, Yi Dai
Endpoint congestion is a bottleneck in high-performance computing (HPC) networks and severely impacts system performance, especially for latency-sensitive applications. For long messages (or flows) whose duration is far larger than the round-trip time (RTT), endpoint congestion can be effectively mitigated by proactive or reactive countermeasures that dynamically control the injection rate of each source to a proper level. However, many HPC applications produce hybrid traffic, a mix of short and long messages dominated by short ones. Existing proactive congestion-avoidance methods face the great challenge of scheduling the rapidly changing traffic patterns caused by these short messages. In this paper, we combine the advantages of proactive and reactive congestion-avoidance techniques and propose the Packet-chaining Reservation Protocol (PCRP), which dynamically balances flows scheduled proactively against packets subject to reactive network conditions. We select chained packets as a flexible reservation granularity between a whole flow and a single packet. We allow small flows to be transmitted speculatively without being discarded and give them higher priority across the entire network. PCRP can respond quickly to network conditions, effectively avoid the formation of endpoint congestion, and reduce the average flow delay. We conduct extensive experiments to evaluate PCRP and compare it with the state-of-the-art proactive reservation-based protocols, the Speculative Reservation Protocol (SRP) and the Bilateral Flow Reservation Protocol (BFRP). The simulation results demonstrate that our design reduces flow latency by 50.2% for hotspot traffic and 28.38% for uniform traffic.
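The reservation granularity described in the abstract can be pictured as follows: a long flow is split into fixed-size packet chains, each reserved as a unit, while a flow short enough to fit in one chain is sent speculatively without reservation. The chain length and speculative threshold below are illustrative parameters, not values from the paper.

```python
# Sketch of packet-chaining reservation granularity: long flows are split
# into chains that each get a reservation; short flows bypass reservation
# and are transmitted speculatively. Thresholds are assumed for illustration.

CHAIN_LEN = 4          # packets per reservation unit (assumed)
SPECULATIVE_MAX = 4    # flows up to this size skip reservation (assumed)

def schedule_flow(packets):
    if len(packets) <= SPECULATIVE_MAX:
        return [("speculative", packets)]
    chains = [packets[i:i + CHAIN_LEN]
              for i in range(0, len(packets), CHAIN_LEN)]
    return [("reserved", c) for c in chains]
```

Chains sit between the two extremes the abstract names: reserving a whole flow (coarse, poor for rapidly changing short-message traffic) and reserving per packet (fine, but high reservation overhead).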
Citations: 13
Nested Virtualization Without the Nest
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337840
Mathieu Bacou, Grégoire Todeschi, A. Tchana, D. Hagimont
With the increasing popularity of containers, managing them on top of virtual machines has become common practice, called nested virtualization. This paper presents BrFusion and Hostlo, two solutions, each addressing one of two networking issues of nested virtualization: network-virtualization duplication and virtual-machine-bounded pod deployments. The first issue lengthens network packet paths, while the second leads to resource fragmentation. For instance, with respect to the first issue, we measured a throughput degradation of about 68% and a latency increase of about 31% compared with a single networking layer. We prototype BrFusion and Hostlo on Linux KVM/QEMU, Docker, and Kubernetes. The evaluation results show that BrFusion achieves the same performance as a single-layer virtualization deployment. Concerning Hostlo, the results show that more than 11% of cloud clients see their cloud utilization cost reduced by up to 40%.
Citations: 4
Near-Data Processing-Enabled and Time-Aware Compaction Optimization for LSM-tree-based Key-Value Stores
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337855
Hui Sun, Wei Liu, Jianzhong Huang, Song Fu, Zhi Qiao, Weisong Shi
With the growing volume of storage systems, traditional relational databases cannot deliver the high performance required by big-data applications. As high-throughput alternatives to relational databases, LSM-tree-based key-value stores (KV stores for short) suffer degraded write performance during compaction under update-intensive workloads. To address this issue, we design and implement a time-aware compaction-optimization framework for KV stores called TStore. TStore exploits the near-data processing (NDP) model. It dynamically partitions compaction tasks between the host and an NDP-enabled device to minimize the total compaction time, and the partitioned tasks are executed by the host and the device in parallel. NDP-based devices exhibit low latency, high performance, and high bandwidth, which benefits key-value stores. TStore not only accomplishes compaction for KV stores but also improves overall performance by removing the compaction bottleneck. Results show that TStore with an NDP framework achieves 3.8x and 1.9x performance improvements over LevelDB and Co-KV under the db_bench workload. In addition, the TStore-enabled KV store outperforms LevelDB and Co-KV by factors of 3.6x and 1.9x in throughput and by 72.0% and 48.9% in latency, respectively, under realistic workloads generated by YCSB.
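Since the host and the NDP device compact in parallel, the total compaction time is the maximum of the two sides' work, so partitioning amounts to balancing estimated task costs across the two sides. A minimal greedy sketch of this idea, with an illustrative cost model that is not TStore's:

```python
# Greedy host/device split of compaction tasks: assign each task (largest
# first) to the currently less-loaded side; the makespan is the max of the
# two sides because they run in parallel.

def partition_compaction(task_costs):
    host, device = [], []
    host_t = device_t = 0.0
    for cost in sorted(task_costs, reverse=True):
        if host_t <= device_t:
            host.append(cost)
            host_t += cost
        else:
            device.append(cost)
            device_t += cost
    return host, device, max(host_t, device_t)
```

A real system would weight costs by the asymmetric compute and bandwidth capabilities of host and device rather than treating both sides as identical.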
Citations: 3
Lightweight Fault Tolerance in Pregel-Like Systems
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337823
Da Yan, James Cheng, Hongzhi Chen, Cheng Long, P. Bangalore
Pregel-like systems are popular for iterative graph processing thanks to their user-friendly vertex-centric programming model. However, existing Pregel-like systems adopt only a naïve checkpointing approach for fault tolerance, which saves a large amount of data about the state of the computation and significantly degrades failure-free execution performance. Advanced fault-tolerance and recovery techniques remain unexplored in the context of Pregel-like systems. This paper proposes a non-invasive lightweight checkpointing (LWCP) scheme that minimizes the data saved to each checkpoint; the additional data required for recovery are generated online from the saved data. This improvement yields a 10x speedup in checkpointing, and integrating it with a recently proposed log-based recovery approach can further speed up recovery when failures occur. Extensive experiments verify that our proposed LWCP techniques significantly improve the performance of both checkpointing and recovery in a Pregel-like system.
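The core idea above is that a checkpoint stores only the minimal state (for instance, vertex values) and that derivable data (for instance, the messages vertices would send) is regenerated online during recovery rather than saved. A toy sketch of that division, with an assumed graph model that is not the paper's system:

```python
# Lightweight-checkpoint sketch: save only vertex values; on recovery,
# regenerate each vertex's incoming messages from the saved values and
# the (statically known) graph topology, instead of checkpointing messages.

def checkpoint(vertex_values):
    return dict(vertex_values)          # only the minimal state is saved

def recover(saved, graph):
    values = dict(saved)
    # regenerate messages online: each vertex receives its neighbours' values
    messages = {v: [values[u] for u in graph[v]] for v in graph}
    return values, messages
```

A naive checkpoint would persist both dictionaries; the lightweight variant persists one and recomputes the other, trading a little recovery-time computation for much smaller checkpoints.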
Citations: 6
How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures
Pub Date : 2019-07-30 DOI: 10.1145/3337821.3337849
C. Pachajoa, M. Levonyak, W. Gansterer, J. Träff
We study algorithmic approaches for recovering from the failure of several compute nodes in the parallel preconditioned conjugate gradient (PCG) solver on large-scale parallel computers. In particular, we analyze and extend an exact state reconstruction (ESR) approach based on a method proposed by Chen (2011). In the ESR approach, the solver keeps redundant information from previous search directions, so that the solver state can be fully reconstructed if a node fails unexpectedly. ESR requires neither checkpointing nor external storage for saving dynamic solver data and incurs low overhead compared to the failure-free situation. In this paper, we improve the fault tolerance of the PCG algorithm based on the ESR approach. In particular, we support recovery from simultaneous or overlapping failures of several nodes for general sparsity patterns of the system matrix, which cannot be handled by Chen's method. For this purpose, we refine the strategy for storing redundant information across nodes. We analyze and implement our new method and perform numerical experiments with large sparse matrices from real-world applications on 128 nodes of the Vienna Scientific Cluster (VSC). For recovery from three simultaneous node failures, we observe average runtime overheads of only 2.8% to 55.0%. The overhead of the improved resilience depends on the sparsity pattern of the system matrix.
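To make the reconstructible state concrete, here is a plain (unpreconditioned, serial) conjugate gradient loop with the quantities an ESR-style scheme would keep redundantly highlighted: besides x, r, and p, the previous search direction is retained each iteration, since the CG recurrences relate consecutive iterates. The redundant remote storage is only mimicked by a local copy; this is a sketch of plain CG, not the paper's distributed recovery method.

```python
# Plain conjugate gradient for a small SPD system. The p_prev copy marks the
# extra state an exact-state-reconstruction approach would replicate across
# nodes so a failed node's portion could be rebuilt from the CG recurrences.

def cg(A, b, iters=50, tol=1e-10):
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # r = b - A x, with x = 0
    p = r[:]
    rr = sum(ri * ri for ri in r)
    for _ in range(iters):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rr / sum(p[i] * Ap[i] for i in range(n))
        p_prev = p[:]             # redundant copy kept for reconstruction
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rr_new = sum(ri * ri for ri in r)
        if rr_new < tol:
            break
        p = [r[i] + (rr_new / rr) * p[i] for i in range(n)]
        rr = rr_new
    return x
```

For the 2x2 system A = [[4, 1], [1, 3]], b = [1, 2], CG converges to x = (1/11, 7/11) in two iterations.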
Citations: 9
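For background on the solver being hardened above: the dynamic state of a PCG iteration is small (the iterate, residual, and search direction), which is what makes the ESR idea of keeping redundant search-direction information across nodes viable. Below is a minimal serial PCG sketch — not the authors' distributed, fault-tolerant implementation; the function name `pcg`, the Jacobi preconditioner, and the dense NumPy matrices are assumptions for illustration only.

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-10, max_iter=200):
    """Minimal serial preconditioned conjugate gradient for SPD A.

    The dynamic solver state is (x, r, z, p); ESR-style resilience
    keeps redundant search-direction information across nodes so that
    this state can be reconstructed exactly after a node failure.
    """
    x = np.zeros_like(b)
    r = b - A @ x            # initial residual
    z = M_inv @ r            # preconditioned residual
    p = z.copy()             # first search direction
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv @ r
        rz_new = r @ z
        beta = rz_new / rz   # update coefficient for the new direction
        p = z + beta * p
        rz = rz_new
    return x

# Toy SPD system with a Jacobi (diagonal) preconditioner.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
M_inv = np.diag(1.0 / np.diag(A))
x = pcg(A, b, M_inv)
```

On this 2x2 system the iteration converges in at most two steps, so the final residual norm is far below the tolerance.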
A 2D Parallel Triangle Counting Algorithm for Distributed-Memory Architectures
Pub Date : 2019-07-22 DOI: 10.1145/3337821.3337853
A. Tom, G. Karypis
Triangle counting is a fundamental graph analytic operation that is used extensively in network science and graph mining. As the size of the graphs that needs to be analyzed continues to grow, there is a requirement in developing scalable algorithms for distributed-memory parallel systems. To this end, we present a distributed-memory triangle counting algorithm, which uses a 2D cyclic decomposition to balance the computations and reduce the communication overheads. The algorithm structures its communication and computational steps such that it reduces its memory overhead and includes key optimizations that leverage the sparsity of the graph and the way the computations are structured. Experiments on synthetic and real-world graphs show that our algorithm obtains an average relative speedup range between 3.24 to 7.22 out of 10.56 across the datasets using 169 MPI ranks over the performance achieved by 16 MPI ranks. Moreover, we obtain an average speedup of 10.2 times on comparison with previously developed distributed-memory parallel algorithms.
Triangle counting is a fundamental graph analytic operation used extensively in network science and graph mining. As the graphs to be analyzed continue to grow in size, scalable algorithms for distributed-memory parallel systems are required. To this end, we present a distributed-memory triangle counting algorithm that uses a 2D cyclic decomposition to balance the computation and reduce communication overheads. The algorithm structures its communication and computation steps to reduce memory overhead, and includes key optimizations that exploit the sparsity of the graph and the way the computation is organized. Experiments on synthetic and real-world graphs show that, with 169 MPI ranks, our algorithm achieves average relative speedups between 3.24 and 7.22 out of an ideal 10.56 across the datasets, compared with the performance at 16 MPI ranks. Moreover, we obtain an average speedup of 10.2 times over previously developed distributed-memory parallel algorithms.
{"title":"A 2D Parallel Triangle Counting Algorithm for Distributed-Memory Architectures","authors":"A. Tom, G. Karypis","doi":"10.1145/3337821.3337853","DOIUrl":"https://doi.org/10.1145/3337821.3337853","url":null,"abstract":"Triangle counting is a fundamental graph analytic operation that is used extensively in network science and graph mining. As the size of the graphs that needs to be analyzed continues to grow, there is a requirement in developing scalable algorithms for distributed-memory parallel systems. To this end, we present a distributed-memory triangle counting algorithm, which uses a 2D cyclic decomposition to balance the computations and reduce the communication overheads. The algorithm structures its communication and computational steps such that it reduces its memory overhead and includes key optimizations that leverage the sparsity of the graph and the way the computations are structured. Experiments on synthetic and real-world graphs show that our algorithm obtains an average relative speedup range between 3.24 to 7.22 out of 10.56 across the datasets using 169 MPI ranks over the performance achieved by 16 MPI ranks. Moreover, we obtain an average speedup of 10.2 times on comparison with previously developed distributed-memory parallel algorithms.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122544308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
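The per-vertex work that a distributed triangle counter like the one above must partition is neighbor-set intersection. A minimal serial sketch of that counting kernel follows — the 2D cyclic layout, MPI communication, and the paper's optimizations are omitted, and the adjacency representation here is an assumption chosen for brevity.

```python
def count_triangles(adj):
    """Count triangles by neighbor-set intersection.

    Each triangle {u, v, w} with u < v < w is counted exactly once by
    looking only 'forward' to higher-numbered neighbors of each vertex.
    """
    total = 0
    for u, neighbors in adj.items():
        higher_u = {v for v in neighbors if v > u}
        for v in higher_u:
            higher_v = {w for w in adj[v] if w > v}
            # Common forward neighbors of u and v each close one triangle.
            total += len(higher_u & higher_v)
    return total

# The complete graph on 4 vertices contains exactly 4 triangles.
k4 = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
```

Orienting edges from lower to higher vertex IDs is a standard trick that avoids counting each triangle six times; distributed algorithms typically apply the same orientation before partitioning the work.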
Automatic Differentiation for Adjoint Stencil Loops
Pub Date : 2019-07-05 DOI: 10.1145/3337821.3337906
J. Hückelheim, Navjot Kukreja, S. Narayanan, F. Luporini, G. Gorman, P. Hovland
Stencil loops are a common motif in computations including convolutional neural networks, structured-mesh solvers for partial differential equations, and image processing. Stencil loops are easy to parallelise, and their fast execution is aided by compilers, libraries, and domain-specific languages. Reverse-mode automatic differentiation, also known as algorithmic differentiation, autodiff, adjoint differentiation, or back-propagation, is sometimes used to obtain gradients of programs that contain stencil loops. Unfortunately, conventional automatic differentiation results in a memory access pattern that is not stencil-like and not easily parallelisable. In this paper we present a novel combination of automatic differentiation and loop transformations that preserves the structure and memory access pattern of stencil loops, while computing fully consistent derivatives. The generated loops can be parallelised and optimised for performance in the same way and using the same tools as the original computation. We have implemented this new technique in the Python tool PerforAD, which we release with this paper along with test cases derived from seismic imaging and computational fluid dynamics applications.
Stencil loops are a common motif in computations including convolutional neural networks, structured-mesh solvers for partial differential equations, and image processing. Stencil loops are easy to parallelise, and their fast execution is aided by compilers, libraries, and domain-specific languages. Reverse-mode automatic differentiation, also known as algorithmic differentiation, autodiff, adjoint differentiation, or back-propagation, is sometimes used to obtain gradients of programs that contain stencil loops. Unfortunately, conventional automatic differentiation yields a memory access pattern that is not stencil-like and not easily parallelisable. In this paper we present a novel combination of automatic differentiation and loop transformations that preserves the structure and memory access pattern of stencil loops while computing fully consistent derivatives. The generated loops can be parallelised and optimised for performance in the same way, and with the same tools, as the original computation. We have implemented this new technique in the Python tool PerforAD, which we release with this paper along with test cases derived from seismic imaging and computational fluid dynamics applications.
{"title":"Automatic Differentiation for Adjoint Stencil Loops","authors":"J. Hückelheim, Navjot Kukreja, S. Narayanan, F. Luporini, G. Gorman, P. Hovland","doi":"10.1145/3337821.3337906","DOIUrl":"https://doi.org/10.1145/3337821.3337906","url":null,"abstract":"Stencil loops are a common motif in computations including convolutional neural networks, structured-mesh solvers for partial differential equations, and image processing. Stencil loops are easy to parallelise, and their fast execution is aided by compilers, libraries, and domain-specific languages. Reverse-mode automatic differentiation, also known as algorithmic differentiation, autodiff, adjoint differentiation, or back-propagation, is sometimes used to obtain gradients of programs that contain stencil loops. Unfortunately, conventional automatic differentiation results in a memory access pattern that is not stencil-like and not easily parallelisable. In this paper we present a novel combination of automatic differentiation and loop transformations that preserves the structure and memory access pattern of stencil loops, while computing fully consistent derivatives. The generated loops can be parallelised and optimised for performance in the same way and using the same tools as the original computation. We have implemented this new technique in the Python tool PerforAD, which we release with this paper along with test cases derived from seismic imaging and computational fluid dynamics applications.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134470738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
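The observation underlying the paper above is that the adjoint of a stencil loop is itself a stencil with mirrored coefficients, so the gradient computation keeps the original regular memory access pattern. A minimal NumPy sketch of a 1D 3-point stencil and its hand-written adjoint — this only illustrates the idea and is not PerforAD's generated code:

```python
import numpy as np

def stencil(x, a, b, c):
    """Forward 3-point stencil on the interior:
    y[i] = a*x[i-1] + b*x[i] + c*x[i+1]."""
    y = np.zeros_like(x)
    y[1:-1] = a * x[:-2] + b * x[1:-1] + c * x[2:]
    return y

def stencil_adjoint(ybar, a, b, c):
    """Adjoint (reverse-mode derivative) of the stencil above.

    The coefficients are mirrored relative to the forward sweep, so the
    adjoint is itself a 3-point stencil and preserves the same regular,
    parallel-friendly access pattern.
    """
    xbar = np.zeros_like(ybar)
    xbar[:-2] += a * ybar[1:-1]   # contribution through a*x[i-1]
    xbar[1:-1] += b * ybar[1:-1]  # contribution through b*x[i]
    xbar[2:] += c * ybar[1:-1]    # contribution through c*x[i+1]
    return xbar
```

The adjoint can be checked with the dot-product identity ⟨Ax, v⟩ = ⟨x, Aᵀv⟩, which holds here to machine precision for any inputs.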
Journal
Proceedings of the 48th International Conference on Parallel Processing