
Proceedings of the 48th International Conference on Parallel Processing: Latest Publications

The Communication-Overlapped Hybrid Decomposition Parallel Algorithm for Multi-Scale Fluid Simulations
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337882
Yi Liu, Xiao-Wei Guo, Chao Li, Canqun Yang, X. Gan, P. Zhang, Yi Wang, Ran Zhao, Sijiang Fan
The MCDPar algorithm (Parallel algorithm for multi-scale simulations based on Mesh and BCF Decomposition) significantly reduces execution time and improves parallel scalability for multi-scale fluid simulations. However, a performance bottleneck still exists for extremely large-scale parallel simulations. In this paper, we design a communication-overlapped hybrid decomposition parallel algorithm to improve the performance of the original MCDPar on large-scale clusters. Through non-blocking communication and code scheduling, the communication overhead between the master and slave groups is overlapped with the master process's computation of additional microscopic configuration fields. Thus the parallel efficiency and scalability of the multi-scale solver can be improved in large-scale parallel simulations. In the test case with NBCF = 1000 configuration fields and Ncell = 64000 mesh cells, the communication percentage between the corresponding master and slave processes is reduced by 39.71%. In the test case with NBCF = 3000 and Ncell = 64000, the time cost of the fastest execution is reduced by 31.13% using the communication-overlapped algorithm, which extends parallel scaling from the original 128 cores to 256 cores.
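The overlap pattern itself is generic: post the master-slave transfers with non-blocking MPI calls, compute on communication-independent configuration fields while they are in flight, and only then wait. A minimal sketch of that pattern follows; the buffer and function names are illustrative assumptions, not MCDPar's actual interface.

```cpp
// Sketch of communication/computation overlap with non-blocking MPI.
// Names (peer, local_fields, ...) are hypothetical; the placeholder
// loop stands in for the micro-scale configuration-field update.
#include <mpi.h>
#include <vector>

void overlapped_exchange(std::vector<double>& send_buf,
                         std::vector<double>& recv_buf,
                         std::vector<double>& local_fields,
                         int peer, MPI_Comm comm) {
    MPI_Request reqs[2];
    // Post the master<->slave transfers first so they proceed in the background.
    MPI_Isend(send_buf.data(), static_cast<int>(send_buf.size()), MPI_DOUBLE,
              peer, /*tag=*/0, comm, &reqs[0]);
    MPI_Irecv(recv_buf.data(), static_cast<int>(recv_buf.size()), MPI_DOUBLE,
              peer, /*tag=*/0, comm, &reqs[1]);
    // Meanwhile, advance configuration fields that do not depend on the
    // in-flight data; this is what hides the communication cost.
    for (double& x : local_fields)
        x += 0.5 * x;  // placeholder micro-scale update
    // Block only after the independent work is done.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```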
Citations: 3
When Power Oversubscription Meets Traffic Flood Attack: Re-Thinking Data Center Peak Load Management
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337856
Xiaofeng Hou, Mingyu Liang, Chao Li, Wenli Zheng, Quan Chen, M. Guo
The state-of-the-art techniques for data center peak power management are too optimistic; they overestimate their benefits in a potentially insecure operating environment. Especially in data centers that oversubscribe power infrastructure, unexpected traffic can violate the power budget before an effective network DoS attack is observed. In this work, we are the first to investigate the joint effect of power throttling and traffic flooding. We characterize a special operating region in which DoS attacks can provoke undesirable power peaks without exhibiting network traffic anomalies. In this region, an attacker can trigger a power emergency by sending normal traffic over the Internet. We term this new type of threat DOPE (Denial of Power and Energy). We show that existing technologies are insufficient for eliminating DOPE without negative performance effects on legitimate users. To enhance data center resiliency, we propose a request-aware power management framework called Anti-DOPE. The key feature of Anti-DOPE is bridging the gap between network traffic control and server power management. Specifically, it pre-processes incoming requests to isolate malicious power attacks on the network load balancer side, and then post-processes compute-node performance to minimize the collateral damage they may cause. Anti-DOPE is orthogonal to prior power management schemes and requires only minor system modification. Using an Alibaba container trace, we show that Anti-DOPE allows 44% shorter average response time. It also improves the 90th percentile tail latency by 68.1% compared to the other power controlling methods.
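One way to picture the pre-processing stage is a load balancer that estimates the power implied by the current request rate and sheds the excess before the budget is violated. The linear power model and all names below are our illustrative assumptions, not the paper's actual mechanism.

```cpp
#include <cstddef>

// Hypothetical admission check: estimate the power draw implied by the
// observed request rate and compute how much load to defer or reroute
// before the oversubscribed budget is exceeded.
struct PowerBudgetFilter {
    double idle_watts;     // baseline draw of the server group (assumed)
    double watts_per_rps;  // assumed marginal power per request/second
    double budget_watts;   // oversubscribed power budget

    // Returns the requests-per-second that must be shed; 0 if within budget.
    double excess_rps(double observed_rps) const {
        double predicted = idle_watts + watts_per_rps * observed_rps;
        if (predicted <= budget_watts) return 0.0;
        return (predicted - budget_watts) / watts_per_rps;
    }
};
```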
Citations: 3
DICER
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337891
K. Nikas, Nikela Papadopoulou, Dimitra Giantsidi, Vasileios Karakostas, G. Goumas, N. Koziris
Workload consolidation has been shown to improve resource utilisation in modern datacentres. In this paper we focus on the extended problem of allocating resources when co-locating High-Priority (HP) and Best-Effort (BE) applications. Current approaches either neglect this prioritisation and focus on maximising the utilisation of the server, or favour HP execution, resulting in severe performance degradation for the BEs. We propose DICER, a novel, practical, dynamic cache partitioning scheme that adapts the last-level cache (LLC) allocation to the needs of the HP application and assigns spare cache resources to the BEs. Our evaluation reveals that DICER successfully increases the system's utilisation while at the same time minimising the impact of co-location on HP performance.
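A toy version of the partitioning decision: grant the HP application the LLC ways it currently demands and hand the remainder to the BEs. In practice the split would be programmed through a hardware mechanism such as Intel CAT, which is omitted here; the demand estimate and all names are our assumptions.

```cpp
#include <algorithm>

// Toy LLC split over ways_total cache ways (assumed >= 2);
// hp_demand would come from online profiling of the HP application.
struct LlcSplit { int hp_ways; int be_ways; };

LlcSplit partition_llc(int ways_total, int hp_demand) {
    // Keep at least one way on each side so neither class starves.
    int hp = std::clamp(hp_demand, 1, ways_total - 1);
    return {hp, ways_total - hp};
}

// Usage: partition_llc(20, estimated_hp_ways) on a 20-way LLC.
```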
{"title":"DICER","authors":"K. Nikas, Nikela Papadopoulou, Dimitra Giantsidi, Vasileios Karakostas, G. Goumas, N. Koziris","doi":"10.1145/3337821.3337891","DOIUrl":"https://doi.org/10.1145/3337821.3337891","url":null,"abstract":"Workload consolidation has been shown to achieve improved resource utilisation in modern datacentres. In this paper we focus on the extended problem of allocating resources when co-locating High-Priority (HP) and Best-Effort (BE) applications. Current approaches either neglect this prioritisation and focus on maximising the utilisation of the server or favour HP execution resulting to severe performance degradation for BEs. We propose DICER, a novel, practical, dynamic cache partitioning scheme that adapts the LLC allocation to the needs of the HP and assigns spare cache resources to the BEs. Our evaluation reveals that DICER successfully increases the system's utilisation, while at the same time minimising the impact of co-location on HP's performance.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"2016 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114613526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
RFPL
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337887
Gaoxiang Xu, Dan Feng, Zhipeng Tan, Xinyan Zhang, Jie Xu, Xing Shu, Yifeng Zhu
Parity-based RAID suffers from poor small-write performance due to heavy parity update overhead. The recently proposed method EPLOG constructs a new stripe with updated data chunks without updating old parity chunks. However, due to the skewness of data accesses, old versions of updated data chunks often need to be kept to protect other data chunks of the same stripe. This seriously hurts the efficiency of recovering the system from device failures, because the preserved old data chunks on failed devices must be reconstructed. In this paper, we propose a Recovery Friendly Parity Logging scheme, called RFPL, which minimizes the small-write penalty and provides high recovery performance for SSD RAID. The key idea of RFPL is to reduce the mixture of old and new data chunks in a stripe by exploiting the skewness of data accesses. RFPL constructs a new stripe with updated data chunks of the same old stripe. Since cold data chunks of the old stripe are rarely updated, it is likely that all of the data chunks written to the new stripe are hot data and will become old together within a short time span. This co-aging of data chunks in a stripe effectively reduces the total number of old data chunks that need to be preserved. We have implemented RFPL on a RAID-5 SSD array in Linux 4.3. Experimental results show that, compared with the Linux software RAID, RFPL reduces user I/O response time by 83.1% in the normal state and 81.6% in the reconstruction state. Compared with the state-of-the-art scheme EPLOG, RFPL reduces user I/O response time by 46.8% in the normal state and 40.9% in the reconstruction state. Our reliability analysis shows RFPL improves the mean time to data loss (MTTDL) by 9.36X and 1.44X compared with the Linux software RAID and EPLOG, respectively.
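RFPL's key step, building each new stripe from updated chunks of the same old stripe so that its members age together, can be sketched roughly as follows; the types, names, and "full group" flush policy are illustrative assumptions, not RFPL's actual implementation.

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct ChunkUpdate {
    uint64_t old_stripe_id;   // stripe the chunk originally belonged to
    uint64_t chunk_id;
    std::vector<char> data;
};

// Group pending updates by their old stripe; a group holding a stripe's
// worth of chunks is flushed as one new stripe, so its members become
// "old" together and few stale chunks must be preserved for recovery.
std::vector<std::vector<ChunkUpdate>>
form_new_stripes(std::vector<ChunkUpdate> pending,
                 std::size_t chunks_per_stripe) {
    std::map<uint64_t, std::vector<ChunkUpdate>> by_stripe;
    for (auto& u : pending)
        by_stripe[u.old_stripe_id].push_back(std::move(u));
    std::vector<std::vector<ChunkUpdate>> stripes;
    for (auto& kv : by_stripe)
        if (kv.second.size() >= chunks_per_stripe)
            stripes.push_back(std::move(kv.second));
    return stripes;
}
```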
{"title":"RFPL","authors":"Gaoxiang Xu, Dan Feng, Zhipeng Tan, Xinyan Zhang, Jie Xu, Xing Shu, Yifeng Zhu","doi":"10.1145/3337821.3337887","DOIUrl":"https://doi.org/10.1145/3337821.3337887","url":null,"abstract":"Parity based RAID suffers from poor small write performance due to heavy parity update overhead. The recently proposed method EPLOG constructs a new stripe with updated data chunks without updating old parity chunks. However, due to skewness of data accesses, old versions of updated data chunks often need to be kept to protect other data chunks of the same stripe. This seriously hurts the efficiency of recovering system from device failures due to the need of reconstructing the preserved old data chunks on failed devices. In this paper, we propose a Recovery Friendly Parity Logging scheme, called RFPL, which minimizes small write penalty and provides high recovery performance for SSD RAID. The key idea of RFPL is to reduce the mixture of old and new data chunks in a stripe by exploiting skewness of data accesses. RFPL constructs a new stripe with updated data chunks of the same old stripe. Since cold data chunks of the old stripe are rarely updated, it is likely that all of data chunks written to the new stripe are hot data and become old together within a short time span. This co-old of data chunks in a stripe effectively mitigates the total number of old data chunks which need to be preserved. We have implemented RFPL on a RAID-5 SSD array in Linux 4.3. Experimental results show that, compared with the Linux software RAID, RFPL reduces user I/O response time by 83.1% for normal state and 81.6% for reconstruction state. Compared with the state-of-the-art scheme EPLOG, RFPL reduces user I/O response time by 46.8% for normal state and 40.9% for reconstruction state. Our reliability analysis shows RFPL improves the mean time to data loss (MTTDL) by 9.36X and 1.44X compared with the Linux software RAID and EPLOG.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"679 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116107120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Runtime Adaptive Task Inlining on Asynchronous Multitasking Runtime Systems
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337915
Bibek Wagle, Mohammad Alaul Haque Monil, K. Huck, A. Malony, Adrian Serio, Hartmut Kaiser
As the era of high-frequency, single-core processors has come to a close, the new paradigm of many-core processors has come to dominate. In response to these systems, asynchronous multitasking runtime systems have been developed as a promising solution to efficiently utilize this newly available hardware. Asynchronous multitasking runtime systems work by dividing a problem into a large number of fine-grained tasks. However, as the number of tasks created increases, the overheads associated with task creation and management cannot be ignored. Task inlining, a method where the parent thread consumes a child thread, enables the runtime system to balance parallelism against its overhead. Because it is strongly affected by the processor architecture, the decision to inline a task is dynamic in nature. In this research, we present adaptive techniques for deciding, at runtime, whether a particular task should be inlined or not. We present two policies: a baseline policy that makes inlining decisions based on a fixed threshold, and an adaptive policy that sets the threshold dynamically at runtime. We also evaluate and justify the performance of these policies on different processor architectures. To the best of our knowledge, this is the first study of the impact of an adaptive runtime policy for task inlining in an asynchronous multitasking runtime system across different processor architectures. From experimentation, we find that the baseline policy improves execution time by 7.61% to 54.09%. Furthermore, the adaptive policy improves over the baseline policy by up to 74%.
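Both policies reduce to one predicate at spawn time: inline the child when its estimated cost falls below a threshold, with the baseline fixing that threshold and the adaptive policy moving it at runtime. A minimal sketch follows; the moving-average controller is our assumption, not the paper's exact rule.

```cpp
// Decide at spawn time whether to run the child inline in the parent
// thread (avoiding task-creation overhead) or spawn it for parallelism.
struct InliningPolicy {
    double threshold_us;           // tasks shorter than this are inlined
    bool adaptive;                 // false = baseline fixed threshold
    double avg_overhead_us = 0.0;  // running estimate of spawn overhead

    bool should_inline(double est_task_us) {
        if (adaptive) {
            // Illustrative controller: drift the threshold toward twice
            // the observed spawn overhead, so tasks cheaper than the
            // machinery needed to spawn them get inlined.
            threshold_us = 0.9 * threshold_us + 0.1 * (2.0 * avg_overhead_us);
        }
        return est_task_us < threshold_us;
    }

    void record_spawn_overhead(double measured_us) {
        avg_overhead_us = 0.9 * avg_overhead_us + 0.1 * measured_us;
    }
};
```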
Citations: 7
On Integration of Appends and Merges in Log-Structured Merge Trees
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337836
Caixin Gong, Shuibing He, Yili Gong, Yingchun Lei
As widely used indices in key-value stores, the Log-Structured Merge-tree (LSM-tree) and its variants suffer from severe write amplification due to frequent merges in compactions for write-intensive applications. To address the problem, we first propose the Log-Structured Append-tree (LSA-tree), which compacts data with appends instead of merges, significantly reducing write amplification and solving the issues that exist in current append trees. However, LSA increases read and space amplification. Based on LSA, we then design the Integrated Append/Merge-tree (IAM-tree). IAM selects appends or merges for compaction operations according to the size of the memory-cached data. Theoretical analysis shows that IAM reduces the write amplification of LSM while keeping the same read and space amplification. We implement IAM as a user library named IamDB. Experiments show that its write amplification is much less than that of LSM: only 8.71 vs. 19.00 for 1 TB of data with 64 GB of memory. Compared with well-tuned LevelDB and RocksDB, IamDB provides 1.4-2.7× and 1.6-1.9× better write throughput and saves 12% and 10% of disk space respectively, with comparable read and scan performance. Meanwhile, IamDB achieves the most stable tail latency.
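The central choice in IAM, append versus merge per compaction, trades write amplification against read amplification. Here is a minimal sketch of such a decision; the abstract says the criterion depends on the size of the memory-cached data, and the ratio form below is our illustration of one plausible rule, not IAM's actual one.

```cpp
#include <cstddef>

enum class Compaction { Append, Merge };

// Per-compaction choice: appending avoids rewriting the overlapping
// on-disk range (low write amplification) but leaves more places to
// search later (higher read amplification); merging does the reverse.
Compaction choose_compaction(std::size_t memcached_bytes,
                             std::size_t overlapping_disk_bytes,
                             double merge_ratio /* e.g. 0.25, assumed */) {
    if (memcached_bytes < merge_ratio * overlapping_disk_bytes)
        return Compaction::Append;  // small flush: rewriting would be wasteful
    return Compaction::Merge;       // large flush: merging keeps reads cheap
}
```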
Citations: 6
HOPE
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337899
M. Yasugi, Daisuke Muraoka, Tasuku Hiraishi, Seiji Umatani, Kento Emoto
This paper presents a new approach to fault-tolerant language systems without a single point of failure for irregular parallel applications. Work-stealing frameworks provide good load balancing for many parallel applications, including irregular ones written in a divide-and-conquer style. However, work-stealing frameworks with fault-tolerant features such as checkpointing do not always work well. This paper proposes the completely opposite "work omission" paradigm, refined into a "hierarchical omission"-based parallel execution model called HOPE. The HOPE programmer's task is to specify which regions of imperative code can be executed in a sequential but arbitrary order and how their partial results can be accessed. HOPE workers spawn no tasks or threads at all; rather, every worker holds the entire work of the program with its own planned execution order, and the workers and the underlying message mediation systems automatically exchange partial results to omit hierarchical subcomputations. Even with fault tolerance, the HOPE framework provides parallel speedups for many parallel applications, including irregular ones.
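The omission idea can be pictured in a divide-and-conquer skeleton: before descending into a subtree, a worker checks whether another worker's result for that subtree has already arrived and, if so, omits the whole subcomputation. Below is a single-threaded toy sketch with a map standing in for the message mediation layer; the deterministic subproblem ids and all names are our assumptions, not HOPE's actual design.

```cpp
#include <optional>
#include <unordered_map>

// Toy stand-in for the mediation layer: results of finished subtrees,
// keyed by a subproblem id computed identically by every worker.
std::unordered_map<long, long> received_results;

std::optional<long> lookup_remote(long id) {
    auto it = received_results.find(id);
    if (it == received_results.end()) return std::nullopt;
    return it->second;
}

// Divide-and-conquer with hierarchical omission: every worker walks the
// whole tree (in its own planned order) but skips any subtree whose
// result another worker has already published. Here: sum of lo..hi-1.
long solve(long id, long lo, long hi) {
    if (auto done = lookup_remote(id)) return *done;  // omit this subtree
    if (hi - lo == 1) return lo;                      // leaf: base case
    long mid = lo + (hi - lo) / 2;
    long left  = solve(2 * id,     lo,  mid);
    long right = solve(2 * id + 1, mid, hi);
    long result = left + right;
    received_results[id] = result;  // publish for the other workers
    return result;
}
```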
{"title":"HOPE","authors":"M. Yasugi, Daisuke Muraoka, Tasuku Hiraishi, Seiji Umatani, Kento Emoto","doi":"10.1145/3337821.3337899","DOIUrl":"https://doi.org/10.1145/3337821.3337899","url":null,"abstract":"This paper presents a new approach to fault-tolerant language systems without a single point of failure for irregular parallel applications. Work-stealing frameworks provide good load balancing for many parallel applications, including irregular ones written in a divide-and-conquer style. However, work-stealing frameworks with fault-tolerant features such as checkpointing do not always work well. This paper proposes a completely opposite \"work omission\" paradigm and its more detailed concept as a \"hierarchical omission\"-based parallel execution model called HOPE. HOPE programmers' task is to specify which regions in imperative code can be executed in sequential but arbitrary order and how their partial results can be accessed. HOPE workers spawn no tasks/threads at all; rather, every worker has the entire work of the program with its own planned execution order, and then the workers and the underlying message mediation systems automatically exchange partial results to omit hierarchical subcomputations. Even with fault tolerance, the HOPE framework provides parallel speedups for many parallel applications, including irregular ones.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"300 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123126380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
PhSIH
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337859
Zhengyu Liao, Shiyou Qian, Jian Cao, Yanhua Cao, Guangtao Xue, Jiadi Yu, Yanmin Zhu, Minglu Li
The matching algorithm is a critical component of a content-based publish/subscribe system, and its performance directly affects the QoS of the whole system. Aiming to improve and stabilize matching performance, we propose a lightweight parallelization method called PhSIH on the basis of three existing algorithms. PhSIH fulfills Parallelization by horizontally Segmenting the Indexing Hierarchy of data structures to support multiple threads performing matching tasks in parallel on a common data structure. PhSIH can adaptively adjust the degree of parallelism according to changing workloads in order to meet performance requirements. The main work of PhSIH concerns dynamically adjusting the degree of parallelism and computing a task allocation solution for the parallel threads. PhSIH is implemented in Apache Kafka to augment it into a content-based publish/subscribe system, which makes Kafka suitable for real-time fine-grained event dissemination scenarios such as stock ticks. To evaluate the parallelization effect and adaptability of PhSIH, a series of experiments is conducted on synthetic and real-world data. The experimental results demonstrate that PhSIH achieves a good parallelization effect on the three existing algorithms and possesses a desirable adaptability that stabilizes the performance of the matching algorithms.
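Horizontal segmentation can be sketched as slicing one level of the index into disjoint segments and letting each thread match an incoming event against its own slice. The flat interval-subscription index below is an illustrative stand-in for the three algorithms' real data structures; all names are our assumptions.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

struct Subscription { int attr; int low, high; };  // interval predicate
struct Event        { int attr; int value; };

// Match one event against a horizontally segmented index: each of
// `degree` (assumed >= 1) threads scans a disjoint slice in parallel.
std::size_t parallel_match(const std::vector<Subscription>& index,
                           const Event& ev, unsigned degree) {
    std::atomic<std::size_t> hits{0};
    std::vector<std::thread> pool;
    std::size_t seg = (index.size() + degree - 1) / degree;
    for (unsigned t = 0; t < degree; ++t) {
        pool.emplace_back([&, t] {
            std::size_t begin = t * seg;
            std::size_t end = std::min(index.size(), begin + seg);
            std::size_t local = 0;  // count locally, publish once
            for (std::size_t i = begin; i < end; ++i)
                if (index[i].attr == ev.attr &&
                    index[i].low <= ev.value && ev.value <= index[i].high)
                    ++local;
            hits += local;
        });
    }
    for (auto& th : pool) th.join();
    return hits;
}
```

Adjusting the degree of parallelism then amounts to choosing `degree` per workload, which is the part PhSIH does adaptively.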
{"title":"PhSIH","authors":"Zhengyu Liao, Shiyou Qian, Jian Cao, Yanhua Cao, Guangtao Xue, Jiadi Yu, Yanmin Zhu, Minglu Li","doi":"10.1145/3337821.3337859","DOIUrl":"https://doi.org/10.1145/3337821.3337859","url":null,"abstract":"The matching algorithm is a critical component of the content-based publish/subscribe system, whose performance has direct effects on the QoS of the whole system. Aiming to improve and stabilize the matching performance, we propose a lightweight parallelization method called PhSIH on the basis of three existing algorithms. PhSIH fulfills Parallelization by horizontally Segmenting the Indexing Hierarchy of data structures to support multiple threads performing matching tasks in parallel on a common data structure. PhSIH can adaptively adjust the degree of parallelism according to the changing workloads in order to meet the performance requirement. The main work of PhSIH concerns dynamically adjusting the degree of parallelism and computing a task allocation solution for parallel threads. PhSIH is implemented in Apache Kafka to augment it as a content-based publish/subscribe system, which makes Kafka suitable for real-time fine-grained event dissemination scenarios, such as stock ticks. To evaluate the parallelization effect and adaptability of PhSIH, a series of experiments are conducted based on synthetic and real-world data. The experiment results demonstrate that PhSIH achieves a good parallelization effect on the three existing algorithms and possesses a desirable adaptability that stabilizes the performance of the matching algorithms.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126259015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
AVR
Pub Date: 2019-08-05 DOI: 10.1016/b978-0-7506-5635-1.x5018-3
{"title":"AVR","authors":"","doi":"10.1016/b978-0-7506-5635-1.x5018-3","DOIUrl":"https://doi.org/10.1016/b978-0-7506-5635-1.x5018-3","url":null,"abstract":"","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125698812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Accelerating All-Edge Common Neighbor Counting on Three Processors
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337917
Yulin Che, Zhuohang Lai, Shixuan Sun, Qiong Luo, Yue Wang
We propose to accelerate an important but time-consuming operation in online graph analytics: counting the common neighbors of each pair of adjacent vertices (u,v), i.e., each edge (u,v), on three modern processors of different architectures. We study two representative algorithms for this problem: (1) a merge-based pivot-skip algorithm (MPS) that intersects the two sets of neighbor vertices of each edge (u,v) to obtain the count; and (2) a bitmap-based algorithm (BMP), which dynamically constructs a bitmap index on the neighbor set of each vertex u and, for each neighbor v of u, looks up v's neighbors in u's bitmap. We parallelize and optimize both algorithms on a multicore CPU, an Intel Xeon Phi Knights Landing processor (KNL), and an NVIDIA GPU. Our experiments show that (1) both the CPU and the GPU favor BMP, whereas MPS wins on the KNL; (2) across all datasets, the best performer is either MPS on the KNL or BMP on the GPU; and (3) our optimized algorithms can complete the operation within tens of seconds on billion-edge Twitter graphs, enabling online analytics.
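Both counting strategies are easy to state on sorted adjacency lists. The sketch below shows sequential versions of each, assuming a sorted-vector graph representation (the parallel and pivot-skip/galloping refinements from the paper are omitted): the merge style walks both lists in lockstep, while the bitmap style pays one bitmap build per vertex u for O(1) membership tests.

```cpp
#include <cstddef>
#include <vector>

// MPS-flavored intersection of two sorted neighbor lists: advance the
// smaller head; pivot-skip variants gallop instead of stepping by one.
std::size_t count_merge(const std::vector<int>& nu,
                        const std::vector<int>& nv) {
    std::size_t i = 0, j = 0, cnt = 0;
    while (i < nu.size() && j < nv.size()) {
        if      (nu[i] < nv[j]) ++i;
        else if (nu[i] > nv[j]) ++j;
        else { ++cnt; ++i; ++j; }   // common neighbor found
    }
    return cnt;
}

// BMP-flavored counting: mark u's neighbors in a |V|-sized bitmap once,
// then test each of v's neighbors with a constant-time lookup.
std::size_t count_bitmap(const std::vector<int>& nu,
                         const std::vector<int>& nv,
                         std::vector<bool>& bitmap /* all false on entry */) {
    for (int w : nu) bitmap[w] = true;
    std::size_t cnt = 0;
    for (int w : nv) cnt += bitmap[w] ? 1 : 0;
    for (int w : nu) bitmap[w] = false;  // reset so the bitmap can be reused
    return cnt;
}
```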
Citations: 4