
Latest Publications: 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

FaultyRank: A Graph-based Parallel File System Checker
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00029
Saisha Kamat, Abdullah Al Raqibul Islam, Mai Zheng, Dong Dai
Similar to local file system checkers such as e2fsck for Ext4, a parallel file system (PFS) checker ensures the file system's correctness. The basic idea of file system checkers is straightforward: important metadata are stored redundantly in separate places for cross-checking; inconsistent metadata will be repaired or overwritten by its 'more correct' counterpart, as defined by the developers. Unfortunately, implementing this idea for PFSes is non-trivial due to system complexity. Although many popular parallel file systems already contain dedicated checkers (e.g., LFSCK for Lustre, BeeGFS-FSCK for BeeGFS, mmfsck for GPFS), the existing checkers often cannot detect or repair inconsistencies accurately due to one fundamental limitation: they rely on a fixed set of consistency rules predefined by developers, which cannot cover the various failure scenarios that may occur in practice. In this study, we propose a new graph-based method to build PFS checkers. Specifically, we model important PFS metadata into graphs, then generalize the logic of cross-checking and repairing into graph analytic tasks. We design a new graph algorithm, FaultyRank, to quantitatively calculate the correctness of each metadata object. By leveraging the calculated correctness, we are able to recommend the most promising repairs to users. Based on this idea, we implement a prototype of FaultyRank on Lustre, one of the most widely used parallel file systems, and compare it with Lustre's default file system checker, LFSCK. Our experiments show that FaultyRank can achieve the same checking and repairing logic as LFSCK. Moreover, it is capable of detecting and repairing complicated PFS consistency issues that LFSCK cannot handle. We also show the performance advantage of FaultyRank compared with LFSCK. Through this study, we believe FaultyRank opens a new opportunity for building PFS checkers effectively and efficiently.
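The abstract does not spell out the algorithm, but the described idea of propagating "correctness" over a metadata cross-reference graph can be sketched with a PageRank-style iteration. The graph, damping parameter, and scoring below are illustrative assumptions, not FaultyRank's actual rules:

```python
# Hypothetical sketch of PageRank-style correctness propagation over a
# metadata cross-reference graph; the real FaultyRank algorithm may differ.

def faulty_rank(edges, num_nodes, iters=20, damping=0.85):
    """edges: list of (src, dst) cross-references between metadata objects.
    Returns one score per node; low scores flag suspect metadata."""
    score = [1.0] * num_nodes
    out_deg = [0] * num_nodes
    for s, _ in edges:
        out_deg[s] += 1
    for _ in range(iters):
        incoming = [0.0] * num_nodes
        for s, d in edges:
            # each object endorses the objects it cross-references
            incoming[d] += score[s] / out_deg[s]
        score = [(1 - damping) + damping * x for x in incoming]
    return score

# A metadata object that nothing cross-references (node 3) scores lowest,
# making it the most promising repair candidate:
scores = faulty_rank([(0, 1), (1, 0), (1, 2), (2, 1), (3, 0)], 4)
assert scores.index(min(scores)) == 3
```

With real PFS metadata, nodes would correspond to inodes, directory entries, and object references, and edges to the redundant cross-links the checker verifies.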
Citations: 1
IPDPS 2023 Technical Program Committee
Pub Date : 2023-05-01 DOI: 10.1109/ipdps54959.2023.00009
Citations: 0
ZFP-X: Efficient Embedded Coding for Accelerating Lossy Floating Point Compression
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00107
Bing Lu, Yida Li, Junqi Wang, Huizhang Luo, Kenli Li
Today’s scientific simulations are confronting severely limited I/O bandwidth, network bandwidth, and storage capacity because of the immense volumes of data generated on high-performance computing systems. Data compression has emerged as one of the most effective approaches to the exponential growth of scientific data. However, existing state-of-the-art compressors are also confronting the issue of low throughput, especially under the trend of growing disparities between compute and I/O rates. Among their components, embedded coding is widely applied and accounts for the dominant share of the running time of the corresponding compressors. In this work, we propose a new kind of embedded coding algorithm and apply it as the backend embedded coding of ZFP, one of the most successful lossy compressors. Our embedded coding algorithm uses bit groups instead of bit planes to store the compressed data, avoiding the time overhead of generating bit planes and of the group tests over bit planes, which significantly reduces the running time of ZFP. Our embedded coding algorithm can also accelerate the decompression of ZFP, because the costly procedures of reversing the group tests and reconstructing bit planes are likewise avoided. Moreover, we provide a theoretical proof that the proposed coding algorithm achieves the same compression ratio as the baseline ZFP. Experiments with four representative real-world scientific simulation datasets show that the compression and decompression throughput of our solution reaches up to 2.5× (2.1× on average) and up to 2.1× (1.5× on average) that of ZFP, respectively.
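A toy illustration of the layout difference the paper exploits: bit-plane coding transposes bits across all coefficients (requiring per-plane generation and group tests), while bit-group coding keeps each coefficient's bits contiguous. This sketch is a simplification of the real ZFP/ZFP-X coders, which operate on transformed floating-point blocks:

```python
# Contrast bit-plane vs bit-group layouts for a block of nonnegative
# integer coefficients (a simplification of ZFP's embedded coding).

def bit_planes(coeffs, nbits):
    """Transpose: plane p collects bit p of every coefficient (MSB first).
    Building these planes is the overhead ZFP-X avoids."""
    return [[(c >> p) & 1 for c in coeffs] for p in range(nbits - 1, -1, -1)]

def bit_groups(coeffs, nbits, group=4):
    """Keep each coefficient's bits together, emitted in fixed-size groups,
    so no cross-coefficient transpose is needed."""
    out = []
    for c in coeffs:
        bits = [(c >> p) & 1 for p in range(nbits - 1, -1, -1)]
        out.extend(bits[i:i + group] for i in range(0, nbits, group))
    return out
```

Both layouts are lossless reorderings of the same bits, which is consistent with the paper's claim of an unchanged compression ratio; only the work needed to produce (and undo) the layout differs.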
Citations: 0
Opportunities and Limitations of Hardware Timestamps in Concurrent Data Structures
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00068
Olivia Grimes, J. Nelson-Slivon, A. Hassan, R. Palmieri
Designing high-performance, highly concurrent linearizable data structures is complex, especially when bulk operations (e.g., range queries) are included. Relying on a single source of synchronization, such as a logical global timestamp, unequivocally eases the design of the synchronization scheme. However, such a design creates a single point of contention and thus carries performance downsides. As a result, designers often face a dilemma between a simple design and a performance bottleneck. Recently, modern commodity architectures have introduced low-level mechanisms that guarantee that the timestamp registers of all CPUs are synchronized, thus enabling the use of hardware timestamps in data structure designs. Although recent work already exploits this, our work aims at understanding the opportunities and limitations of using hardware timestamps in existing data structure designs. We address this challenge by applying hardware timestamping to three recent state-of-the-art algorithms that use logical timestamps to support range queries in concurrent data structures. Our evaluation shows that the use of hardware timestamps does indeed improve performance compared to the original designs, achieving up to a 5.5× improvement. More importantly, by removing the bottleneck of global logical timestamps in these algorithms, we highlight the design choices that most significantly impact the use of hardware timestamps. Specifically, we show that the mechanism for labeling objects with timestamps plays an important role in maximizing the benefits of leveraging hardware timestamps.
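As a minimal sketch of the idea, the snippet below labels every update with a timestamp and lets a range query fix a snapshot point with a single timestamp read. `time.monotonic_ns()` stands in for the synchronized hardware timestamp register (e.g., the x86 TSC read via RDTSCP) that the real designs use; the data structure itself is a deliberately simplified toy:

```python
# Toy sketch of timestamp-labeled versions enabling a consistent range
# query, with time.monotonic_ns() standing in for a synchronized hardware
# timestamp register.
import time

class VersionedMap:
    def __init__(self):
        self.versions = {}  # key -> list of (timestamp, value), append-only

    def put(self, key, value):
        ts = time.monotonic_ns()          # label the update with a timestamp
        self.versions.setdefault(key, []).append((ts, value))
        return ts

    def range_query(self, keys):
        snap = time.monotonic_ns()        # one read fixes the snapshot point
        out = {}
        for k in keys:
            # newest version not after the snapshot timestamp
            past = [v for ts, v in self.versions.get(k, []) if ts <= snap]
            if past:
                out[k] = past[-1]
        return out

m = VersionedMap()
m.put("a", 1); m.put("b", 2)
assert m.range_query(["a", "b", "c"]) == {"a": 1, "b": 2}
```

The appeal of the hardware register is precisely that the snapshot read involves no shared memory, removing the contention point that a logical global timestamp creates.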
Citations: 0
Exploiting Input Tensor Dynamics in Activation Checkpointing for Efficient Training on GPU
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00025
Jian Liao, Mingzhen Li, Hailong Yang, Qingxiao Sun, Biao Sun, Jiwei Hao, Tianyu Feng, F. Yu, Shengdong Chen, Ye Tao, Zicheng Zhang, Zhongzhi Luan, D. Qian
Larger deep learning models usually lead to higher model quality, but at the cost of an ever-increasing GPU memory footprint. Although several tensor checkpointing techniques have been proposed to enable training under a restricted GPU memory budget, they fail to exploit the input tensor dynamics arising from diverse datasets and subsequent data augmentation, and thus leave training optimization opportunities on the table. In this paper, we propose Mimose, an input-aware tensor checkpointing planner that respects the memory budget while enabling efficient model training on GPU. Mimose builds a lightweight but accurate prediction model of GPU memory usage online, without pre-analyzing the model. It generates a tensor checkpointing plan based on per-layer memory prediction and applies it to the training process on the fly. Our experiments show that Mimose achieves superior training throughput compared to state-of-the-art checkpointing frameworks under the same GPU memory budgets.
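One plausible shape of such a planner, sketched here as a hypothetical illustration rather than Mimose's actual algorithm: given predicted per-layer activation sizes for the current input, greedily mark the largest activations for recomputation until the predicted footprint fits the budget.

```python
# Hypothetical budget-respecting checkpoint planner: activations marked for
# recomputation are freed in the forward pass and rebuilt during backward.

def plan_checkpoints(pred_sizes, budget):
    """pred_sizes: predicted activation memory per layer (e.g., MB) for the
    current input batch. Returns indices of layers to recompute."""
    keep = list(pred_sizes)
    recompute = []
    # drop the largest activations first until the total fits the budget
    for i in sorted(range(len(keep)), key=lambda i: keep[i], reverse=True):
        if sum(keep) <= budget:
            break
        recompute.append(i)
        keep[i] = 0
    return sorted(recompute)

# e.g. layers predicted at [30, 120, 60, 90] MB under a 150 MB budget:
assert plan_checkpoints([30, 120, 60, 90], 150) == [1, 3]
```

Because the plan is computed from predictions for each input, it can adapt per batch, which is the "input-aware" aspect the abstract emphasizes.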
Citations: 0
Towards Faster Fully Homomorphic Encryption Implementation with Integer and Floating-point Computing Power of GPUs
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00085
Guang Fan, Fangyu Zheng, Lipeng Wan, Lili Gao, Yuan Zhao, Jiankuo Dong, Yixuan Song, Yuewu Wang, Jingqiang Lin
Fully Homomorphic Encryption (FHE) allows computations on encrypted data without knowledge of the plaintext message and is currently a focus of both academia and industry. However, its performance hinders large-scale application, highlighting the urgent need for high-performance FHE implementations. Noticing the tremendous potential of GPUs in the field of cryptographic acceleration, this paper comprehensively investigates how to convert the available computing resources residing in GPUs into FHE workhorses, and implements a full set of low-level and middle-level FHE primitives based on two arithmetic units (i.e., INT32 and FP64 units) with three types of data precision (i.e., INT32, INT64, and FP64). This paper gives a comprehensive evaluation and comparison of each roadmap. Our implementations of fundamental functions outperform previous implementations on the same platform by 1.7× to 16.7×. Taking the CKKS FHE scheme as a case study, our implementation of homomorphic multiplication achieves a 3.2× speedup over the state-of-the-art GPU-based implementation, even considering the differences between platforms. The detailed evaluation and comparison in this paper offer a vital reference for follow-up work in choosing appropriate underlying arithmetic units and important primitive optimizations in GPU-based FHE implementations.
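To give a flavor of repurposing FP64 units for the modular arithmetic at the core of FHE primitives, here is a toy modular multiplication that is exact whenever the product fits an FP64 mantissa (moduli below about 2^26). Production GPU implementations use far more careful techniques; this is only an illustration of the unit-repurposing idea:

```python
# Toy FP64-based modular multiplication: for a*b < 2^52 the product is
# representable exactly in a double, so reduction via an FP64 divide is
# correct up to a +/-1 quotient error, which the fix-up below absorbs.

def modmul_fp64(a, b, p):
    prod = float(a) * float(b)      # exact when a*b fits the 52-bit mantissa
    q = float(int(prod / p))        # quotient estimate from the FP64 divide
    r = prod - q * p                # candidate remainder
    if r < 0:                       # correct a quotient that rounded up...
        r += p
    elif r >= p:                    # ...or down by one
        r -= p
    return int(r)

p = 40961  # small illustrative modulus (choice is arbitrary here)
for a, b in [(12345, 6789), (40960, 40960), (1, 0)]:
    assert modmul_fp64(a, b, p) == (a * b) % p
```

On GPUs the attraction is that this keeps otherwise idle FP64 units busy alongside the integer pipelines, which is the resource-conversion theme of the paper.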
Citations: 0
Keynote: The Adventurous Life of a System Software Researcher
Pub Date : 2023-05-01 DOI: 10.1109/ipdps54959.2023.00043
Citations: 0
Fast Sparse GPU Kernels for Accelerated Training of Graph Neural Networks
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00057
Ruibo Fan, Wei Wang, X. Chu
Graph Neural Networks (GNNs) have been gaining huge traction recently as they achieve state-of-the-art performance on various graph-related problems. GNN training typically follows the standard message passing paradigm, in which SpMM and SDDMM are the two essential sparse kernels. However, existing sparse GPU kernels are inefficient and may suffer from load imbalance, the dynamics of GNN computing, poor memory efficiency, and the tail effect. We propose two new kernels, Hybrid-Parallel SpMM (HP-SpMM) and Hybrid-Parallel SDDMM (HP-SDDMM), that efficiently perform SpMM and SDDMM on GPUs with a unified hybrid parallel strategy that mixes nodes and edges. In view of emerging graph-sampling training, we design the Dynamic Task Partition (DTP) method to minimize the tail effect by exposing sufficient parallelism. We further devise the Hierarchical Vectorized Memory Access scheme to achieve aligned global memory accesses and enable vectorized instructions for improved memory efficiency. We also propose to enhance data locality by reordering the graphs with the Graph Clustering method. Experiments on an extensive set of sparse matrices collected from real GNN applications demonstrate that our kernels achieve significant performance improvements over state-of-the-art implementations. We implement our sparse kernels in popular GNN frameworks and use them to train various GNN models, including the GCN model in full-graph mode and the GraphSAINT model in graph-sampling mode. Evaluation results show that our kernels can accelerate GNN training by up to 1.72×.
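For reference, the SpMM kernel at the heart of message passing computes a sparse-dense matrix product. A minimal sequential CSR version (which HP-SpMM parallelizes across GPU threads with its hybrid node/edge strategy) looks like:

```python
# Minimal CSR-based SpMM: out = A @ B where A is sparse (indptr, indices,
# data in CSR form) and B is dense. In GNN terms, each row of A is a node
# and its nonzeros are the incoming edges whose neighbor features it sums.

def spmm_csr(indptr, indices, data, dense):
    n_rows = len(indptr) - 1
    n_cols = len(dense[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):                        # one node per row
        for e in range(indptr[i], indptr[i + 1]):  # its incident edges
            j, w = indices[e], data[e]
            for c in range(n_cols):
                out[i][c] += w * dense[j][c]
    return out

# A = [[1, 0], [2, 3]] in CSR; B = all-ones 2x2
assert spmm_csr([0, 1, 3], [0, 0, 1], [1.0, 2.0, 3.0],
                [[1.0, 1.0], [1.0, 1.0]]) == [[1.0, 1.0], [5.0, 5.0]]
```

The load-imbalance problem the paper targets is visible even here: the inner edge loop length varies per row, so assigning one thread per row (pure node parallelism) leaves threads with high-degree rows lagging, which motivates mixing in edge-level parallelism.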
Citations: 0
Lyra: Fast and Scalable Resilience to Reordering Attacks in Blockchains
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00097
Pouriya Zarbafian, V. Gramoli
Reordering blockchain transactions to manipulate markets has profited hackers by hundreds of millions of dollars. Because they rely on State Machine Replication (SMR), blockchains order transactions without preventing hackers from influencing the chosen order. Some order-fair consensus protocols, like Pompē [33], order transactions before agreeing on this order. They are insufficient because a hacker can leverage the lack of triangle inequality among network latencies to observe pending transactions before issuing their own. Other DAG-based protocols, like Fino [24], use commit-reveal to obfuscate transactions, but cannot prevent reordering by a Byzantine leader. In this paper, we present Lyra, a protocol that solves this problem. The key idea is the combination of a commit-reveal protocol to obfuscate transaction payloads and a leaderless ordered consensus protocol that predicts the order of transactions. Lyra has optimal good-case latency, prevents reordering attacks, and is scalable. Finally, it outperforms the latency of Pompē by up to 2 times and its throughput by up to 7 times on a 100-node network over 3 continents.
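The commit-reveal ingredient can be sketched in a few lines: a client first publishes a binding hash of its payload, the order is fixed over commitments, and the payload is revealed (and verified against the commitment) only afterwards, so no observer can front-run based on content. The snippet shows only the cryptographic binding; Lyra's leaderless ordered consensus layer is out of scope here:

```python
# Commit-reveal binding: publish H(nonce || payload) now, reveal later.
import hashlib, os

def commit(payload: bytes):
    nonce = os.urandom(16)  # random nonce hides low-entropy payloads
    digest = hashlib.sha256(nonce + payload).hexdigest()
    return digest, nonce    # publish digest; keep (payload, nonce) secret

def verify_reveal(digest: str, nonce: bytes, payload: bytes) -> bool:
    return hashlib.sha256(nonce + payload).hexdigest() == digest

d, n = commit(b"swap 100 ETH")
assert verify_reveal(d, n, b"swap 100 ETH")      # honest reveal accepted
assert not verify_reveal(d, n, b"swap 999 ETH")  # altered payload rejected
```

Because the commitment is binding, a client cannot change its transaction after seeing others' commitments, and because it is hiding, nobody can order transactions based on their contents.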
Citations: 2
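The commit-reveal obfuscation that the abstract attributes to Fino and Lyra can be illustrated with a minimal hash-based sketch. The function names and the SHA-256/nonce scheme below are assumptions for illustration only, not Lyra's actual implementation, which additionally combines the reveal phase with a leaderless ordered consensus protocol:

```python
import hashlib
import secrets

def commit(payload: bytes) -> tuple[bytes, bytes]:
    """Create a hiding commitment to a transaction payload.

    The digest is broadcast and ordered; (payload, nonce) is revealed
    only after the order has been fixed, so observers cannot inspect
    pending transactions and front-run them.
    """
    nonce = secrets.token_bytes(32)
    digest = hashlib.sha256(nonce + payload).digest()
    return digest, nonce

def verify_reveal(digest: bytes, payload: bytes, nonce: bytes) -> bool:
    """Check that a revealed payload matches its earlier commitment."""
    return hashlib.sha256(nonce + payload).digest() == digest

# Commit first, agree on an order over digests, then reveal.
digest, nonce = commit(b"swap 10 ETH for USDC")
assert verify_reveal(digest, b"swap 10 ETH for USDC", nonce)
assert not verify_reveal(digest, b"front-run: buy USDC first", nonce)
```

The random nonce makes the commitment hiding (the digest leaks nothing about the payload) while the hash makes it binding (a committer cannot later reveal a different payload).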
Drill: Log-based Anomaly Detection for Large-scale Storage Systems Using Source Code Analysis
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00028
Di Zhang, Chris Egersdoerfer, Tabassum Mahmud, Mai Zheng, Dong Dai
Large-scale storage systems, a critical part of modern computing systems, are subject to various runtime bugs, failures, and anomalies in production. Identifying their anomalies at runtime is thus critical for users and administrators. Since runtime logs record the important status of the systems, log-based anomaly detection has been studied extensively for timely identification of system malfunctions. However, existing log-based anomaly detection solutions share common limitations in representing log entries accurately and robustly, and hence cannot effectively handle log entries that were not seen in the historical logs, which is a common real-world scenario due to logs' inherent rarity and the continuous evolution of the systems. To address the issues of existing methods, we propose Drill, a new log pre-processing method to generate high-quality vector representations of runtime logs by leveraging both storage system-specific sentiment-classifying language models and log contexts built from the source code.
Through extensive evaluations of two representative distributed storage systems (Apache HDFS and Lustre), we show that Drill can achieve up to 41% improvement when compared with state-of-the-art anomaly detection solutions, showing it is a promising solution for general anomaly detection.
Citations: 1
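The pipeline the Drill abstract describes, embedding each log line as a vector that up-weights sentiment-bearing tokens and flagging lines whose embedding sits far from a profile of normal logs, can be sketched in miniature. The token list, weights, and threshold below are invented for illustration; Drill's actual representations come from a storage-specific language model and source-code contexts:

```python
import math
from collections import Counter

# Toy negative-sentiment vocabulary; Drill derives sentiment from a
# storage-system-specific language model rather than a fixed word list.
NEGATIVE_TOKENS = {"error", "failed", "timeout", "corrupt", "denied"}

def embed(log_line: str) -> dict[str, float]:
    """Map a log line to a sparse token-count vector, up-weighting
    sentiment-bearing tokens so unseen-but-negative lines stand out."""
    counts = Counter(log_line.lower().split())
    return {tok: c * (3.0 if tok in NEGATIVE_TOKENS else 1.0)
            for tok, c in counts.items()}

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_anomalous(line: str, normal_profile: dict[str, float],
                 threshold: float = 0.3) -> bool:
    """Flag a line whose embedding is far from the normal-log profile."""
    return cosine(embed(line), normal_profile) < threshold

normal = embed("osd heartbeat ok replication complete")
assert not is_anomalous("osd heartbeat ok", normal)
assert is_anomalous("replication failed timeout on osd", normal)
```

Because unseen tokens simply become new vector dimensions, a never-before-seen error line still scores as distant from the normal profile, which is the property the abstract says fixed rule sets and purely historical matching lack.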
Journal
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)