FaultyRank: A Graph-based Parallel File System Checker
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00029
Saisha Kamat, Abdullah Al Raqibul Islam, Mai Zheng, Dong Dai
Similar to local file system checkers such as e2fsck for Ext4, a parallel file system (PFS) checker ensures the file system's correctness. The basic idea of file system checkers is straightforward: important metadata are stored redundantly in separate places for cross-checking; inconsistent metadata will be repaired or overwritten by their 'more correct' counterparts, as defined by the developers. Unfortunately, implementing this idea for PFSes is non-trivial due to system complexity. Although many popular parallel file systems already contain dedicated checkers (e.g., LFSCK for Lustre, BeeGFS-FSCK for BeeGFS, mmfsck for GPFS), existing checkers often cannot detect or repair inconsistencies accurately due to one fundamental limitation: they rely on a fixed set of consistency rules predefined by developers, which cannot cover the varied failure scenarios that occur in practice. In this study, we propose a new graph-based method for building PFS checkers. Specifically, we model important PFS metadata as graphs, then generalize the logic of cross-checking and repairing into graph analytic tasks. We design a new graph algorithm, FaultyRank, to quantitatively calculate the correctness of each metadata object. By leveraging the calculated correctness, we can recommend the most promising repairs to users. Based on this idea, we implement a prototype of FaultyRank on Lustre, one of the most widely used parallel file systems, and compare it with Lustre's default file system checker, LFSCK. Our experiments show that FaultyRank achieves the same checking and repairing logic as LFSCK. Moreover, it can detect and repair complicated PFS consistency issues that LFSCK cannot handle. We also show the performance advantage of FaultyRank over LFSCK. Through this study, we believe FaultyRank opens a new opportunity for building PFS checkers effectively and efficiently.
{"title":"FaultyRank: A Graph-based Parallel File System Checker","authors":"Saisha Kamat, Abdullah Al Raqibul Islam, Mai Zheng, Dong Dai","doi":"10.1109/IPDPS54959.2023.00029","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00029","url":null,"abstract":"Similar to local file system checkers such as e2fsck for Ext4, a parallel file system (PFS) checker ensures the file system's correctness. The basic idea of file system checkers is straightforward: important metadata are stored redundantly in separate places for cross-checking; inconsistent metadata will be repaired or overwritten by its ‘more correct' counterpart, which is defined by the developers. Unfortunately, implementing the idea for PFSes is non-trivial due to the system complexity. Although many popular parallel file systems already contain dedicated checkers (e.g., LFSCK for Lustre, BeeGFS-FSCK for BeeGFS, mmfsck for GPFS), the existing checkers often cannot detect or repair inconsistencies accurately due to one fundamental limitation: they rely on a fixed set of consistency rules predefined by developers, which cannot cover the various failure scenarios that may occur in practice.In this study, we propose a new graph-based method to build PFS checkers. Specifically, we model important PFS metadata into graphs, then generalize the logic of cross-checking and repairing into graph analytic tasks. We design a new graph algorithm, FaultyRank, to quantitatively calculate the correctness of each metadata object. By leveraging the calculated correctness, we are able to recommend the most promising repairs to users. Based on the idea, we implement a prototype of FaultyRank on Lustre, one of the most widely used parallel file systems, and compare it with Lustre's default file system checker LFSCK. Our experiments show that FaultyRank can achieve the same checking and repairing logic as LFSCK. Moreover, it is capable of detecting and repairing complicated PFS consistency issues that LFSCK can not handle. We also show the performance advantage of FaultyRank compared with LFSCK. Through this study, we believe FaultyRank opens a new opportunity for building PFS checkers effectively and efficiently.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120962310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
IPDPS 2023 Technical Program Committee
Pub Date: 2023-05-01 | DOI: 10.1109/ipdps54959.2023.00009
{"title":"IPDPS 2023 Technical Program Committee","authors":"","doi":"10.1109/ipdps54959.2023.00009","DOIUrl":"https://doi.org/10.1109/ipdps54959.2023.00009","url":null,"abstract":"","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121057424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ZFP-X: Efficient Embedded Coding for Accelerating Lossy Floating Point Compression
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00107
Bing Lu, Yida Li, Junqi Wang, Huizhang Luo, Kenli Li
Today’s scientific simulations are confronting seriously limited I/O bandwidth, network bandwidth, and storage capacity because of the immense volumes of data generated on high-performance computing systems. Data compression has emerged as one of the most effective approaches to cope with the exponential increase of scientific data. However, existing state-of-the-art compressors are also confronting low throughput, especially under the trend of growing disparities between compute and I/O rates. Among these compressors, embedded coding is widely applied and accounts for the dominant share of their running time. In this work, we propose a new embedded coding algorithm and apply it as the backend embedded coding of ZFP, one of the most successful lossy compressors. Our algorithm uses bit groups instead of bit planes to store the compressed data, avoiding the time overhead of generating bit planes and performing group tests on them, which significantly reduces the running time of ZFP. It also accelerates ZFP's decompression, because the costly procedures of reversing group tests and reconstructing bit planes are likewise avoided. Moreover, we provide theoretical proof that the proposed coding algorithm achieves the same compression ratio as the baseline ZFP. Experiments with four representative real-world scientific simulation datasets show that the compression and decompression throughput of our solution is up to 2.5× (2.1× on average) and up to 2.1× (1.5× on average) that of ZFP, respectively.
{"title":"ZFP-X: Efficient Embedded Coding for Accelerating Lossy Floating Point Compression","authors":"Bing Lu, Yida Li, Junqi Wang, Huizhang Luo, Kenli Li","doi":"10.1109/IPDPS54959.2023.00107","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00107","url":null,"abstract":"Today’s scientific simulations are confronting seriously limited I/O bandwidth, network bandwidth, and storage capacity because of immense volumes of data generated in high-performance computing systems. Data compression has emerged as one of the most effective approaches to resolve the issue of the exponential increase of scientific data. However, existing state-of-the-art compressors also are confronting the issue of low throughput, especially under the trend of growing disparities between the compute and I/O rates. Among them, embedded coding is widely applied, which contributes to the dominant running time for the corresponding compressors. In this work, we propose a new kind of embedded coding algorithm, and apply it as the backend embedded coding of ZFP, one of the most successful lossy compressors. Our embedded coding algorithm uses bit groups instead of bit planes to store the compressed data, avoiding the time overhead of generating bit planes and group tests of bit planes, which significantly reduces the running time of ZFP. Our embedded coding algorithm can also accelerate the decompression of ZFP, because the costly procedures of the reverse of group tests and reconstructing bit planes are also avoided. Moreover, we provide theoretical proof that the proposed coding algorithm has the same compression ratio as the baseline ZFP. Experiments with four representative real-world scientific simulation datasets show that the compression and decompression throughput of our solution is up to 2.5× (2.1× on average), and up to 2.1× (1.5× on average) as those of ZFP, respectively.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129743435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Opportunities and Limitations of Hardware Timestamps in Concurrent Data Structures
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00068
Olivia Grimes, J. Nelson-Slivon, A. Hassan, R. Palmieri
Designing high-performance, highly-concurrent linearizable data structures is complex, especially when bulk operations (e.g., range queries) are included. Relying on a single source of synchronization, such as a logical global timestamp, unequivocally eases the design of the synchronization scheme. However, such a design creates a single point of contention and thus carries performance downsides. As a result, designers often face a dilemma between a simple design and a performance bottleneck. Recently, modern commodity architectures have introduced low-level mechanisms that guarantee that the timestamp registers of all CPUs are synchronized, enabling the use of hardware timestamps in data structure designs. Although recent work already exploits this capability, our work aims at understanding the opportunities and limitations of using hardware timestamps in existing data structure designs. We address this challenge by applying hardware timestamping to three recent state-of-the-art algorithms that use logical timestamps to support range queries in concurrent data structures. Our evaluation shows that hardware timestamps do indeed improve performance over the original designs, achieving up to a 5.5× improvement. More importantly, by removing the bottleneck of global logical timestamps in these algorithms, we highlight the design choices that most significantly impact the use of hardware timestamps. Specifically, we show that the mechanism for labeling objects with timestamps plays an important role in maximizing the benefits of leveraging hardware timestamps.
{"title":"Opportunities and Limitations of Hardware Timestamps in Concurrent Data Structures","authors":"Olivia Grimes, J. Nelson-Slivon, A. Hassan, R. Palmieri","doi":"10.1109/IPDPS54959.2023.00068","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00068","url":null,"abstract":"Designing high-performance, highly-concurrent linearizable data structures is complex, especially when bulk operations (e.g., range queries) are included. Relying on a single source of synchronization, such as a logical global timestamp, unequivocally eases the design of the synchronization schemes. However, such a design creates a single point of contention, and thus carries performance downsides. As a result, designers often face the dilemma between a simple design and a performance bottleneck. Recently, modern commodity architectures have enabled low-level mechanisms that guarantee that the timestamp registers of all CPUs are synchronized, thus enabling the use of hardware timestamps in data structure designs. Although recent work already exploits this, this work aims at understanding the opportunities and limitations of using hardware timestamps in existing data structure designs. We address this challenge by applying hardware time-stamping to three recent state-of-the-art algorithms that use logical timestamps to support range queries in concurrent data structures. Our evaluation shows that the use of hardware timestamps does indeed improve performance compared to the original designs, achieving up to 5.5x improvement. More importantly, by removing the bottleneck of using global logical timestamps in these algorithms, we highlight the design choices that most significantly impact the use of hardware timestamps. Specifically, we show that the mechanism of labeling objects with timestamps plays an important role in maximizing the benefits of leveraging hardware timestamps.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116375062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting Input Tensor Dynamics in Activation Checkpointing for Efficient Training on GPU
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00025
Jian Liao, Mingzhen Li, Hailong Yang, Qingxiao Sun, Biao Sun, Jiwei Hao, Tianyu Feng, F. Yu, Shengdong Chen, Ye Tao, Zicheng Zhang, Zhongzhi Luan, D. Qian
Larger deep learning models usually yield higher model quality, but at the cost of an ever-increasing GPU memory footprint. Although several tensor checkpointing techniques have been proposed to enable training under a restricted GPU memory budget, they fail to exploit input tensor dynamics arising from diverse datasets and subsequent data augmentation, and thus leave training optimizations on the table. In this paper, we propose Mimose, an input-aware tensor checkpointing planner that respects the memory budget while enabling efficient model training on GPU. Mimose builds a lightweight but accurate prediction model of GPU memory usage online, without pre-analyzing the model. It generates a tensor checkpointing plan based on per-layer memory prediction and applies it to the training process on the fly. Our experiments show that Mimose achieves superior training throughput compared to state-of-the-art checkpointing frameworks under the same GPU memory budgets.
{"title":"Exploiting Input Tensor Dynamics in Activation Checkpointing for Efficient Training on GPU","authors":"Jian Liao, Mingzhen Li, Hailong Yang, Qingxiao Sun, Biao Sun, Jiwei Hao, Tianyu Feng, F. Yu, Shengdong Chen, Ye Tao, Zicheng Zhang, Zhongzhi Luan, D. Qian","doi":"10.1109/IPDPS54959.2023.00025","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00025","url":null,"abstract":"Larger deep learning models usually lead to higher model quality, however with an ever-increasing GPU memory footprint. Although several tensor checkpointing techniques have been proposed to enable training under a restricted GPU memory budget, they fail to exploit the input tensor dynamics due to diverse datasets and subsequent data augmentation, and thus leave the training optimization on table. In this paper, we propose Mimose, an input-aware tensor checkpointing planner respecting the memory budget while enabling efficient model training on GPU. Mimose builds a lightweight but accurate prediction model of GPU memory usage online, without pre-analyzing the model. It generates a tensor checkpointing plan based on per-layer memory prediction and applies it to the training process on the fly. Our experiments show that Mimose achieves superior training throughput compared to state-of-the-art checkpointing frameworks under the same GPU memory budgets.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130754634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Faster Fully Homomorphic Encryption Implementation with Integer and Floating-point Computing Power of GPUs
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00085
Guang Fan, Fangyu Zheng, Lipeng Wan, Lili Gao, Yuan Zhao, Jiankuo Dong, Yixuan Song, Yuewu Wang, Jingqiang Lin
Fully Homomorphic Encryption (FHE) allows computations on encrypted data without knowledge of the plaintext message and is currently a focus of both academia and industry. However, its poor performance hinders large-scale application, creating an urgent need for high-performance FHE implementations. Noting the tremendous potential of GPUs in the field of cryptographic acceleration, this paper comprehensively investigates how to convert the computing resources residing in GPUs into FHE workhorses, and implements a full set of low-level and middle-level FHE primitives based on two arithmetic units (i.e., INT32 and FP64 units) with three types of data precision (i.e., INT32, INT64, and FP64). The paper gives a comprehensive evaluation and comparison of each road-map. Our implementations of fundamental functions outperform prior implementations on the same platform by 1.7× to 16.7×. Taking the CKKS FHE scheme as a case study, our implementation of homomorphic multiplication achieves a 3.2× speedup over the state-of-the-art GPU-based implementation, even accounting for platform differences. The detailed evaluation and comparison offer a vital reference for follow-up work in choosing appropriate underlying arithmetic units and important primitive optimizations in GPU-based FHE implementations.
{"title":"Towards Faster Fully Homomorphic Encryption Implementation with Integer and Floating-point Computing Power of GPUs","authors":"Guang Fan, Fangyu Zheng, Lipeng Wan, Lili Gao, Yuan Zhao, Jiankuo Dong, Yixuan Song, Yuewu Wang, Jingqiang Lin","doi":"10.1109/IPDPS54959.2023.00085","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00085","url":null,"abstract":"Fully Homomorphic Encryption (FHE) allows computations on encrypted data without knowledge of the plaintext message and currently has been the focus of both academia and industry. However, the performance issue hinders its large-scale application, highlighting the urgent requirements of high-performance FHE implementations.With noticing the tremendous potential of GPUs in the field of cryptographic acceleration, this paper comprehensively investigates how to convert the available computing resources residing in GPUs into FHE workhorses, and implement a full set of low-level and middle-level FHE primitives based on two arithmetic units (i.e., INT32 and FP64 units) with three types of data precision (i.e., INT32, INT64 and FP64). This paper gives a comprehensive evaluation and comparison based on each road-map. Our implementations of fundamental functions outperform the implementations on the same platform by 1.7× to 16.7×. Taking CKKS FHE schemes as a case study, our implementation of homomorphic multiplication achieves 3.2× speedup over the state-of-the-art GPU-based implementation, even considering the difference of platforms. The detailed evaluation and comparison of this paper would offer a vital reference for the follow-up work to choose appropriate underlying arithmetic units and important primitive optimizations in GPU-based FHE implementations.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128872436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Keynote: The Adventurous Life of a System Software Researcher
Pub Date: 2023-05-01 | DOI: 10.1109/ipdps54959.2023.00043
{"title":"Keynote: The Adventurous Life of a System Software Researcher","authors":"","doi":"10.1109/ipdps54959.2023.00043","DOIUrl":"https://doi.org/10.1109/ipdps54959.2023.00043","url":null,"abstract":"","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115730831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Sparse GPU Kernels for Accelerated Training of Graph Neural Networks
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00057
Ruibo Fan, Wei Wang, X. Chu
Graph Neural Networks (GNNs) have gained huge traction recently as they achieve state-of-the-art performance on various graph-related problems. GNN training typically follows the standard Message Passing Paradigm, in which SpMM and SDDMM are the two essential sparse kernels. However, existing sparse GPU kernels are inefficient and may suffer from load imbalance, the dynamics of GNN computing, poor memory efficiency, and the tail effect. We propose two new kernels, Hybrid-Parallel SpMM (HP-SpMM) and Hybrid-Parallel SDDMM (HP-SDDMM), that efficiently perform SpMM and SDDMM on GPUs with a unified hybrid parallel strategy that mixes nodes and edges. In view of emerging graph-sampling training, we design the Dynamic Task Partition (DTP) method to minimize the tail effect by exposing sufficient parallelism. We further devise the Hierarchical Vectorized Memory Access scheme to achieve aligned global memory accesses and enable vectorized instructions for improved memory efficiency. We also enhance data locality by reordering graphs with the Graph Clustering method. Experiments on extensive sparse matrices collected from real GNN applications demonstrate that our kernels achieve significant performance improvements over state-of-the-art implementations. We implement our sparse kernels in popular GNN frameworks and use them to train various GNN models, including the GCN model in full-graph mode and the GraphSAINT model in graph-sampling mode. Evaluation results show that our kernels can accelerate GNN training by up to 1.72×.
{"title":"Fast Sparse GPU Kernels for Accelerated Training of Graph Neural Networks","authors":"Ruibo Fan, Wei Wang, X. Chu","doi":"10.1109/IPDPS54959.2023.00057","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00057","url":null,"abstract":"Graph Neural Networks (GNNs) are gaining huge traction recently as they achieve state-of-the-art performance on various graph-related problems. GNN training typically follows the standard Message Passing Paradigm, in which SpMM and SDDMM are the two essential sparse kernels. However, existing sparse GPU kernels are inefficient and may suffer from load imbalance, dynamics in GNN computing, poor memory efficiency, and tail effect. We propose two new kernels, Hybrid-Parallel SpMM (HP-SpMM) and Hybrid-Parallel SDDMM (HP-SDDMM), that efficiently perform SpMM and SDDMM on GPUs with a unified hybrid parallel strategy of mixing nodes and edges. In view of the emerging graph-sampling training, we design the Dynamic Task Partition (DTP) method to minimize the tail effect by exposing sufficient parallelism. We further devise the Hierarchical Vectorized Memory Access scheme to achieve aligned global memory accesses and enable vectorized instructions for improved memory efficiency. We also propose to enhance data locality by reordering the graphs with the Graph Clustering method. Experiments on extensive sparse matrices collected from real GNN applications demonstrate that our kernels achieve significant performance improvements over state-of-the-art implementations. We implement our sparse kernels in popular GNN frameworks and use them to train various GNN models, including the GCN model in full-graph mode and the GraphSAINT model in graph-sampling mode. Evaluation results show that our kernels can accelerate GNN training by up to 1.72×.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114416105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lyra: Fast and Scalable Resilience to Reordering Attacks in Blockchains
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00097
Pouriya Zarbafian, V. Gramoli
Reordering blockchain transactions to manipulate markets has netted hackers hundreds of millions of dollars. Because they rely on State Machine Replication (SMR), blockchains order transactions without preventing hackers from influencing the chosen order. Some order-fair consensus protocols, like Pompē [33], order transactions before agreeing on this order. They are insufficient because a hacker can exploit the lack of triangle inequality among network latencies to observe pending transactions before issuing their own. Other DAG-based protocols, like Fino [24], use commit-reveal to obfuscate transactions, but cannot prevent reordering by a Byzantine leader. In this paper, we present Lyra, a protocol that solves this problem. The key idea is the combination of a commit-reveal protocol that obfuscates transaction payloads with a leaderless ordered consensus protocol that predicts the order of transactions. Lyra has optimal good-case latency, prevents reordering attacks, and is scalable. Finally, it improves on Pompē's latency by up to 2× and its throughput by up to 7× on a 100-node network spanning 3 continents.
{"title":"Lyra: Fast and Scalable Resilience to Reordering Attacks in Blockchains","authors":"Pouriya Zarbafian, V. Gramoli","doi":"10.1109/IPDPS54959.2023.00097","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00097","url":null,"abstract":"Reordering blockchain transactions to manipulate markets profited hackers by hundreds of millions of dollars. Because they rely on State Machine Replication (SMR), blockchains order transactions without preventing hackers from influencing the chosen order. Some order-fair consensus protocols, like Pompē [33], order transactions before agreeing on this order. They are insufficient because a hacker can leverage the lack of triangle inequality among network latencies to observe pending transactions before issuing their own. Other DAG-based protocols, like Fino [24], use commit-reveal to obfuscate transactions, but cannot prevent reordering by a Byzantine leader.In this paper, we present Lyra, a protocol that solves this problem. The key idea is the combination of a commit-reveal protocol to obfuscate transaction payloads, and a leaderless ordered consensus protocol that predicts the order of transactions. Lyra has optimal good-case latency, prevents reordering attacks, and is scalable. Finally, it outperforms the latency of Pompē by up to 2 times and its throughput by up to 7 times on a 100-node network over 3 continents.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128916416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Drill: Log-based Anomaly Detection for Large-scale Storage Systems Using Source Code Analysis
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00028
Di Zhang, Chris Egersdoerfer, Tabassum Mahmud, Mai Zheng, Dong Dai
Large-scale storage systems, a critical part of modern computing systems, are subject to various runtime bugs, failures, and anomalies in production. Identifying these anomalies at runtime is thus critical for users and administrators. Since runtime logs record the important status of the systems, log-based anomaly detection has been studied extensively for timely identification of system malfunctions. However, existing log-based anomaly detection solutions share common limitations in representing log entries accurately and robustly, and hence cannot effectively handle log entries that were not seen in historical logs, a common real-world scenario given the inherent rarity of logs and the continuous evolution of these systems. To address these issues, we propose Drill, a new log pre-processing method that generates high-quality vector representations of runtime logs by leveraging both storage-system-specific sentiment-classifying language models and log contexts built from source code. Through extensive evaluations on two representative distributed storage systems (Apache HDFS and Lustre), we show that Drill achieves up to a 41% improvement over state-of-the-art anomaly detection solutions, making it a promising solution for general anomaly detection.
{"title":"Drill: Log-based Anomaly Detection for Large-scale Storage Systems Using Source Code Analysis","authors":"Di Zhang, Chris Egersdoerfer, Tabassum Mahmud, Mai Zheng, Dong Dai","doi":"10.1109/IPDPS54959.2023.00028","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00028","url":null,"abstract":"Large-scale storage systems, a critical part of modern computing systems, are subject to various runtime bugs, failures, and anomalies in production. Identifying their anomalies at runtime is thus critical for users and administrators. Since runtime logs record the important status of the systems, log-based anomaly detection has been studied extensively for timely identifying system malfunctions. However, existing log-based anomaly detection solutions share common limitations in representing log entries accurately and robustly, hence can not effectively handle log entries that were not seen in the historical logs, which is a common real-world scenario due to logs' inherent rarity and the continuous evolution of the systems. To address the issues of existing methods, we propose Drill, a new log pre-processing method to generate high-quality vector representation of runtime logs by leveraging both storage system-specific sentiment-classifying language models and log contexts built from the source code. Through extensive evaluations of two representative distributed storage systems (Apache HDFS and Lustre), we show that Drill can achieve up to 41% improvement when compared with state-of-the-art anomaly detection solutions, showing it is a promising solution for general anomaly detection.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116784500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}