ArkFS: A Distributed File System on Object Storage for Archiving Data in HPC Environment
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00038
Kyu-Jin Cho, Injae Kang, Jin-Soo Kim
As burst buffers are widely deployed in HPC (High-Performance Computing) systems, the distributed file system layer is taking on the role of campaign storage, where scalability and cost-effectiveness are of paramount importance. However, centralized metadata management in the distributed file system layer poses a scalability challenge. Object storage systems have emerged as an alternative thanks to their simplified interface and scale-out architecture. Despite this, HPC communities are used to working with the POSIX interface to organize their files into a global directory hierarchy and to control access through access control lists. In this paper, we present ArkFS, a near-POSIX-compliant and scalable distributed file system implemented on top of an object storage system. ArkFS achieves high scalability without any centralized metadata servers; instead, it lets each client manage a portion of the file system metadata on a per-directory basis. ArkFS supports any distributed object storage system, such as Ceph RADOS or an S3-compatible system, given an appropriate API translation module. Our experimental results indicate that ArkFS delivers significant performance improvements under metadata-intensive workloads while showing near-linear scalability. We also demonstrate that ArkFS is well suited to handling the bursty I/O traffic coming from the burst buffer layer to archive cold data.
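The abstract includes no code, but the per-directory metadata idea can be illustrated with a minimal Python sketch, assuming a toy key-value object store: each directory's metadata lives in its own object, owned and updated by the client that manages that directory. The class names and the JSON layout below are hypothetical illustrations, not ArkFS's actual API.

```python
import json

class ObjectStore:
    """Toy stand-in for a distributed object store (e.g., Ceph RADOS or S3)."""
    def __init__(self):
        self._objects = {}

    def put(self, key, data: bytes):
        self._objects[key] = data

    def get(self, key) -> bytes:
        return self._objects[key]

class DirMetadata:
    """Hypothetical per-directory metadata record managed by a single client."""
    def __init__(self, store: ObjectStore, dir_path: str):
        self.store = store
        self.key = f"meta:{dir_path}"      # one metadata object per directory
        self.entries = {}                  # name -> attributes (mode, size, ...)

    def create_file(self, name, mode=0o644):
        self.entries[name] = {"mode": mode, "size": 0}
        self._flush()                      # persist metadata without any central server

    def _flush(self):
        self.store.put(self.key, json.dumps(self.entries).encode())

store = ObjectStore()
home = DirMetadata(store, "/home/alice")
home.create_file("results.dat")
print(json.loads(store.get("meta:/home/alice")))
```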
{"title":"ArkFS: A Distributed File System on Object Storage for Archiving Data in HPC Environment","authors":"Kyu-Jin Cho, Injae Kang, Jin-Soo Kim","doi":"10.1109/IPDPS54959.2023.00038","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00038","url":null,"abstract":"As the burst buffer is being widely deployed in the HPC (High-Performance Computing) systems, the distributed file system layer is taking the role of campaign storage where scalability and cost-effectiveness are of paramount importance. However, the centralized metadata management in the distributed file system layer poses a scalability challenge. The object storage system has emerged as an alternative thanks to its simplified interface and scale-out architecture. Despite this, the HPC communities are used to working with the POSIX interface to organize their files into a global directory hierarchy and control access through access control lists.In this paper, we present ArkFS, a near-POSIX compliant and scalable distributed file system implemented on top of the object storage system. ArkFS achieves high scalability without any centralized metadata servers. Instead, ArkFS lets each client manage a portion of the file system metadata on a per-directory basis. ArkFS supports any distributed object storage system such as Ceph RADOS or S3-compatible system with an appropriate API translation module. Our experimental results indicate that ArkFS shows significant performance improvement under metadata-intensive workloads while showing near-linear scalability. We also demonstrate that ArkFS is suitable for handling the bursty I/O traffic coming from the burst buffer layer to archive cold data.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126909430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present GraphTensor, a comprehensive open-source framework that supports efficient parallel neural network processing on large graphs. GraphTensor offers a set of easy-to-use programming primitives that account for both graph and neural network execution behaviors from the beginning (graph sampling) to the end (dense data processing). Our framework runs diverse graph neural network (GNN) models in a destination-centric, feature-wise manner, which can significantly shorten training execution times on a GPU. In addition, GraphTensor rearranges multiple GNN kernels based on their system hyperparameters in a self-governing manner, further reducing the processing dimensionality and latencies. From the end-to-end execution viewpoint, GraphTensor significantly shortens service-level GNN latency by applying pipeline parallelism to graph dataset preprocessing. Our evaluation shows that GraphTensor exhibits 1.4× better training performance than emerging GNN frameworks when executing large-scale, real-world graph workloads. For end-to-end services, GraphTensor reduces the training latencies of an advanced version of these GNN frameworks (optimized for multi-threaded graph sampling) by 2.4×, on average.
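As a rough illustration of the destination-centric, feature-wise aggregation described above, the following NumPy sketch groups neighbor features by destination node before a dense transform. The function name and the scatter-add formulation are assumptions made for illustration, not GraphTensor's actual primitives.

```python
import numpy as np

# Toy graph: directed edge list (src, dst) and a node feature matrix.
edges = np.array([[0, 1], [2, 1], [1, 3], [0, 3]])   # 4 directed edges
feats = np.random.rand(4, 8)                         # 4 nodes, 8 features each

def destination_centric_aggregate(edges, feats):
    """Sum neighbor features per destination node (mean/max are analogous).
    Illustrative only; GraphTensor additionally fuses this with the dense
    stage and pipelines graph sampling with preprocessing."""
    out = np.zeros_like(feats)
    src, dst = edges[:, 0], edges[:, 1]
    np.add.at(out, dst, feats[src])      # scatter-add grouped by destination node
    return out

agg = destination_centric_aggregate(edges, feats)
hidden = agg @ np.random.rand(8, 16)     # dense stage: linear transform of aggregated features
print(hidden.shape)                      # (4, 16)
```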
{"title":"GraphTensor: Comprehensive GNN-Acceleration Framework for Efficient Parallel Processing of Massive Datasets","authors":"Junhyeok Jang, Miryeong Kwon, Donghyun Gouk, Hanyeoreum Bae, Myoungsoo Jung","doi":"10.1109/IPDPS54959.2023.00011","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00011","url":null,"abstract":"We present GraphTensor, a comprehensive open-source framework that supports efficient parallel neural network processing on large graphs. GraphTensor offers a set of easy-to-use programming primitives that appreciate both graph and neural network execution behaviors from the beginning (graph sampling) to the end (dense data processing). Our framework runs diverse graph neural network (GNN) models in a destination-centric, feature-wise manner, which can significantly shorten training execution times in a GPU. In addition, GraphTensor rearranges multiple GNN kernels based on their system hyperparameters in a self-governing manner, thereby reducing the processing dimensionality and the latencies further. From the end-to-end execution viewpoint, GraphTensor significantly shortens the service-level GNN latency by applying pipeline parallelism for efficient graph dataset preprocessing. Our evaluation shows that GraphTensor exhibits 1.4× better training performance than emerging GNN frameworks under the execution of large-scale, real-world graph workloads. For the end-to-end services, GraphTensor reduces training latencies of an advanced version of the GNN frameworks (optimized for multi-threaded graph sampling) by 2.4×, on average.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"183 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121835356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TurboHE: Accelerating Fully Homomorphic Encryption Using FPGA Clusters
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00084
Haohao Liao, Mahmoud A. Elmohr, Xuan Dong, Yanjun Qian, Wenzhe Yang, Zhiwei Shang, Yin Tan
With the burgeoning demand for cloud computing in various fields and rising attention to sensitive data exposure, Fully Homomorphic Encryption (FHE) is gaining popularity as a potential solution for privacy protection. By performing computations directly on the ciphertext (encrypted data) without decrypting it, FHE can guarantee the security of data throughout its lifecycle without compromising privacy. However, the excruciatingly slow speed of FHE schemes makes adopting them impractical in real-life applications, so hardware accelerators are needed to mitigate the problem. Among various hardware platforms, FPGA clusters are particularly promising because of their flexibility and ready availability at many cloud providers as FPGA-as-a-Service (FaaS); reusing this existing infrastructure can greatly facilitate the deployment of FHE on the cloud. In this paper, we present TurboHE, the first hardware accelerator for FHE operations based on an FPGA cluster. TurboHE aims to boost the performance of CKKS, one of the fastest FHE schemes and the one most suitable for machine learning applications, by accelerating its computationally intensive and frequently used operation: relinearization. The proposed scalable architecture, based on hardware partitioning, can be easily configured to accommodate high acceleration requirements for relinearization with very large CKKS parameters. As a demonstration, an implementation supporting 32,768 polynomial coefficients and a coefficient bitwidth of 594, decomposed into 11 Residue Number System (RNS) components, was deployed on a cluster of 9 Xilinx VU13P FPGAs. The cluster operated at 200 MHz and achieved 1096× the throughput of a single-threaded CPU implementation. Moreover, the low-level hardware components implemented in this work, such as the NTT module, can also be applied to accelerate other lattice-based cryptography schemes.
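The RNS decomposition mentioned above (a 594-bit coefficient split into 11 word-sized residues) can be illustrated with a small Python sketch using toy moduli; the moduli and helper names are placeholders, not TurboHE's actual parameters.

```python
from math import prod

# Small pairwise-coprime moduli stand in for the 11 word-sized RNS primes
# used in the 594-bit coefficient decomposition.
moduli = [97, 101, 103]

def to_rns(x, moduli):
    """Decompose an integer into its residues (one limb per modulus)."""
    return [x % q for q in moduli]

def from_rns(residues, moduli):
    """Reconstruct the integer with the Chinese Remainder Theorem."""
    M = prod(moduli)
    x = 0
    for r, q in zip(residues, moduli):
        Mq = M // q
        x += r * Mq * pow(Mq, -1, q)     # pow(..., -1, q) is the modular inverse
    return x % M

coeff = 123456
limbs = to_rns(coeff, moduli)
assert from_rns(limbs, moduli) == coeff  # residue-wise arithmetic stays exact below M
print(limbs)
```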
{"title":"TurboHE: Accelerating Fully Homomorphic Encryption Using FPGA Clusters","authors":"Haohao Liao, Mahmoud A. Elmohr, Xuan Dong, Yanjun Qian, Wenzhe Yang, Zhiwei Shang, Yin Tan","doi":"10.1109/IPDPS54959.2023.00084","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00084","url":null,"abstract":"With the burgeoning demands for cloud computing in various fields followed by the rising attention to sensitive data exposure, Fully Homomorphic Encryption (FHE) is gaining popularity as a potential solution to privacy protection. By performing computations directly on the ciphertext (encrypted data) without decrypting it, FHE can guarantee the security of data throughout its lifecycle without compromising the privacy. However, the excruciatingly slow speed of FHE scheme makes adopting it impractical in real life applications. Therefore, hardware accelerators come to the rescue to mitigate the problem. Among various hardware platforms, FPGA clusters are particularly promising because of their flexibility and ready availability at many cloud providers such as FPGA-as-a-Service (FaaS). Hence, reusing the existing infrastructure can greatly facilitate the implementation of FHE on the cloud.In this paper, we present TurboHE, the first hardware accelerator for FHE operations based on an FPGA cluster. TurboHE aims to boost the performance of CKKS, one of the fastest FHE schemes which is most suitable to machine learning applications, by accelerating its computationally intensive and frequently used operation: relinearization. The proposed scalable architecture based on hardware partitioning can be easily configured to accommodate high acceleration requirements for relinearization with very large CKKS parameters. As a demonstration, an implementation, which supports 32,768 polynomial coefficients and a coefficient bitwidth of 594 decomposed into 11 Residue Number System (RNS) components, was deployed on a cluster consisting of 9 Xilinx VU13P FPGAs. The cluster operated at 200 MHz and achieved 1096 times throughput compared with a single threaded CPU implementation. Moreover, the low level hardware components implemented in this work such as the NTT module can also be applied to accelerate other lattice-based cryptography schemes.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132871307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Adaptive Hybrid Quantum Algorithm for the Metric Traveling Salesman Problem
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00082
Fei Li, A. Mazumder
In this paper, we design, analyze, and evaluate a hybrid quantum algorithm for the metric traveling salesman problem (TSP). TSP is a well-studied NP-complete problem for which many algorithmic techniques have been developed, on both classical and quantum computers. Existing algorithms for TSP are neither adaptive to the input data nor suitable for processing medium-size data on modern classical and quantum machines. In this work, we leverage the classical computers' strength (large memory) and the quantum computers' strength (quantum parallelism), based on the input data, to reduce the hybrid algorithm's overall running time. Our algorithmic ideas include trimming the input data efficiently using a classical algorithm, finding an optimal solution for the post-processed data using a quantum-only algorithm, and constructing an optimal solution for the untrimmed input efficiently using a classical algorithm. We conduct experiments comparing our hybrid algorithm against state-of-the-art classical and quantum algorithms on real data sets. The experimental results show that our solution outperforms the others, confirming our theoretical analysis. This work provides insightful quantitative tools for practitioners and compilers to choose appropriate quantum, classical, or hybrid algorithms, especially in the NISQ (noisy intermediate-scale quantum) era, for NP-complete problems such as TSP.
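The three-stage structure described above (classical trimming, exact solve on the reduced instance, classical reconstruction) can be sketched as follows, with a brute-force solver standing in for the quantum subroutine. Every function here is a hypothetical placeholder rather than the paper's algorithm.

```python
from itertools import permutations

def trim(points, keep):
    """Classical pre-processing stand-in: keep a subset of cities.
    The paper's trimming is data-dependent; this placeholder just truncates."""
    return points[:keep]

def solve_exact(points):
    """Stand-in for the quantum subroutine: brute-force the reduced instance."""
    n = len(points)
    dist = lambda a, b: ((a[0] - b[0])**2 + (a[1] - b[1])**2) ** 0.5
    best_tour, best_len = None, float("inf")
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        length = sum(dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len

cities = [(0, 0), (1, 0), (1, 1), (0, 1), (2, 2)]
reduced = trim(cities, keep=4)
tour, length = solve_exact(reduced)   # classical post-processing would re-insert trimmed cities
print(tour, round(length, 3))
```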
{"title":"An Adaptive Hybrid Quantum Algorithm for the Metric Traveling Salesman Problem","authors":"Fei Li, A. Mazumder","doi":"10.1109/IPDPS54959.2023.00082","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00082","url":null,"abstract":"In this paper, we design, analyze, and evaluate a hybrid quantum algorithm for the metric traveling salesman problem (TSP). TSP is a well-studied NP-complete problem that many algorithmic techniques have been developed for, on both classic computers and quantum computers. The existing literature of algorithms for TSP are neither adaptive to input data nor suitable for processing medium-size data on the modern classic and quantum machines. In this work, we leverage the classic computers’ power (large memory) and the quantum computers’ power (quantum parallelism), based on the input data, to fasten the hybrid algorithm’s overall running time. Our algorithmic ideas include trimming the input data efficiently using a classic algorithm, finding an optimal solution for the post-processed data using a quantum-only algorithm, and constructing an optimal solution for the untrimmed data input efficiently using a classic algorithm. We conduct experiments to compare our hybrid algorithm against the state-of-the-art classic and quantum algorithms on real data sets. The experimental results show that our solution truly outperforms the others and thus confirm our theoretical analysis. This work provides insightful quantitative tools for people and compilers to choose appropriate quantum or classical or hybrid algorithms, especially in the NISQ (noisy intermediate-scale quantum) era, for NP-complete problems such as TSP.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130557955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Duo: Improving Data Sharing of Stateful Serverless Applications by Efficiently Caching Multi-Read Data
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00092
Zhuo Huang, Haoqiang Fan, Chaoyi Cheng, Song Wu, Hai Jin
A growing number of applications are moving to serverless architectures for high elasticity and fine-grained billing. For stateful applications, however, the use of serverless architectures is likely to lead to significant performance degradation, as frequent data sharing between different execution stages involves time-consuming remote storage access. Current platforms leverage memory caches to speed up remote access, but conventional caching strategies show limited performance improvement. We experimentally find that the reason is that current strategies overlook the stage-dependent access patterns of stateful serverless applications: data that are read multiple times across stages (denoted multi-read data) are wrongly evicted by data that are read only once (denoted read-once data), causing a high cache miss ratio. Accordingly, we propose a new caching strategy, Duo, whose design principle is to cache multi-read data as long as possible. Specifically, Duo contains a large cache list and a small cache list, which act as the Leader list and the Wingman list, respectively. The Leader list ignores data that are read for the first time, preventing it from being polluted by the massive read-once data at each stage. The Wingman list inspects data that are ignored or evicted by the Leader list and prefetches data that will probably be read again, based on the observation that multi-read data usually appear periodically in groups. Compared to state-of-the-art works, Duo improves the hit ratio by 1.1×–2.1× and reduces data sharing overhead by 25%–62%.
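A minimal sketch of the Leader/Wingman idea follows, assuming simple LRU lists: first-time reads land in the small Wingman list and are promoted to the Leader list only on a second read, so read-once data cannot pollute the Leader. Duo's group-based prefetching is omitted, and all names and sizes are illustrative.

```python
from collections import OrderedDict

class DuoLikeCache:
    """Toy two-list cache: a large Leader LRU that admits only re-read data,
    and a small Wingman LRU that catches first-time reads (illustrative only)."""
    def __init__(self, leader_size=4, wingman_size=2):
        self.leader = OrderedDict()
        self.wingman = OrderedDict()
        self.leader_size, self.wingman_size = leader_size, wingman_size

    def get(self, key):
        for lst in (self.leader, self.wingman):
            if key in lst:
                lst.move_to_end(key)       # refresh LRU position
                if lst is self.wingman:    # second read: promote to the Leader list
                    self._admit(self.leader, self.leader_size, key, lst.pop(key))
                return True                # hit
        return False                       # miss

    def put(self, key, value):
        # First-time reads go to the Wingman so read-once data cannot
        # pollute the Leader list.
        self._admit(self.wingman, self.wingman_size, key, value)

    @staticmethod
    def _admit(lst, cap, key, value):
        lst[key] = value
        if len(lst) > cap:
            lst.popitem(last=False)        # evict the LRU entry

cache = DuoLikeCache()
for obj in ["a", "b", "a", "c", "a"]:
    if not cache.get(obj):
        cache.put(obj, obj.upper())
print(list(cache.leader), list(cache.wingman))   # 'a' promoted after its second read
```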
{"title":"Duo: Improving Data Sharing of Stateful Serverless Applications by Efficiently Caching Multi-Read Data","authors":"Zhuo Huang, Haoqiang Fan, Chaoyi Cheng, Song Wu, Hai Jin","doi":"10.1109/IPDPS54959.2023.00092","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00092","url":null,"abstract":"A growing number of applications are moving to serverless architectures for high elasticity and fine-grained billing. For stateful applications, however, the use of serverless architectures is likely to lead to significant performance degradation, as frequent data sharing between different execution stages involves time-consuming remote storage access. Current platforms leverage memory cache to speed up remote access. However, conventional caching strategies show limited performance improvement. We experimentally find that the reason is that current strategies overlook the stage-dependent access patterns of stateful serverless applications, i.e., data that are read multiple times across stages (denoted as multi-read data) are wrongly evicted by data that are read only once (denoted as read-once data), causing a high cache miss ratio.Accordingly, we propose a new caching strategy, Duo, whose design principle is to cache multi-read data as long as possible. Specifically, Duo contains a large cache list and a small cache list, which act as Leader list and Wingman list, respectively. Leader list ignores the data that is read for the first time to prevent itself from being polluted by massive read-once data at each stage. Wingman list inspects the data that are ignored or evicted by Leader list, and pre-fetches the data that will probably be read again based on the observation that multi-read data usually appear periodically in groups. Compared to the state-of-the-art works, Duo improves hit ratio by 1.1×-2.1× and reduces the data sharing overhead by 25%-62%.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127778306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00023
Qinghua Zhou, Quentin G. Anthony, Lang Xu, A. Shafi, M. Abduljabbar, H. Subramoni, Dhabaleswar K. Panda
Fully Sharded Data Parallel (FSDP) technology achieves higher performance by scaling out data-parallel training of Deep Learning (DL) models. It shards the model parameters, gradients, and optimizer states among multiple GPUs. Consequently, it requires data-intensive Allgather and Reduce-Scatter communication to share the model parameters, which becomes a bottleneck. Existing schemes that use GPU-aware MPI libraries are highly prone to saturating the interconnect bandwidth, so integrating GPU-based compression into MPI libraries has proven effective for achieving faster training times. In this paper, we propose an optimized Ring algorithm for the Allgather and Reduce-Scatter collectives that incorporates an efficient collective-level online compression scheme. At the microbenchmark level, Allgather achieves benefits of up to 83.6% and 30.3% compared to the baseline and existing point-to-point-based compression, respectively, in a state-of-the-art MPI library on modern GPU clusters; Reduce-Scatter achieves 88.1% and 40.6%, respectively. For distributed DL training with PyTorch-FSDP, our approach yields 31.7% faster training than the baseline, and up to 12.5% faster than the existing point-to-point-based compression, while maintaining similar accuracy.
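To make the collective-level compression idea concrete, here is a sketch that simulates a ring Allgather over in-process "ranks", compressing each contribution once before it enters the ring and decompressing it at the receiver. zlib stands in for the paper's GPU-based compression, and the layout is an assumption, not the proposed MPI implementation.

```python
import zlib
import numpy as np

def ring_allgather_compressed(chunks):
    """Simulate a ring Allgather over len(chunks) ranks.
    Each rank contributes one chunk; chunks travel the ring in compressed
    form and are decompressed only where they are consumed."""
    p = len(chunks)
    # Each rank compresses its own contribution once, at collective level.
    send_buf = [zlib.compress(c.tobytes()) for c in chunks]
    gathered = [{rank: chunks[rank]} for rank in range(p)]
    for step in range(p - 1):
        recv_buf = [None] * p
        for rank in range(p):
            left = (rank - 1) % p
            recv_buf[rank] = send_buf[left]            # receive the left neighbor's buffer
            owner = (left - step) % p                  # whose chunk arrives at this step
            data = np.frombuffer(zlib.decompress(recv_buf[rank]), dtype=chunks[owner].dtype)
            gathered[rank][owner] = data
        send_buf = recv_buf                            # forward what was just received
    return gathered

parts = [np.full(4, r, dtype=np.float32) for r in range(4)]   # rank r contributes a chunk of r's
result = ring_allgather_compressed(parts)
print(sorted(result[2].keys()))                                # rank 2 now holds all 4 chunks
```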
{"title":"Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication","authors":"Qinghua Zhou, Quentin G. Anthony, Lang Xu, A. Shafi, M. Abduljabbar, H. Subramoni, Dhabaleswar K. Panda","doi":"10.1109/IPDPS54959.2023.00023","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00023","url":null,"abstract":"Fully Sharded Data Parallel (FSDP) technology achieves higher performance by scaling out data-parallel training of Deep Learning (DL) models. It shards the model parameters, gradients, and optimizer states of the model among multiple GPUs. Consequently, this requires data-intensive Allgather and Reduce-Scatter communication to share the model parameters, which becomes a bottleneck. Existing schemes that use GPU-aware MPI libraries are highly prone to saturating the interconnect bandwidth. Therefore, integrating GPU-based compression into MPI libraries has proven efficient to achieve faster training time. In this paper, we propose an optimized Ring algorithm of Allgather and Reduce-Scatter collectives that encompass an efficient collective-level online compression scheme. At the microbenchmark level, Allgather achieves benefits of up to 83.6% and 30.3% compared to the baseline and existing point-to-point-based compression in a state-of-the-art MPI library on modern GPU clusters. Reduce-Scatter achieves 88.1% and 40.6% compared to baseline and point-to-point compression, respectively. For distributed DL training with PyTorch-FSDP, our approach yields 31.7% faster training than the baseline, and up to 12.5% compared to the existing point-to-point-based compression while maintaining similar accuracy.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128058401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph Neural Networks (GNNs) have emerged as a powerful and popular machine learning model for numerous application domains. Each stage of a GNN requires an aggregation (sparse matrix-matrix multiplication) and a linear operation (dense matrix-matrix multiplication). Numerous efforts have addressed the development of distributed implementations for GNNs. Although efficient algorithms for distributed matrix multiplication are well known, the challenge here is the collective optimization of the sequences of distributed matrix-matrix multiplications required for GNNs, where many degrees of freedom also exist in the ordering of the component matrix-multiplication operations. This paper develops a new approach to distributed GNN, ReDistribution of Matrices (RDM), centered around communication-free distributed matrix multiplication enabled by matrix redistribution between GNN stages. While the approach is applicable to the numerous algorithmic variants of GNNs, the experimental evaluation focuses on GCN (Graph Convolutional Network), including both full-batch training and sampling-based training using GraphSAINT. Experimental evaluation with 2-layer and 3-layer GCNs, using 128 or 256 hidden features, across eight sparse datasets, on a multi-GPU system with 8 GPUs, shows that RDM attains a geometric-mean speedup of between 2× and 3.7× over two state-of-the-art multi-GPU GCN implementations, CAGNET and DGCL.
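A simplified NumPy sketch of the redistribution idea: the dense stage runs communication-free on row-partitioned features with replicated weights, and a redistribution step then reshuffles the distributed result into the layout the next stage wants. The partitioning choices and names are assumptions for illustration, not RDM's actual scheme.

```python
import numpy as np

P = 2                                               # number of simulated devices
H = np.arange(8 * 4, dtype=float).reshape(8, 4)     # node features: 8 nodes, 4 hidden dims
W = np.random.rand(4, 3)                            # dense weights, replicated on every device

# Dense stage: with H row-partitioned and W replicated, each device computes
# its block of H @ W with no communication at all.
row_parts = np.array_split(H, P, axis=0)
local_out = [h_local @ W for h_local in row_parts]
print(np.allclose(np.concatenate(local_out, axis=0), H @ W))   # True: no exchange was needed

def redistribute(blocks, parts=P):
    """Between stages, reshuffle the distributed matrix into the layout the
    next multiplication wants (here: row blocks -> column blocks). In RDM,
    this redistribution is where the communication happens."""
    full = np.concatenate(blocks, axis=0)
    return np.array_split(full, parts, axis=1)

col_parts = redistribute(local_out)                 # layout for the next (aggregation) stage
print([b.shape for b in col_parts])                 # e.g. [(8, 2), (8, 1)]
```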
{"title":"Communication Optimization for Distributed Execution of Graph Neural Networks","authors":"Süreyya Emre Kurt, Jinghua Yan, Aravind Sukumaran-Rajam, Prashant Pandey, P. Sadayappan","doi":"10.1109/IPDPS54959.2023.00058","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00058","url":null,"abstract":"Graph Neural Networks (GNNs) have emerged as a very powerful and popular machine learning model for numerous application domains. Each stage of a GNN requires an aggregation (sparse matrix-matrix multiplication) and a linear operation (dense matrix-matrix multiplication). Numerous efforts have addressed the development of distributed implementations for GNNs. Although efficient algorithms for distributed matrix multiplication are well known, the challenge here is the collective optimization of sequences of distributed matrix-matrix multiplications required for GNN, where many degrees of freedom also exist in the ordering of the component matrix-multiplication operations.This paper develops a new approach to distributed GNN, ReDistribution of Matrices (RDM), centered around communication-free distributed matrix-multiplication enabled by matrix redistribution between GNN stages. While the approach is applicable to the numerous algorithmic variants of GNN, the experimental evaluation focuses on GCN (Graph Convolutional Network), including both full-batch training as well as sampling-based training using GraphSAINT. Experimental evaluation with 2-layer and 3-layer GCN, using 128 or 256 hidden features, across eight sparse datasets, on a multi-GPU system with 8 GPUs shows that RDM attains a geometric mean speedup between 2× and 3.7× over two state-of-the-art multi-GPU GCN implementations, CAGNET and DGCL.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129667002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memory-aware Optimization for Sequences of Sparse Matrix-Vector Multiplications
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00046
Yichen Zhang, Shengguo Li, Fan Yuan, Dezun Dong, Xiaojian Yang, Tiejun Li, Z. Wang
This paper presents a novel approach to optimizing multiple invocations of a sparse matrix-vector multiplication (SpMV) kernel performed on the same sparse matrix A and dense vector x, such as Ax, A²x, ⋯, Aᵏx, and their linear combinations such as Ax + A²x. Such computations are frequently used in scientific applications for solving linear equations and in multi-grid methods. Existing SpMV optimization techniques typically focus on a single SpMV invocation and do not consider opportunities for optimization across a sequence of SpMV operations (SSpMV), leaving much room for performance improvement. Our work aims to bridge this performance gap. It achieves this by partitioning the sparse matrix into submatrices and devising a new computation pipeline that reduces memory accesses to the sparse matrix and exploits the data locality of the dense vector of SpMV. Additionally, we demonstrate how our approach can be integrated with parallelization schemes to further improve performance. We evaluate our approach on four distinct multi-core systems, including three ARM platforms and one Intel platform. Experimental results show that our techniques improve on the standard implementation and the highly optimized Intel Math Kernel Library (MKL) by a large margin.
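For reference, the SSpMV computation pattern itself (the baseline the paper optimizes) looks like the following SciPy sketch; each step re-reads the whole sparse matrix, which is exactly the memory traffic the proposed submatrix pipeline reduces. The helper name is illustrative.

```python
import numpy as np
import scipy.sparse as sp

def spmv_sequence(A, x, k):
    """Naive baseline for the SSpMV pattern: y_i = A^i x for i = 1..k.
    Each step streams all of A from memory again."""
    ys, y = [], x
    for _ in range(k):
        y = A @ y
        ys.append(y)
    return ys

rng = np.random.default_rng(0)
A = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
x = rng.standard_normal(1000)

y1, y2, y3 = spmv_sequence(A, x, k=3)
combo = y1 + y2                                   # linear combination Ax + A²x
print(np.allclose(combo, A @ x + A @ (A @ x)))    # True
```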
{"title":"Memory-aware Optimization for Sequences of Sparse Matrix-Vector Multiplications","authors":"Yichen Zhang, Shengguo Li, Fan Yuan, Dezun Dong, Xiaojian Yang, Tiejun Li, Z. Wang","doi":"10.1109/IPDPS54959.2023.00046","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00046","url":null,"abstract":"This paper presents a novel approach to optimize multiple invocations of a sparse matrix-vector multiplication (SpMV) kernel performed on the same sparse matrix A and dense vector x, like Ax, A2x, ⋯, Akx, and their linear combinations such as Ax + A2x. Such computations are frequently used in scientific applications for solving linear equations and in multi-grid methods. Existing SpMV optimization techniques typically focus on a single SpMV invocation and do not consider opportunities for optimization across a sequence of SpMV operations (SSpMV), leaving much room for performance improvement. Our work aims to bridge this performance gap. It achieve this by partitioning the sparse matrix into submatrices and devising a new computation pipeline that reduces memory access to the sparse matrix and exploits the data locality of the dense vector of SpMV. Additionally, we demonstrate how our approach can be integrated with parallelization schemes to further improve performance. We evaluate our approach on four distinct multi-core systems, including three ARM and one Intel platform. Experimental results show that our techniques improve the standard implementation and the highly-optimized Intel math kernel library (MKL) by a large margin.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129307949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SLAP: An Adaptive, Learned Admission Policy for Content Delivery Network Caching
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00053
Ke Liu, Kan Wu, Hua Wang, Ke Zhou, Ji Zhang, Cong Li
"Learned" admission policies have shown promise in improving Content Delivery Network (CDN) cache performance and lowering operational costs. Unfortunately, existing learned policies are optimized with a few fixed cache sizes while in reality, cache sizes often vary over time in an unpredictable manner. As a result, existing solutions cannot provide consistent benefits in production settings.We present SLAP, a learned CDN cache admission approach based on segmented object reuse time prediction. SLAP predicts an object’s reuse time range using the Long-Short-Term-Memory model and admits objects that will be reused (before eviction) given the current cache size. SLAP separates model training from cache size, allowing it to adapt to arbitrary sizes. The key to our solution is a novel segmented labeling scheme that enables SLAP to precisely predict object reuse time. To further make SLAP a practical and efficient solution, we propose aggressive reusing of computation and training on sampled traces to optimize model training, and a specialized predictor architecture that overlaps prediction computation with miss object fetching to optimize model inference. Our experiments with production CDN traces show that SLAP achieves significantly lower write traffic (38%-59%), longer SSDs service life (104%-178%), a consistently higher hit rate (3.2%-11.7%), and requires no effort to adapt to changing cache sizes, outperforming existing policies.
{"title":"SLAP: An Adaptive, Learned Admission Policy for Content Delivery Network Caching","authors":"Ke Liu, Kan Wu, Hua Wang, Ke Zhou, Ji Zhang, Cong Li","doi":"10.1109/IPDPS54959.2023.00053","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00053","url":null,"abstract":"\"Learned\" admission policies have shown promise in improving Content Delivery Network (CDN) cache performance and lowering operational costs. Unfortunately, existing learned policies are optimized with a few fixed cache sizes while in reality, cache sizes often vary over time in an unpredictable manner. As a result, existing solutions cannot provide consistent benefits in production settings.We present SLAP, a learned CDN cache admission approach based on segmented object reuse time prediction. SLAP predicts an object’s reuse time range using the Long-Short-Term-Memory model and admits objects that will be reused (before eviction) given the current cache size. SLAP separates model training from cache size, allowing it to adapt to arbitrary sizes. The key to our solution is a novel segmented labeling scheme that enables SLAP to precisely predict object reuse time. To further make SLAP a practical and efficient solution, we propose aggressive reusing of computation and training on sampled traces to optimize model training, and a specialized predictor architecture that overlaps prediction computation with miss object fetching to optimize model inference. Our experiments with production CDN traces show that SLAP achieves significantly lower write traffic (38%-59%), longer SSDs service life (104%-178%), a consistently higher hit rate (3.2%-11.7%), and requires no effort to adapt to changing cache sizes, outperforming existing policies.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116014884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SRC: Mitigate I/O Throughput Degradation in Network Congestion Control of Disaggregated Storage Systems
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00035
Danlin Jia, Yiming Xie, Li Wang, Xiaoqian Zhang, Allen Yang, Xuebin Yao, Mahsa Bayati, Pradeep Subedi, B. Sheng, N. Mi
The industry has adopted disaggregated storage systems to provide high-quality services for hyper-scale architectures. This infrastructure enables organizations to access storage resources that can be independently managed, configured, and scaled. It is supported by recent advances in all-flash arrays and the NVMe-over-Fabrics (NVMe-oF) protocol, which enables remote access to NVMe devices over different network fabrics. A surge of research has been proposed to mitigate network congestion in the traditional remote direct memory access (RDMA) protocol. However, NVMe-oF raises new challenges in congestion control for disaggregated storage systems. In this work, we investigate the degradation of read throughput on storage nodes caused by traditional network congestion control mechanisms. We design a storage-side rate control (SRC) to relieve network congestion while avoiding performance degradation on storage nodes. First, we design an I/O throughput control mechanism in the NVMe driver layer to enable throughput control on storage nodes. Second, we construct a throughput prediction model to learn a mapping function between workload characteristics and I/O throughput. Third, we deploy SRC on storage nodes to cooperate with traditional network congestion control in an NVMe-over-RDMA architecture. Finally, we evaluate SRC with varying workloads, SSD configurations, and network topologies. The experimental results show that SRC achieves significant performance improvement.
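The storage-side throttling idea can be sketched as a token-bucket limiter sitting in front of the read path; this is a generic Python stand-in, not the NVMe-driver mechanism or the throughput prediction model described in the paper.

```python
import time

class TokenBucket:
    """Generic token-bucket limiter: a stand-in for storage-side throughput
    control. Tokens (bytes) accrue at rate_bps per second, capped at
    burst_bytes; an I/O proceeds only when enough tokens are available."""
    def __init__(self, rate_bps, burst_bytes):
        self.rate, self.capacity = rate_bps, burst_bytes
        self.tokens, self.last = burst_bytes, time.monotonic()

    def submit(self, io_bytes):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= io_bytes:
            self.tokens -= io_bytes
            return True                 # issue the read immediately
        return False                    # defer: back off under congestion

limiter = TokenBucket(rate_bps=500e6, burst_bytes=4e6)   # ~500 MB/s target read rate
admitted = sum(limiter.submit(128 * 1024) for _ in range(100))
print(f"{admitted}/100 reads admitted immediately")
```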
{"title":"SRC: Mitigate I/O Throughput Degradation in Network Congestion Control of Disaggregated Storage Systems","authors":"Danlin Jia, Yiming Xie, Li Wang, Xiaoqian Zhang, Allen Yang, Xuebin Yao, Mahsa Bayati, Pradeep Subedi, B. Sheng, N. Mi","doi":"10.1109/IPDPS54959.2023.00035","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00035","url":null,"abstract":"The industry has adopted disaggregated storage systems to provide high-quality services for hyper-scale architectures. This infrastructure enables organizations to access storage resources that can be independently managed, configured, and scaled. It is supported by the recent advances of all-flash arrays and NVMe-over-Fabric protocol, enabling remote access to NVMe devices over different network fabrics. A surge of research has been proposed to mitigate network congestion in traditional remote direct memory access protocol (RDMA). However, NVMe-oF raises new challenges in congestion control for disaggregated storage systems.In this work, we investigate the performance degradation of the read throughput on storage nodes caused by traditional network congestion control mechanisms. We design a storage-side rate control (SRC) to relieve network congestion while avoiding performance degradation on storage nodes. First, we design an I/O throughput control mechanism in the NVMe driver layer to enable throughput control on storage nodes. Second, we construct a throughput prediction model to learn a mapping function between workload characteristics and I/O throughput. Third, we deploy SRC on storage nodes to cooperate with traditional network congestion control on an NVMe-over-RDMA architecture. Finally, we evaluate SRC with varying workloads, SSD configurations, and network topologies. The experimental results show that SRC achieves significant performance improvement.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116099795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}