ArkFS: A Distributed File System on Object Storage for Archiving Data in HPC Environment
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00038
Kyu-Jin Cho, Injae Kang, Jin-Soo Kim
As burst buffers are widely deployed in HPC (High-Performance Computing) systems, the distributed file system layer is taking on the role of campaign storage, where scalability and cost-effectiveness are of paramount importance. However, centralized metadata management in the distributed file system layer poses a scalability challenge. Object storage systems have emerged as an alternative thanks to their simplified interface and scale-out architecture. Despite this, HPC communities are used to working with the POSIX interface to organize their files into a global directory hierarchy and to control access through access control lists. In this paper, we present ArkFS, a near-POSIX-compliant and scalable distributed file system implemented on top of an object storage system. ArkFS achieves high scalability without any centralized metadata servers; instead, it lets each client manage a portion of the file system metadata on a per-directory basis. ArkFS supports any distributed object storage system, such as Ceph RADOS or an S3-compatible system, given an appropriate API translation module. Our experimental results indicate that ArkFS delivers significant performance improvements under metadata-intensive workloads while showing near-linear scalability. We also demonstrate that ArkFS is well suited to handling the bursty I/O traffic coming from the burst buffer layer to archive cold data.
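The abstract includes no code, but the per-directory metadata idea can be illustrated with a minimal Python sketch, assuming a toy key-value object store: each directory's metadata lives in its own object, owned and updated by the client that manages that directory. The class names and the JSON layout below are hypothetical illustrations, not ArkFS's actual API.

```python
import json

class ObjectStore:
    """Toy stand-in for a distributed object store (e.g., Ceph RADOS or S3)."""
    def __init__(self):
        self._objects = {}

    def put(self, key, data: bytes):
        self._objects[key] = data

    def get(self, key) -> bytes:
        return self._objects[key]

class DirMetadata:
    """Hypothetical per-directory metadata record managed by a single client."""
    def __init__(self, store: ObjectStore, dir_path: str):
        self.store = store
        self.key = f"meta:{dir_path}"      # one metadata object per directory
        self.entries = {}                  # name -> attributes (mode, size, ...)

    def create_file(self, name, mode=0o644):
        self.entries[name] = {"mode": mode, "size": 0}
        self._flush()                      # persist metadata without any central server

    def _flush(self):
        self.store.put(self.key, json.dumps(self.entries).encode())

store = ObjectStore()
home = DirMetadata(store, "/home/alice")
home.create_file("results.dat")
print(json.loads(store.get("meta:/home/alice")))
```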
{"title":"ArkFS: A Distributed File System on Object Storage for Archiving Data in HPC Environment","authors":"Kyu-Jin Cho, Injae Kang, Jin-Soo Kim","doi":"10.1109/IPDPS54959.2023.00038","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00038","url":null,"abstract":"As the burst buffer is being widely deployed in the HPC (High-Performance Computing) systems, the distributed file system layer is taking the role of campaign storage where scalability and cost-effectiveness are of paramount importance. However, the centralized metadata management in the distributed file system layer poses a scalability challenge. The object storage system has emerged as an alternative thanks to its simplified interface and scale-out architecture. Despite this, the HPC communities are used to working with the POSIX interface to organize their files into a global directory hierarchy and control access through access control lists.In this paper, we present ArkFS, a near-POSIX compliant and scalable distributed file system implemented on top of the object storage system. ArkFS achieves high scalability without any centralized metadata servers. Instead, ArkFS lets each client manage a portion of the file system metadata on a per-directory basis. ArkFS supports any distributed object storage system such as Ceph RADOS or S3-compatible system with an appropriate API translation module. Our experimental results indicate that ArkFS shows significant performance improvement under metadata-intensive workloads while showing near-linear scalability. We also demonstrate that ArkFS is suitable for handling the bursty I/O traffic coming from the burst buffer layer to archive cold data.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126909430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present GraphTensor, a comprehensive open-source framework that supports efficient parallel neural network processing on large graphs. GraphTensor offers a set of easy-to-use programming primitives that account for both graph and neural network execution behaviors from the beginning (graph sampling) to the end (dense data processing). Our framework runs diverse graph neural network (GNN) models in a destination-centric, feature-wise manner, which can significantly shorten training execution times on a GPU. In addition, GraphTensor rearranges multiple GNN kernels based on their system hyperparameters in a self-governing manner, further reducing the processing dimensionality and latencies. From the end-to-end execution viewpoint, GraphTensor significantly shortens service-level GNN latency by applying pipeline parallelism to graph dataset preprocessing. Our evaluation shows that GraphTensor exhibits 1.4× better training performance than emerging GNN frameworks when executing large-scale, real-world graph workloads. For end-to-end services, GraphTensor reduces the training latencies of an advanced version of these GNN frameworks (optimized for multi-threaded graph sampling) by 2.4×, on average.
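As a rough illustration of the destination-centric, feature-wise aggregation described above, the following NumPy sketch groups neighbor features by destination node before a dense transform. The function name and the scatter-add formulation are assumptions made for illustration, not GraphTensor's actual primitives.

```python
import numpy as np

# Toy graph: directed edge list (src, dst) and a node feature matrix.
edges = np.array([[0, 1], [2, 1], [1, 3], [0, 3]])   # 4 directed edges
feats = np.random.rand(4, 8)                         # 4 nodes, 8 features each

def destination_centric_aggregate(edges, feats):
    """Sum neighbor features per destination node (mean/max are analogous).
    Illustrative only; GraphTensor additionally fuses this with the dense
    stage and pipelines graph sampling with preprocessing."""
    out = np.zeros_like(feats)
    src, dst = edges[:, 0], edges[:, 1]
    np.add.at(out, dst, feats[src])      # scatter-add grouped by destination node
    return out

agg = destination_centric_aggregate(edges, feats)
hidden = agg @ np.random.rand(8, 16)     # dense stage: linear transform of aggregated features
print(hidden.shape)                      # (4, 16)
```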
{"title":"GraphTensor: Comprehensive GNN-Acceleration Framework for Efficient Parallel Processing of Massive Datasets","authors":"Junhyeok Jang, Miryeong Kwon, Donghyun Gouk, Hanyeoreum Bae, Myoungsoo Jung","doi":"10.1109/IPDPS54959.2023.00011","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00011","url":null,"abstract":"We present GraphTensor, a comprehensive open-source framework that supports efficient parallel neural network processing on large graphs. GraphTensor offers a set of easy-to-use programming primitives that appreciate both graph and neural network execution behaviors from the beginning (graph sampling) to the end (dense data processing). Our framework runs diverse graph neural network (GNN) models in a destination-centric, feature-wise manner, which can significantly shorten training execution times in a GPU. In addition, GraphTensor rearranges multiple GNN kernels based on their system hyperparameters in a self-governing manner, thereby reducing the processing dimensionality and the latencies further. From the end-to-end execution viewpoint, GraphTensor significantly shortens the service-level GNN latency by applying pipeline parallelism for efficient graph dataset preprocessing. Our evaluation shows that GraphTensor exhibits 1.4× better training performance than emerging GNN frameworks under the execution of large-scale, real-world graph workloads. For the end-to-end services, GraphTensor reduces training latencies of an advanced version of the GNN frameworks (optimized for multi-threaded graph sampling) by 2.4×, on average.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"183 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121835356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TurboHE: Accelerating Fully Homomorphic Encryption Using FPGA Clusters
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00084
Haohao Liao, Mahmoud A. Elmohr, Xuan Dong, Yanjun Qian, Wenzhe Yang, Zhiwei Shang, Yin Tan
With the burgeoning demand for cloud computing in various fields and rising attention to sensitive data exposure, Fully Homomorphic Encryption (FHE) is gaining popularity as a potential solution for privacy protection. By performing computations directly on the ciphertext (encrypted data) without decrypting it, FHE can guarantee the security of data throughout its lifecycle without compromising privacy. However, the excruciatingly slow speed of FHE schemes makes adopting them impractical in real-life applications, so hardware accelerators are needed to mitigate the problem. Among various hardware platforms, FPGA clusters are particularly promising because of their flexibility and ready availability at many cloud providers as FPGA-as-a-Service (FaaS); reusing this existing infrastructure can greatly facilitate the deployment of FHE on the cloud. In this paper, we present TurboHE, the first hardware accelerator for FHE operations based on an FPGA cluster. TurboHE aims to boost the performance of CKKS, one of the fastest FHE schemes and the one most suitable for machine learning applications, by accelerating its computationally intensive and frequently used operation: relinearization. The proposed scalable architecture, based on hardware partitioning, can be easily configured to accommodate high acceleration requirements for relinearization with very large CKKS parameters. As a demonstration, an implementation supporting 32,768 polynomial coefficients and a coefficient bitwidth of 594, decomposed into 11 Residue Number System (RNS) components, was deployed on a cluster of 9 Xilinx VU13P FPGAs. The cluster operated at 200 MHz and achieved 1096× the throughput of a single-threaded CPU implementation. Moreover, the low-level hardware components implemented in this work, such as the NTT module, can also be applied to accelerate other lattice-based cryptography schemes.
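The RNS decomposition mentioned above (a 594-bit coefficient split into 11 word-sized residues) can be illustrated with a small Python sketch using toy moduli; the moduli and helper names are placeholders, not TurboHE's actual parameters.

```python
from math import prod

# Small pairwise-coprime moduli stand in for the 11 word-sized RNS primes
# used in the 594-bit coefficient decomposition.
moduli = [97, 101, 103]

def to_rns(x, moduli):
    """Decompose an integer into its residues (one limb per modulus)."""
    return [x % q for q in moduli]

def from_rns(residues, moduli):
    """Reconstruct the integer with the Chinese Remainder Theorem."""
    M = prod(moduli)
    x = 0
    for r, q in zip(residues, moduli):
        Mq = M // q
        x += r * Mq * pow(Mq, -1, q)     # pow(..., -1, q) is the modular inverse
    return x % M

coeff = 123456
limbs = to_rns(coeff, moduli)
assert from_rns(limbs, moduli) == coeff  # residue-wise arithmetic stays exact below M
print(limbs)
```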
{"title":"TurboHE: Accelerating Fully Homomorphic Encryption Using FPGA Clusters","authors":"Haohao Liao, Mahmoud A. Elmohr, Xuan Dong, Yanjun Qian, Wenzhe Yang, Zhiwei Shang, Yin Tan","doi":"10.1109/IPDPS54959.2023.00084","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00084","url":null,"abstract":"With the burgeoning demands for cloud computing in various fields followed by the rising attention to sensitive data exposure, Fully Homomorphic Encryption (FHE) is gaining popularity as a potential solution to privacy protection. By performing computations directly on the ciphertext (encrypted data) without decrypting it, FHE can guarantee the security of data throughout its lifecycle without compromising the privacy. However, the excruciatingly slow speed of FHE scheme makes adopting it impractical in real life applications. Therefore, hardware accelerators come to the rescue to mitigate the problem. Among various hardware platforms, FPGA clusters are particularly promising because of their flexibility and ready availability at many cloud providers such as FPGA-as-a-Service (FaaS). Hence, reusing the existing infrastructure can greatly facilitate the implementation of FHE on the cloud.In this paper, we present TurboHE, the first hardware accelerator for FHE operations based on an FPGA cluster. TurboHE aims to boost the performance of CKKS, one of the fastest FHE schemes which is most suitable to machine learning applications, by accelerating its computationally intensive and frequently used operation: relinearization. The proposed scalable architecture based on hardware partitioning can be easily configured to accommodate high acceleration requirements for relinearization with very large CKKS parameters. As a demonstration, an implementation, which supports 32,768 polynomial coefficients and a coefficient bitwidth of 594 decomposed into 11 Residue Number System (RNS) components, was deployed on a cluster consisting of 9 Xilinx VU13P FPGAs. The cluster operated at 200 MHz and achieved 1096 times throughput compared with a single threaded CPU implementation. Moreover, the low level hardware components implemented in this work such as the NTT module can also be applied to accelerate other lattice-based cryptography schemes.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132871307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Adaptive Hybrid Quantum Algorithm for the Metric Traveling Salesman Problem
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00082
Fei Li, A. Mazumder
In this paper, we design, analyze, and evaluate a hybrid quantum algorithm for the metric traveling salesman problem (TSP). TSP is a well-studied NP-complete problem for which many algorithmic techniques have been developed, on both classical and quantum computers. Existing algorithms for TSP are neither adaptive to the input data nor suitable for processing medium-size data on modern classical and quantum machines. In this work, we leverage the classical computers' strength (large memory) and the quantum computers' strength (quantum parallelism), based on the input data, to reduce the hybrid algorithm's overall running time. Our algorithmic ideas include trimming the input data efficiently using a classical algorithm, finding an optimal solution for the post-processed data using a quantum-only algorithm, and constructing an optimal solution for the untrimmed input efficiently using a classical algorithm. We conduct experiments comparing our hybrid algorithm against state-of-the-art classical and quantum algorithms on real data sets. The experimental results show that our solution outperforms the others, confirming our theoretical analysis. This work provides insightful quantitative tools for practitioners and compilers to choose appropriate quantum, classical, or hybrid algorithms, especially in the NISQ (noisy intermediate-scale quantum) era, for NP-complete problems such as TSP.
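The three-stage structure described above (classical trimming, exact solve on the reduced instance, classical reconstruction) can be sketched as follows, with a brute-force solver standing in for the quantum subroutine. Every function here is a hypothetical placeholder rather than the paper's algorithm.

```python
from itertools import permutations

def trim(points, keep):
    """Classical pre-processing stand-in: keep a subset of cities.
    The paper's trimming is data-dependent; this placeholder just truncates."""
    return points[:keep]

def solve_exact(points):
    """Stand-in for the quantum subroutine: brute-force the reduced instance."""
    n = len(points)
    dist = lambda a, b: ((a[0] - b[0])**2 + (a[1] - b[1])**2) ** 0.5
    best_tour, best_len = None, float("inf")
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        length = sum(dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len

cities = [(0, 0), (1, 0), (1, 1), (0, 1), (2, 2)]
reduced = trim(cities, keep=4)
tour, length = solve_exact(reduced)   # classical post-processing would re-insert trimmed cities
print(tour, round(length, 3))
```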
{"title":"An Adaptive Hybrid Quantum Algorithm for the Metric Traveling Salesman Problem","authors":"Fei Li, A. Mazumder","doi":"10.1109/IPDPS54959.2023.00082","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00082","url":null,"abstract":"In this paper, we design, analyze, and evaluate a hybrid quantum algorithm for the metric traveling salesman problem (TSP). TSP is a well-studied NP-complete problem that many algorithmic techniques have been developed for, on both classic computers and quantum computers. The existing literature of algorithms for TSP are neither adaptive to input data nor suitable for processing medium-size data on the modern classic and quantum machines. In this work, we leverage the classic computers’ power (large memory) and the quantum computers’ power (quantum parallelism), based on the input data, to fasten the hybrid algorithm’s overall running time. Our algorithmic ideas include trimming the input data efficiently using a classic algorithm, finding an optimal solution for the post-processed data using a quantum-only algorithm, and constructing an optimal solution for the untrimmed data input efficiently using a classic algorithm. We conduct experiments to compare our hybrid algorithm against the state-of-the-art classic and quantum algorithms on real data sets. The experimental results show that our solution truly outperforms the others and thus confirm our theoretical analysis. This work provides insightful quantitative tools for people and compilers to choose appropriate quantum or classical or hybrid algorithms, especially in the NISQ (noisy intermediate-scale quantum) era, for NP-complete problems such as TSP.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130557955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Duo: Improving Data Sharing of Stateful Serverless Applications by Efficiently Caching Multi-Read Data
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00092
Zhuo Huang, Haoqiang Fan, Chaoyi Cheng, Song Wu, Hai Jin
A growing number of applications are moving to serverless architectures for high elasticity and fine-grained billing. For stateful applications, however, the use of serverless architectures is likely to lead to significant performance degradation, as frequent data sharing between different execution stages involves time-consuming remote storage access. Current platforms leverage memory caches to speed up remote access, but conventional caching strategies show limited performance improvement. We experimentally find that the reason is that current strategies overlook the stage-dependent access patterns of stateful serverless applications: data that are read multiple times across stages (denoted multi-read data) are wrongly evicted by data that are read only once (denoted read-once data), causing a high cache miss ratio. Accordingly, we propose a new caching strategy, Duo, whose design principle is to cache multi-read data as long as possible. Specifically, Duo contains a large cache list and a small cache list, which act as the Leader list and the Wingman list, respectively. The Leader list ignores data that are read for the first time, preventing it from being polluted by the massive read-once data at each stage. The Wingman list inspects data that are ignored or evicted by the Leader list and prefetches data that will probably be read again, based on the observation that multi-read data usually appear periodically in groups. Compared to state-of-the-art works, Duo improves the hit ratio by 1.1×–2.1× and reduces data sharing overhead by 25%–62%.
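A minimal sketch of the Leader/Wingman idea follows, assuming simple LRU lists: first-time reads land in the small Wingman list and are promoted to the Leader list only on a second read, so read-once data cannot pollute the Leader. Duo's group-based prefetching is omitted, and all names and sizes are illustrative.

```python
from collections import OrderedDict

class DuoLikeCache:
    """Toy two-list cache: a large Leader LRU that admits only re-read data,
    and a small Wingman LRU that catches first-time reads (illustrative only)."""
    def __init__(self, leader_size=4, wingman_size=2):
        self.leader = OrderedDict()
        self.wingman = OrderedDict()
        self.leader_size, self.wingman_size = leader_size, wingman_size

    def get(self, key):
        for lst in (self.leader, self.wingman):
            if key in lst:
                lst.move_to_end(key)       # refresh LRU position
                if lst is self.wingman:    # second read: promote to the Leader list
                    self._admit(self.leader, self.leader_size, key, lst.pop(key))
                return True                # hit
        return False                       # miss

    def put(self, key, value):
        # First-time reads go to the Wingman so read-once data cannot
        # pollute the Leader list.
        self._admit(self.wingman, self.wingman_size, key, value)

    @staticmethod
    def _admit(lst, cap, key, value):
        lst[key] = value
        if len(lst) > cap:
            lst.popitem(last=False)        # evict the LRU entry

cache = DuoLikeCache()
for obj in ["a", "b", "a", "c", "a"]:
    if not cache.get(obj):
        cache.put(obj, obj.upper())
print(list(cache.leader), list(cache.wingman))   # 'a' promoted after its second read
```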
{"title":"Duo: Improving Data Sharing of Stateful Serverless Applications by Efficiently Caching Multi-Read Data","authors":"Zhuo Huang, Haoqiang Fan, Chaoyi Cheng, Song Wu, Hai Jin","doi":"10.1109/IPDPS54959.2023.00092","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00092","url":null,"abstract":"A growing number of applications are moving to serverless architectures for high elasticity and fine-grained billing. For stateful applications, however, the use of serverless architectures is likely to lead to significant performance degradation, as frequent data sharing between different execution stages involves time-consuming remote storage access. Current platforms leverage memory cache to speed up remote access. However, conventional caching strategies show limited performance improvement. We experimentally find that the reason is that current strategies overlook the stage-dependent access patterns of stateful serverless applications, i.e., data that are read multiple times across stages (denoted as multi-read data) are wrongly evicted by data that are read only once (denoted as read-once data), causing a high cache miss ratio.Accordingly, we propose a new caching strategy, Duo, whose design principle is to cache multi-read data as long as possible. Specifically, Duo contains a large cache list and a small cache list, which act as Leader list and Wingman list, respectively. Leader list ignores the data that is read for the first time to prevent itself from being polluted by massive read-once data at each stage. Wingman list inspects the data that are ignored or evicted by Leader list, and pre-fetches the data that will probably be read again based on the observation that multi-read data usually appear periodically in groups. Compared to the state-of-the-art works, Duo improves hit ratio by 1.1×-2.1× and reduces the data sharing overhead by 25%-62%.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127778306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00023
Qinghua Zhou, Quentin G. Anthony, Lang Xu, A. Shafi, M. Abduljabbar, H. Subramoni, Dhabaleswar K. Panda
Fully Sharded Data Parallel (FSDP) technology achieves higher performance by scaling out data-parallel training of Deep Learning (DL) models. It shards the model parameters, gradients, and optimizer states among multiple GPUs. Consequently, it requires data-intensive Allgather and Reduce-Scatter communication to share the model parameters, which becomes a bottleneck. Existing schemes that use GPU-aware MPI libraries are highly prone to saturating the interconnect bandwidth, so integrating GPU-based compression into MPI libraries has proven effective for achieving faster training times. In this paper, we propose an optimized Ring algorithm for the Allgather and Reduce-Scatter collectives that incorporates an efficient collective-level online compression scheme. At the microbenchmark level, Allgather achieves benefits of up to 83.6% and 30.3% compared to the baseline and existing point-to-point-based compression, respectively, in a state-of-the-art MPI library on modern GPU clusters; Reduce-Scatter achieves 88.1% and 40.6%, respectively. For distributed DL training with PyTorch-FSDP, our approach yields 31.7% faster training than the baseline, and up to 12.5% faster than the existing point-to-point-based compression, while maintaining similar accuracy.
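To make the collective-level compression idea concrete, here is a sketch that simulates a ring Allgather over in-process "ranks", compressing each contribution once before it enters the ring and decompressing it at the receiver. zlib stands in for the paper's GPU-based compression, and the layout is an assumption, not the proposed MPI implementation.

```python
import zlib
import numpy as np

def ring_allgather_compressed(chunks):
    """Simulate a ring Allgather over len(chunks) ranks.
    Each rank contributes one chunk; chunks travel the ring in compressed
    form and are decompressed only where they are consumed."""
    p = len(chunks)
    # Each rank compresses its own contribution once, at collective level.
    send_buf = [zlib.compress(c.tobytes()) for c in chunks]
    gathered = [{rank: chunks[rank]} for rank in range(p)]
    for step in range(p - 1):
        recv_buf = [None] * p
        for rank in range(p):
            left = (rank - 1) % p
            recv_buf[rank] = send_buf[left]            # receive the left neighbor's buffer
            owner = (left - step) % p                  # whose chunk arrives at this step
            data = np.frombuffer(zlib.decompress(recv_buf[rank]), dtype=chunks[owner].dtype)
            gathered[rank][owner] = data
        send_buf = recv_buf                            # forward what was just received
    return gathered

parts = [np.full(4, r, dtype=np.float32) for r in range(4)]   # rank r contributes a chunk of r's
result = ring_allgather_compressed(parts)
print(sorted(result[2].keys()))                                # rank 2 now holds all 4 chunks
```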
{"title":"Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication","authors":"Qinghua Zhou, Quentin G. Anthony, Lang Xu, A. Shafi, M. Abduljabbar, H. Subramoni, Dhabaleswar K. Panda","doi":"10.1109/IPDPS54959.2023.00023","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00023","url":null,"abstract":"Fully Sharded Data Parallel (FSDP) technology achieves higher performance by scaling out data-parallel training of Deep Learning (DL) models. It shards the model parameters, gradients, and optimizer states of the model among multiple GPUs. Consequently, this requires data-intensive Allgather and Reduce-Scatter communication to share the model parameters, which becomes a bottleneck. Existing schemes that use GPU-aware MPI libraries are highly prone to saturating the interconnect bandwidth. Therefore, integrating GPU-based compression into MPI libraries has proven efficient to achieve faster training time. In this paper, we propose an optimized Ring algorithm of Allgather and Reduce-Scatter collectives that encompass an efficient collective-level online compression scheme. At the microbenchmark level, Allgather achieves benefits of up to 83.6% and 30.3% compared to the baseline and existing point-to-point-based compression in a state-of-the-art MPI library on modern GPU clusters. Reduce-Scatter achieves 88.1% and 40.6% compared to baseline and point-to-point compression, respectively. For distributed DL training with PyTorch-FSDP, our approach yields 31.7% faster training than the baseline, and up to 12.5% compared to the existing point-to-point-based compression while maintaining similar accuracy.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128058401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph Neural Networks (GNNs) have emerged as a powerful and popular machine learning model for numerous application domains. Each stage of a GNN requires an aggregation (sparse matrix-matrix multiplication) and a linear operation (dense matrix-matrix multiplication). Numerous efforts have addressed the development of distributed implementations for GNNs. Although efficient algorithms for distributed matrix multiplication are well known, the challenge here is the collective optimization of the sequences of distributed matrix-matrix multiplications required for GNNs, where many degrees of freedom also exist in the ordering of the component matrix-multiplication operations. This paper develops a new approach to distributed GNN, ReDistribution of Matrices (RDM), centered around communication-free distributed matrix multiplication enabled by matrix redistribution between GNN stages. While the approach is applicable to the numerous algorithmic variants of GNNs, the experimental evaluation focuses on GCN (Graph Convolutional Network), including both full-batch training and sampling-based training using GraphSAINT. Experimental evaluation with 2-layer and 3-layer GCNs, using 128 or 256 hidden features, across eight sparse datasets, on a multi-GPU system with 8 GPUs, shows that RDM attains a geometric-mean speedup of between 2× and 3.7× over two state-of-the-art multi-GPU GCN implementations, CAGNET and DGCL.
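A simplified NumPy sketch of the redistribution idea: the dense stage runs communication-free on row-partitioned features with replicated weights, and a redistribution step then reshuffles the distributed result into the layout the next stage wants. The partitioning choices and names are assumptions for illustration, not RDM's actual scheme.

```python
import numpy as np

P = 2                                               # number of simulated devices
H = np.arange(8 * 4, dtype=float).reshape(8, 4)     # node features: 8 nodes, 4 hidden dims
W = np.random.rand(4, 3)                            # dense weights, replicated on every device

# Dense stage: with H row-partitioned and W replicated, each device computes
# its block of H @ W with no communication at all.
row_parts = np.array_split(H, P, axis=0)
local_out = [h_local @ W for h_local in row_parts]
print(np.allclose(np.concatenate(local_out, axis=0), H @ W))   # True: no exchange was needed

def redistribute(blocks, parts=P):
    """Between stages, reshuffle the distributed matrix into the layout the
    next multiplication wants (here: row blocks -> column blocks). In RDM,
    this redistribution is where the communication happens."""
    full = np.concatenate(blocks, axis=0)
    return np.array_split(full, parts, axis=1)

col_parts = redistribute(local_out)                 # layout for the next (aggregation) stage
print([b.shape for b in col_parts])                 # e.g. [(8, 2), (8, 1)]
```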
{"title":"Communication Optimization for Distributed Execution of Graph Neural Networks","authors":"Süreyya Emre Kurt, Jinghua Yan, Aravind Sukumaran-Rajam, Prashant Pandey, P. Sadayappan","doi":"10.1109/IPDPS54959.2023.00058","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00058","url":null,"abstract":"Graph Neural Networks (GNNs) have emerged as a very powerful and popular machine learning model for numerous application domains. Each stage of a GNN requires an aggregation (sparse matrix-matrix multiplication) and a linear operation (dense matrix-matrix multiplication). Numerous efforts have addressed the development of distributed implementations for GNNs. Although efficient algorithms for distributed matrix multiplication are well known, the challenge here is the collective optimization of sequences of distributed matrix-matrix multiplications required for GNN, where many degrees of freedom also exist in the ordering of the component matrix-multiplication operations.This paper develops a new approach to distributed GNN, ReDistribution of Matrices (RDM), centered around communication-free distributed matrix-multiplication enabled by matrix redistribution between GNN stages. While the approach is applicable to the numerous algorithmic variants of GNN, the experimental evaluation focuses on GCN (Graph Convolutional Network), including both full-batch training as well as sampling-based training using GraphSAINT. Experimental evaluation with 2-layer and 3-layer GCN, using 128 or 256 hidden features, across eight sparse datasets, on a multi-GPU system with 8 GPUs shows that RDM attains a geometric mean speedup between 2× and 3.7× over two state-of-the-art multi-GPU GCN implementations, CAGNET and DGCL.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129667002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memory-aware Optimization for Sequences of Sparse Matrix-Vector Multiplications
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00046
Yichen Zhang, Shengguo Li, Fan Yuan, Dezun Dong, Xiaojian Yang, Tiejun Li, Z. Wang
This paper presents a novel approach to optimizing multiple invocations of a sparse matrix-vector multiplication (SpMV) kernel performed on the same sparse matrix A and dense vector x, such as Ax, A²x, ⋯, Aᵏx, and their linear combinations such as Ax + A²x. Such computations are frequently used in scientific applications for solving linear equations and in multi-grid methods. Existing SpMV optimization techniques typically focus on a single SpMV invocation and do not consider opportunities for optimization across a sequence of SpMV operations (SSpMV), leaving much room for performance improvement. Our work aims to bridge this performance gap. It achieves this by partitioning the sparse matrix into submatrices and devising a new computation pipeline that reduces memory accesses to the sparse matrix and exploits the data locality of the dense vector of SpMV. Additionally, we demonstrate how our approach can be integrated with parallelization schemes to further improve performance. We evaluate our approach on four distinct multi-core systems, including three ARM platforms and one Intel platform. Experimental results show that our techniques improve on the standard implementation and the highly optimized Intel Math Kernel Library (MKL) by a large margin.
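For reference, the SSpMV computation pattern itself (the baseline the paper optimizes) looks like the following SciPy sketch; each step re-reads the whole sparse matrix, which is exactly the memory traffic the proposed submatrix pipeline reduces. The helper name is illustrative.

```python
import numpy as np
import scipy.sparse as sp

def spmv_sequence(A, x, k):
    """Naive baseline for the SSpMV pattern: y_i = A^i x for i = 1..k.
    Each step streams all of A from memory again."""
    ys, y = [], x
    for _ in range(k):
        y = A @ y
        ys.append(y)
    return ys

rng = np.random.default_rng(0)
A = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
x = rng.standard_normal(1000)

y1, y2, y3 = spmv_sequence(A, x, k=3)
combo = y1 + y2                                   # linear combination Ax + A²x
print(np.allclose(combo, A @ x + A @ (A @ x)))    # True
```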
{"title":"Memory-aware Optimization for Sequences of Sparse Matrix-Vector Multiplications","authors":"Yichen Zhang, Shengguo Li, Fan Yuan, Dezun Dong, Xiaojian Yang, Tiejun Li, Z. Wang","doi":"10.1109/IPDPS54959.2023.00046","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00046","url":null,"abstract":"This paper presents a novel approach to optimize multiple invocations of a sparse matrix-vector multiplication (SpMV) kernel performed on the same sparse matrix A and dense vector x, like Ax, A2x, ⋯, Akx, and their linear combinations such as Ax + A2x. Such computations are frequently used in scientific applications for solving linear equations and in multi-grid methods. Existing SpMV optimization techniques typically focus on a single SpMV invocation and do not consider opportunities for optimization across a sequence of SpMV operations (SSpMV), leaving much room for performance improvement. Our work aims to bridge this performance gap. It achieve this by partitioning the sparse matrix into submatrices and devising a new computation pipeline that reduces memory access to the sparse matrix and exploits the data locality of the dense vector of SpMV. Additionally, we demonstrate how our approach can be integrated with parallelization schemes to further improve performance. We evaluate our approach on four distinct multi-core systems, including three ARM and one Intel platform. Experimental results show that our techniques improve the standard implementation and the highly-optimized Intel math kernel library (MKL) by a large margin.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129307949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SLAP: An Adaptive, Learned Admission Policy for Content Delivery Network Caching
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00053
Ke Liu, Kan Wu, Hua Wang, Ke Zhou, Ji Zhang, Cong Li
"Learned" admission policies have shown promise in improving Content Delivery Network (CDN) cache performance and lowering operational costs. Unfortunately, existing learned policies are optimized with a few fixed cache sizes while in reality, cache sizes often vary over time in an unpredictable manner. As a result, existing solutions cannot provide consistent benefits in production settings.We present SLAP, a learned CDN cache admission approach based on segmented object reuse time prediction. SLAP predicts an object’s reuse time range using the Long-Short-Term-Memory model and admits objects that will be reused (before eviction) given the current cache size. SLAP separates model training from cache size, allowing it to adapt to arbitrary sizes. The key to our solution is a novel segmented labeling scheme that enables SLAP to precisely predict object reuse time. To further make SLAP a practical and efficient solution, we propose aggressive reusing of computation and training on sampled traces to optimize model training, and a specialized predictor architecture that overlaps prediction computation with miss object fetching to optimize model inference. Our experiments with production CDN traces show that SLAP achieves significantly lower write traffic (38%-59%), longer SSDs service life (104%-178%), a consistently higher hit rate (3.2%-11.7%), and requires no effort to adapt to changing cache sizes, outperforming existing policies.
{"title":"SLAP: An Adaptive, Learned Admission Policy for Content Delivery Network Caching","authors":"Ke Liu, Kan Wu, Hua Wang, Ke Zhou, Ji Zhang, Cong Li","doi":"10.1109/IPDPS54959.2023.00053","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00053","url":null,"abstract":"\"Learned\" admission policies have shown promise in improving Content Delivery Network (CDN) cache performance and lowering operational costs. Unfortunately, existing learned policies are optimized with a few fixed cache sizes while in reality, cache sizes often vary over time in an unpredictable manner. As a result, existing solutions cannot provide consistent benefits in production settings.We present SLAP, a learned CDN cache admission approach based on segmented object reuse time prediction. SLAP predicts an object’s reuse time range using the Long-Short-Term-Memory model and admits objects that will be reused (before eviction) given the current cache size. SLAP separates model training from cache size, allowing it to adapt to arbitrary sizes. The key to our solution is a novel segmented labeling scheme that enables SLAP to precisely predict object reuse time. To further make SLAP a practical and efficient solution, we propose aggressive reusing of computation and training on sampled traces to optimize model training, and a specialized predictor architecture that overlaps prediction computation with miss object fetching to optimize model inference. Our experiments with production CDN traces show that SLAP achieves significantly lower write traffic (38%-59%), longer SSDs service life (104%-178%), a consistently higher hit rate (3.2%-11.7%), and requires no effort to adapt to changing cache sizes, outperforming existing policies.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116014884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SRC: Mitigate I/O Throughput Degradation in Network Congestion Control of Disaggregated Storage Systems
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00035
Danlin Jia, Yiming Xie, Li Wang, Xiaoqian Zhang, Allen Yang, Xuebin Yao, Mahsa Bayati, Pradeep Subedi, B. Sheng, N. Mi
The industry has adopted disaggregated storage systems to provide high-quality services for hyper-scale architectures. This infrastructure enables organizations to access storage resources that can be independently managed, configured, and scaled. It is supported by recent advances in all-flash arrays and the NVMe-over-Fabrics (NVMe-oF) protocol, which enables remote access to NVMe devices over different network fabrics. A surge of research has been proposed to mitigate network congestion in the traditional remote direct memory access (RDMA) protocol. However, NVMe-oF raises new challenges in congestion control for disaggregated storage systems. In this work, we investigate the degradation of read throughput on storage nodes caused by traditional network congestion control mechanisms. We design a storage-side rate control (SRC) to relieve network congestion while avoiding performance degradation on storage nodes. First, we design an I/O throughput control mechanism in the NVMe driver layer to enable throughput control on storage nodes. Second, we construct a throughput prediction model to learn a mapping function between workload characteristics and I/O throughput. Third, we deploy SRC on storage nodes to cooperate with traditional network congestion control in an NVMe-over-RDMA architecture. Finally, we evaluate SRC with varying workloads, SSD configurations, and network topologies. The experimental results show that SRC achieves significant performance improvement.
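The storage-side throttling idea can be sketched as a token-bucket limiter sitting in front of the read path; this is a generic Python stand-in, not the NVMe-driver mechanism or the throughput prediction model described in the paper.

```python
import time

class TokenBucket:
    """Generic token-bucket limiter: a stand-in for storage-side throughput
    control. Tokens (bytes) accrue at rate_bps per second, capped at
    burst_bytes; an I/O proceeds only when enough tokens are available."""
    def __init__(self, rate_bps, burst_bytes):
        self.rate, self.capacity = rate_bps, burst_bytes
        self.tokens, self.last = burst_bytes, time.monotonic()

    def submit(self, io_bytes):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= io_bytes:
            self.tokens -= io_bytes
            return True                 # issue the read immediately
        return False                    # defer: back off under congestion

limiter = TokenBucket(rate_bps=500e6, burst_bytes=4e6)   # ~500 MB/s target read rate
admitted = sum(limiter.submit(128 * 1024) for _ in range(100))
print(f"{admitted}/100 reads admitted immediately")
```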
{"title":"SRC: Mitigate I/O Throughput Degradation in Network Congestion Control of Disaggregated Storage Systems","authors":"Danlin Jia, Yiming Xie, Li Wang, Xiaoqian Zhang, Allen Yang, Xuebin Yao, Mahsa Bayati, Pradeep Subedi, B. Sheng, N. Mi","doi":"10.1109/IPDPS54959.2023.00035","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00035","url":null,"abstract":"The industry has adopted disaggregated storage systems to provide high-quality services for hyper-scale architectures. This infrastructure enables organizations to access storage resources that can be independently managed, configured, and scaled. It is supported by the recent advances of all-flash arrays and NVMe-over-Fabric protocol, enabling remote access to NVMe devices over different network fabrics. A surge of research has been proposed to mitigate network congestion in traditional remote direct memory access protocol (RDMA). However, NVMe-oF raises new challenges in congestion control for disaggregated storage systems.In this work, we investigate the performance degradation of the read throughput on storage nodes caused by traditional network congestion control mechanisms. We design a storage-side rate control (SRC) to relieve network congestion while avoiding performance degradation on storage nodes. First, we design an I/O throughput control mechanism in the NVMe driver layer to enable throughput control on storage nodes. Second, we construct a throughput prediction model to learn a mapping function between workload characteristics and I/O throughput. Third, we deploy SRC on storage nodes to cooperate with traditional network congestion control on an NVMe-over-RDMA architecture. Finally, we evaluate SRC with varying workloads, SSD configurations, and network topologies. The experimental results show that SRC achieves significant performance improvement.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116099795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}