Evaluation of an OPENMP Parallelization of Lucas-Kanade on a NUMA-Manycore
Olfa Haggui, C. Tadonki, F. Sayadi, B. Ouni
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645936
The Lucas-Kanade algorithm is a well-known optical flow estimator widely used in image processing for motion detection and object tracking. As a typical image processing algorithm, the procedure is a series of convolution masks followed by 2×2 linear systems that yield the optical flow vectors. Since each stage of the algorithm is a stencil computation, the overhead of memory accesses is expected to be a serious scalability bottleneck, especially on a NUMA manycore configuration. The objective of this study is therefore to investigate an OpenMP parallelization of the Lucas-Kanade algorithm on a NUMA manycore, including the performance impact of NUMA-aware settings at runtime. Experimental results on a dual-socket Intel Broadwell-EP are provided together with the corresponding technical discussion.
{"title":"Evaluation of an OPENMP Parallelization of Lucas-Kanade on a NUMA-Manycore","authors":"Olfa Haggui, C. Tadonki, F. Sayadi, B. Ouni","doi":"10.1109/CAHPC.2018.8645936","DOIUrl":"https://doi.org/10.1109/CAHPC.2018.8645936","url":null,"abstract":"Lucas-Kanade algorithm is a well-known optical flow estimator widely used in image processing for motion detection and object tracking. As a typical image processing algorithm, the procedure is a series of convolution masks followed by 2×2 linear systems for the optical flow vectors. Since we are dealing with a stencil computation for each stage of the algorithm, the overhead from memory accesses is expected to stand as a serious scalability bottleneck, especially on a NUMA manycore configuration. The objective of this study is therefore to investigate an openMP parallelization of Lucas-kanade algorithm on a NUMA manycore, including the performance impact of NUMA-aware settings at runtime. Experimental results on a dual-socket INTEL Broadwell-EIEP is provided together with the corresponding technical discussions.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131747953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerating Deep Neural Network Training for Action Recognition on a Cluster of GPUs
Guojing Cong, Giacomo Domeniconi, Joshua Shapiro, Fan Zhou, Barry Y. Chen
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645861
Due to the additional temporal dimension, large-scale video action recognition is even more challenging than image recognition and typically takes days to train on modern GPUs, even for modest-sized datasets. We propose algorithms and techniques to accelerate the training of deep neural networks for action recognition on a cluster of GPUs. In terms of convergence and scaling, our distributed training algorithm with adaptive batch size is provably superior to popular asynchronous stochastic gradient descent algorithms. The convergence analysis of our algorithm shows that it is possible to reduce communication cost while minimizing the number of iterations needed for convergence. We customize the Adam optimizer for our distributed algorithm to improve efficiency. In addition, we employ transfer learning to further reduce training time while improving validation accuracy. Compared with the baseline single-GPU stochastic gradient descent implementation of the two-stream training approach, our implementation achieves super-linear speedups on 16 GPUs while improving validation accuracy. For the UCF101 and HMDB51 datasets, the validation accuracies achieved are 93.1% and 67.9%, respectively. To the best of our knowledge, these are the highest accuracies achieved with the two-stream approach without computationally expensive 3D convolutions or pretraining on much larger datasets.
{"title":"Accelerating Deep Neural Network Training for Action Recognition on a Cluster of GPUs","authors":"Guojing Cong, Giacomo Domeniconi, Joshua Shapiro, Fan Zhou, Barry Y. Chen","doi":"10.1109/CAHPC.2018.8645861","DOIUrl":"https://doi.org/10.1109/CAHPC.2018.8645861","url":null,"abstract":"Due to the additional temporal dimension, large-scale video action recognition is even more challenging than image recognition and typically takes days to train on modern GPUs even for modest-sized datasets. We propose algorithms and techniques to accelerate training of deep neural networks for action recognition on a cluster of GPUs. In terms of convergence and scaling, our distributed training algorithm with adaptive batch size is provably superior to popular asynchronous stochastic gradient descent algorithms. The convergence analysis of our algorithm shows it is possible to reduce communication cost and at the same time minimize the number of iterations needed for convergence. We customize the Adam optimizer for our distributed algorithm to improve efficiency. In addition, we employ transfer-learning to further reduce training time while improving validation accuracy. Compared with the base-line single-GPU stochastic gradient descent implementation of the two-stream training approach, our implementation achieves super-linear speedups on 16 GPUs while improving validation accuracy. For the UCFI0l and HMDB51 datasets, the validation accuracies achieved are 93.1 % and 67.9% respectively. As far as we know, these are the highest accuracies achieved with the two-stream approach that does not involve computationally expensive 3D convolutions or pretraining on much larger datasets.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133396245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting Compute Caches for Memory Bound Vector Operations
João Vieira, N. Roma, P. Tomás, P. Ienne, G. F. P. Fernandes
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645905
To reduce the average memory access time, most current processors use a multilevel cache subsystem. However, despite the proven throughput benefits of such cache structures, conventional operations such as copies, simple maps, and reductions still require moving large amounts of data to the processing cores. This imposes significant energy and performance overheads, with most of the execution time spent moving data across the memory hierarchy. To mitigate this problem, a Cache Compute System (CCS) targeting memory-bound kernels such as map and reduce operations is proposed. The CCS takes advantage of long cache lines and data locality to avoid data transfers to the processor, and it exploits the intrinsic parallelism of vector compute units to accelerate a set of 48 operations commonly used in map and reduce patterns. The CCS was validated by integrating it with an MB-Lite soft-core on a Xilinx Virtex-7 VC709 development board. Compared to the MB-Lite core, the proposed CCS delivers performance improvements on the supported commands ranging from 4x to 408x, and energy efficiency gains from 6x to 328x.
{"title":"Exploiting Compute Caches for Memory Bound Vector Operations","authors":"João Vieira, N. Roma, P. Tomás, P. Ienne, G. F. P. Fernandes","doi":"10.1109/CAHPC.2018.8645905","DOIUrl":"https://doi.org/10.1109/CAHPC.2018.8645905","url":null,"abstract":"To reduce the average memory access time, most current processors make use of a multilevel cache subsystem. However, despite the proven benefits of such cache structures in the resulting throughput, conventional operations such as copy, simple maps and reductions still require moving large amounts of data to the processing cores. This imposes significant energy and performance overheads, with most of the execution time being spent moving data across the memory hierarchy. To mitigate this problem, a Cache Compute System (CCS) that targets memory-bound kernels such as map and reduce operations is proposed. The developed CCS takes advantage of long cache lines and data locality to avoid data transfers to the processor and exploits the intrinsic parallelism of vector compute units to accelerate a set of 48 operations commonly used in map and reduce patterns. The CCS was validated by integrating it with an MB-Lite soft-core in a Xilinx Virtex-7 VC709 Development Board. When compared to the MB-Lite core, the proposed CCS presents performance improvements in the execution of the commands ranging from 4x to 408x, and energy efficiency gains from 6x to 328x.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"74 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122695635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automated GPU Grid Geometry Selection for OPENMP Kernels
T. Lloyd, Artem Chikin, Sanket Kedia, D. Jain, J. N. Amaral
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645848
Modern supercomputers increasingly use GPUs to improve performance per watt. Generating GPU code for target regions in OpenMP 4.0 or later requires selecting a grid geometry to execute the GPU kernel. Existing industrial-strength compilers use a simple heuristic with arbitrary constants that are the same for all kernels. After characterizing the relationship between region features, grid geometry, and performance, we built a machine-learning model that successfully predicts a suitable geometry for such kernels, yielding a performance improvement with a geometric mean of 5% across the benchmarks studied. However, this prediction is impractical because the overhead of the predictor is too high. A careful study of the predictor's results allowed the development of a practical low-overhead heuristic that yields performance improvements of up to 7x, with a geometric mean of 25.9%. This paper describes the methodology used to build the machine-learning model, and the practical low-overhead heuristic that can be used in industrial-strength compilers.
{"title":"Automated GPU Grid Geometry Selection for OPENMP Kernels","authors":"T. Lloyd, Artem Chikin, Sanket Kedia, D. Jain, J. N. Amaral","doi":"10.1109/CAHPC.2018.8645848","DOIUrl":"https://doi.org/10.1109/CAHPC.2018.8645848","url":null,"abstract":"Modern supercomputers are increasingly using GPUs to improve performance per watt. Generating GPU code for target regions in openMP 4.0, or later versions, requires the selection of grid geometry to execute the GPU kernel. Existing industrial-strength compilers use a simple heuristic with arbitrary numbers that are constant for all kernels. After characterizing the relationship between region features, grid geometry and performance, we built a machine-learning model that successfully predicts a suitable geometry for such kernels and results in a performance improvement with a geometric mean of 5% across the benchmarks studied. However, this prediction is impractical because the overhead of the predictor is too high. A careful study of the results of the predictor allowed for the development of a practical low-overhead heuristic that resulted in a performance improvement of up to 7 times with a geometric mean of 25.9%. This paper describes the methodology to build the machine-learning model, and the practical low-overhead heuristic that can be used in industry-strong compilers.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"234 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127521516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-Performance Ensembles of Online Sequential Extreme Learning Machine for Regression and Time Series Forecasting
Luis Fernando L. Grim, A. Gradvohl
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645863
Ensembles of the Online Sequential Extreme Learning Machine (OS-ELM) algorithm are well suited to forecasting data streams with concept drifts. Nevertheless, data-stream forecasting requires high-performance implementations due to the high rate of incoming samples. In this work, we tune up three ensembles that operate with the Online Sequential Extreme Learning Machine, using high-performance techniques. We reimplemented them in the C programming language with the Intel MKL and MPI libraries. Intel MKL provides functions that exploit the multithreading features of multicore CPUs, which extends the parallelism to multiprocessor architectures. MPI allows us to parallelize tasks with distributed memory across several processes, which can be allocated within a single computational node or spread over several nodes. In summary, our proposal is a two-level parallelization: each ensemble model is allocated to an MPI process, and the internal functions of each model are parallelized over a set of threads through Intel MKL. The objective of this work is to verify whether our proposals provide a significant improvement in execution time compared to the respective conventional serial approaches. For the experiments, we used one synthetic and one real dataset. Experimental results showed that, in general, the high-performance ensembles improve execution time compared with their serial versions, running up to 10 times faster.
{"title":"High-Performance Ensembles of Online Sequential Extreme Learning Machine for Regression and Time Series Forecasting","authors":"Luis Fernando L. Grim, A. Gradvohl","doi":"10.1109/CAHPC.2018.8645863","DOIUrl":"https://doi.org/10.1109/CAHPC.2018.8645863","url":null,"abstract":"Ensembles of Online Sequential Extreme Learning Machine algorithm are suitable for forecasting Data Streams with Concept Drifts. Nevertheless, data streams forecasting require high-performance implementations due to the high incoming samples rate. In this work, we proposed to tune-up three ensembles, which operates with the Online Sequential Extreme Learning Machine, using high-performance techniques. We reim-plemented them in the C programming language with Intel MKL and MPI libraries. The Intel MKL provides functions that explore the multithread features in multicore CPUs, which expands the parallelism to multiprocessors architectures. The MPI allows us to parallelize tasks with distributed memory on several processes, which can be allocated within a single computational node, or spread over several nodes. In summary, our proposal consists of a two-level parallelization, where we allocated each ensemble model into an MPI process, and we parallelized the internal functions of each model in a set of threads through Intel MKL. Thus, the objective of this work is to verify if our proposals provide a significant improvement in execution time when compared to the respective conventional serial approaches. For the experiments, we used a synthetic and a real dataset. Experimental results showed that, in general, the high-performance ensembles improve the execution time, when compared with its serial version, performing up to 10-fold faster.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"324 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116726066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frequency Selection Approach for Energy Aware Cloud Database
Chaopeng Guo, J. Pierson
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645884
Many cloud systems have been adopted in industry and academia to cope with the explosion of data volume in the big data era. Meanwhile, energy efficiency and energy saving have become major concerns for the data centers where massive cloud systems are deployed. However, energy waste is common due to resource over-provisioning. In this paper, a frequency selection approach based on Dynamic Voltage and Frequency Scaling (DVFS) is introduced to improve the energy efficiency of over-provisioned cloud systems. Within the approach, two algorithms are proposed: a Genetic Algorithm (GA) and a Monte Carlo Tree Search (MCTS). A cloud database system is taken as an example to evaluate the approach. The experiments show that the algorithms scale well, handling a 120-node case with high accuracy relative to optimal solutions (up to 99.9% for GA and 99.6% for MCTS). According to an optimality-bound analysis, at most 21% of energy can be saved using our frequency selection approach.
{"title":"Frequency Selection Approach for Energy Aware Cloud Database","authors":"Chaopeng Guo, J. Pierson","doi":"10.1109/CAHPC.2018.8645884","DOIUrl":"https://doi.org/10.1109/CAHPC.2018.8645884","url":null,"abstract":"A lot of cloud systems are adopted in industry and academia to face the explosion of the data volume and the arrival of the big data era. Meanwhile, energy efficiency and energy saving become major concerns for data centers where massive cloud systems are deployed. However, energy waste is quite common due to resource over-provisioning. In this paper, using Dynamic Voltage and Frequency Scaling (DVFS), a frequency selection approach is introduced to improve the energy efficiency of cloud systems in terms of resource over-provisioning. In the approach, two algorithms, Genetic Algorithm (GA) and Monte Carlo Tree Search Algorithm (MCTS), are proposed. Cloud database system is taken as an example to evaluate the approach. The results of the experiments show that the algorithms have great scalability which can be applied to a 120-nodes case with high accuracy compared to optimal solutions (up to 99.9% and 99.6% for GA and MCTS respectively). According to an optimality bound analysis, 21 % of energy can be saved at most using our frequency selection approach.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130690124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance Prediction of GPU-Based Deep Learning Applications
E. Gianniti, Li Zhang, D. Ardagna
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645908
Recent years have seen increasing success in the application of deep learning methods across various domains and problems, ranging from image recognition and classification to text processing and speech recognition. In this paper we propose and validate an approach to modeling the execution time of training convolutional neural networks (CNNs) deployed on GPGPUs. We demonstrate that our approach applies to a variety of CNN models and different types of GPGPUs with high accuracy, targeting the preliminary design phases for system sizing.
{"title":"Performance Prediction of GPU-Based Deep Learning Applications","authors":"E. Gianniti, Li Zhang, D. Ardagna","doi":"10.1109/CAHPC.2018.8645908","DOIUrl":"https://doi.org/10.1109/CAHPC.2018.8645908","url":null,"abstract":"Recent years saw an increasing success in the application of deep learning methods across various domains and for tackling different problems, ranging from image recognition and classification to text processing and speech recognition. In this paper we propose and validate an approach to model the execution time for training convolutional neural networks (CNNs) deployed on GPGPUs. We demonstrate that our approach is generally applicable to a variety of CNN models and different types of G PG PU s with high accuracy, aiming at the preliminary design phases for system sizing.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126870385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving the Performance of Fog Computing Through the Use of Data Locality
L. Steffenel
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645879
Fog computing extends the cloud computing paradigm to the edge of the network, providing a decentralized infrastructure in which services are distributed to the locations that best meet application needs such as low communication latency, data caching, or confidentiality. P2P-based platforms are good candidates to host fog computing, but they usually lack important capabilities such as controlling where data is stored and who handles the computing tasks. As a consequence, controlling where data is stored becomes as important as controlling who handles it. In this paper we propose several techniques to reinforce data locality in P2P-based middleware and study how these techniques can be implemented. Experimental results demonstrate the benefit of data locality for data-access performance.
{"title":"Improving the Performance of Fog Computing Through the Use of Data Locality","authors":"L. Steffenel","doi":"10.1109/CAHPC.2018.8645879","DOIUrl":"https://doi.org/10.1109/CAHPC.2018.8645879","url":null,"abstract":"Fog computing extends the Cloud Computing paradigm to the edge of the network, developing a decentralized infrastructure in which services are distributed to locations that best meet the needs of the applications such as low communication latency, data caching or confidentiality. P2P-based platforms are good candidates to host Fog computing, but they usually lack important elements such as controlling where the data is stored and who will handle the computing tasks. As a consequence, controlling where the data is stored becomes as important as controlling who handle it. In this paper we propose different techniques to reinforce data-locality for P2P-based middlewares, and study how these techniques can be implemented. Experimental results demonstrate the interest of data locality on the data access performances.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133626680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HyperSpace: Distributed Bayesian Hyperparameter Optimization
M. T. Young, Jacob Hinkle, A. Ramanathan, R. Kannan
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645954
As machine learning models continue to increase in complexity, so does the potential number of free model parameters, commonly known as hyperparameters. While there has been considerable progress toward finding optimal configurations of these hyperparameters, many optimization procedures are treated as black boxes. We believe optimization methods should not only return a set of optimized hyperparameters but also give insight into the effects of model hyperparameter settings. To this end, we present HyperSpace, a parallel implementation of Bayesian sequential model-based optimization. HyperSpace leverages high-performance computing (HPC) resources to better understand unknown, potentially non-convex hyperparameter search spaces. We show that it is possible to learn the dependencies between model hyperparameters through the optimization process. By partitioning large search spaces and running many optimization procedures in parallel, we also show that it is possible to discover families of good hyperparameter settings over a variety of models, including unsupervised clustering, regression, and classification tasks.
{"title":"HyperSpace: Distributed Bayesian Hyperparameter Optimization","authors":"M. T. Young, Jacob Hinkle, A. Ramanathan, R. Kannan","doi":"10.1109/CAHPC.2018.8645954","DOIUrl":"https://doi.org/10.1109/CAHPC.2018.8645954","url":null,"abstract":"As machine learning models continue to increase in complexity, so does the potential number of free model parameters commonly known as hyperparameters. While there has been considerable progress toward finding optimal configurations of these hyperparameters, many optimization procedures are treated as black boxes. We believe optimization methods should not only return a set of optimized hyperparameters, but also give insight into the effects of model hyperparameter settings. To this end, we present HyperSpace, a parallel implementation of Bayesian sequential model-based optimization. HyperSpace leverages high performance computing (HPC) resources to better understand unknown, potentially non-convex hyperparameter search spaces. We show that it is possible to learn the dependencies between model hyperparameters through the optimization process. By partitioning large search spaces and running many optimization procedures in parallel, we also show that it is possible to discover families of good hyperparameter settings over a variety of models including unsupervised clustering, regression, and classification tasks.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"107 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133754587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards a Single-Host Many-GPU System
Ming-Hung Chen, I. Chung, B. Abali, P. Crumley
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645874
As computation-intensive tasks such as deep learning and big data analysis take advantage of GPU-based accelerators, the interconnection links may become a bottleneck. In this paper, we investigate the emerging performance bottleneck of multi-accelerator systems as the number of accelerators attached to a single host grows. We instrumented the host PCIe fabric to measure data transfers and compared the measurements with those reported by a software tool. The results show how direct GPU-to-GPU (P2P) data transfer helps avoid the bottleneck on the interconnection links, yet multi-GPU performance still does not scale as expected because of host control messages. We quantify the impact of these control messages and suggest remedies for the scalability bottlenecks. We also implement the proposed strategy on LULESH to validate the concept. The results show our strategy saves 59.86% of kernel execution time and 13.32% of PCIe host-to-device (H2D) payload.
{"title":"Towards a Single-Host Many-GPU System","authors":"Ming-Hung Chen, I. Chung, B. Abali, P. Crumley","doi":"10.1109/CAHPC.2018.8645874","DOIUrl":"https://doi.org/10.1109/CAHPC.2018.8645874","url":null,"abstract":"As computation-intensive tasks such as deep learning and big data analysis take advantage of GPU based accelerators, the interconnection links may become a bottleneck. In this paper, we investigate the upcoming performance bottleneck of multi-accelerator systems, as the number of accelerators equipped with single host grows. We instrumented the host PCIe fabric to measure the data transfer and compared it with the measurements from the software tool. It shows how the data transfer (P2P) helps to avoid the bottleneck on the interconnection links, but multi-GPU performance does not scale up as expected due to the control messages. We quantify the impact of host control messages with suggestions to remedy scalability bottlenecks. We also implement the proposed strategy on Lulesh to validate the concept. The result shows our strategy can save 59.86% time cost of the kernel and 13.32% PCIe H2D payload.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121714866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}