2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD): Latest Publications

Evaluation of an OPENMP Parallelization of Lucas-Kanade on a NUMA-Manycore
Olfa Haggui, C. Tadonki, F. Sayadi, B. Ouni
The Lucas-Kanade algorithm is a well-known optical flow estimator widely used in image processing for motion detection and object tracking. As a typical image processing algorithm, the procedure consists of a series of convolution masks followed by 2×2 linear systems that yield the optical flow vectors. Since each stage of the algorithm is a stencil computation, the overhead of memory accesses is expected to be a serious scalability bottleneck, especially in a NUMA manycore configuration. The objective of this study is therefore to investigate an OpenMP parallelization of the Lucas-Kanade algorithm on a NUMA manycore, including the performance impact of NUMA-aware settings at runtime. Experimental results on a dual-socket Intel Broadwell-E/EP machine are provided, together with the corresponding technical discussion.
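The paper's source code is not included in this listing; the following is a minimal C/OpenMP sketch of the pattern the abstract describes: a 3×3 convolution stencil parallelized with a static schedule, preceded by first-touch initialization so that, on a NUMA machine, each page is allocated on the node of the thread that will later process it. The array layout and kernel are illustrative assumptions, not the authors' implementation.

```c
#include <stdlib.h>
#include <omp.h>

/* First-touch initialization: each thread touches the rows it will
 * later process, so the OS places those pages on the thread's NUMA node. */
void first_touch(float *img, int h, int w) {
    #pragma omp parallel for schedule(static)
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            img[y * w + x] = 0.0f;
}

/* One stencil stage: 3x3 convolution, with the same static schedule as
 * the initialization so threads keep working on their own pages. */
void convolve3x3(const float *in, float *out, const float k[3][3],
                 int h, int w) {
    #pragma omp parallel for schedule(static)
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++) {
            float acc = 0.0f;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    acc += k[dy + 1][dx + 1] * in[(y + dy) * w + (x + dx)];
            out[y * w + x] = acc;
        }
}
```

At runtime, NUMA-aware settings of the kind the abstract mentions would typically be applied through thread pinning, e.g. `OMP_PLACES=cores OMP_PROC_BIND=close`, or an explicit `numactl` memory policy.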
Citations: 0
Accelerating Deep Neural Network Training for Action Recognition on a Cluster of GPUs
Guojing Cong, Giacomo Domeniconi, Joshua Shapiro, Fan Zhou, Barry Y. Chen
Due to the additional temporal dimension, large-scale video action recognition is even more challenging than image recognition and typically takes days to train on modern GPUs, even for modest-sized datasets. We propose algorithms and techniques to accelerate training of deep neural networks for action recognition on a cluster of GPUs. In terms of convergence and scaling, our distributed training algorithm with adaptive batch size is provably superior to popular asynchronous stochastic gradient descent algorithms. The convergence analysis of our algorithm shows it is possible to reduce communication cost and at the same time minimize the number of iterations needed for convergence. We customize the Adam optimizer for our distributed algorithm to improve efficiency. In addition, we employ transfer learning to further reduce training time while improving validation accuracy. Compared with the baseline single-GPU stochastic gradient descent implementation of the two-stream training approach, our implementation achieves super-linear speedups on 16 GPUs while improving validation accuracy. For the UCF101 and HMDB51 datasets, the validation accuracies achieved are 93.1% and 67.9%, respectively. As far as we know, these are the highest accuracies achieved with the two-stream approach without computationally expensive 3D convolutions or pretraining on much larger datasets.
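The distributed algorithm itself is only summarized in the abstract; as a hedged illustration of the general shape of synchronous gradient averaging with a growing batch size, here is a C/MPI sketch. `compute_gradient`, the learning rate, and the doubling rule are placeholders, not the authors' method.

```c
#include <mpi.h>

#define NPARAMS 1024

/* Placeholder: accumulates the local mini-batch gradient into grad. */
extern void compute_gradient(float *grad, int batch_size);

void train(float *weights, int iterations) {
    float grad[NPARAMS], avg[NPARAMS];
    int nworkers, batch = 32;                /* illustrative initial size */
    MPI_Comm_size(MPI_COMM_WORLD, &nworkers);

    for (int it = 0; it < iterations; it++) {
        compute_gradient(grad, batch);

        /* Synchronous step: sum gradients across workers, then average. */
        MPI_Allreduce(grad, avg, NPARAMS, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        for (int i = 0; i < NPARAMS; i++)
            weights[i] -= 0.01f * avg[i] / (float)nworkers;

        /* Illustrative adaptive rule: grow the batch periodically so each
         * Allreduce is amortized over more samples as training stabilizes. */
        if (it % 100 == 99 && batch < 512)
            batch *= 2;
    }
}
```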
Citations: 3
Exploiting Compute Caches for Memory Bound Vector Operations
João Vieira, N. Roma, P. Tomás, P. Ienne, G. F. P. Fernandes
To reduce the average memory access time, most current processors use a multilevel cache subsystem. However, despite the proven throughput benefits of such cache structures, conventional operations such as copies, simple maps, and reductions still require moving large amounts of data to the processing cores. This imposes significant energy and performance overheads, with most of the execution time spent moving data across the memory hierarchy. To mitigate this problem, a Cache Compute System (CCS) that targets memory-bound kernels such as map and reduce operations is proposed. The developed CCS takes advantage of long cache lines and data locality to avoid data transfers to the processor, and it exploits the intrinsic parallelism of vector compute units to accelerate a set of 48 operations commonly used in map and reduce patterns. The CCS was validated by integrating it with an MB-Lite soft-core on a Xilinx Virtex-7 VC709 Development Board. Compared to the MB-Lite core alone, the proposed CCS delivers performance improvements ranging from 4x to 408x and energy efficiency gains from 6x to 328x.
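For context, these are the kinds of memory-bound kernels the CCS targets; the sketch below shows a map (AXPY-style) and a reduction in plain C. On a conventional core, both stream every element through the cache hierarchy and back to the core, which is exactly the traffic the CCS aims to eliminate. The code illustrates the kernel class, not the CCS interface.

```c
#include <stddef.h>

/* Map: y[i] = a * x[i] + y[i]. One load-compute-store pass over memory;
 * the arithmetic intensity is far too low to hide the memory traffic. */
void saxpy(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Reduce: sum of all elements. Again a single streaming pass, bound
 * by memory bandwidth rather than compute throughput. */
float sum(const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}
```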
Citations: 5
Automated GPU Grid Geometry Selection for OPENMP Kernels
T. Lloyd, Artem Chikin, Sanket Kedia, D. Jain, J. N. Amaral
Modern supercomputers are increasingly using GPUs to improve performance per watt. Generating GPU code for target regions in OpenMP 4.0, or later versions, requires the selection of a grid geometry to execute the GPU kernel. Existing industrial-strength compilers use a simple heuristic with arbitrary numbers that are constant for all kernels. After characterizing the relationship between region features, grid geometry, and performance, we built a machine-learning model that successfully predicts a suitable geometry for such kernels and yields a performance improvement with a geometric mean of 5% across the benchmarks studied. However, this prediction is impractical because the overhead of the predictor is too high. A careful study of the predictor's results allowed for the development of a practical low-overhead heuristic that yields a performance improvement of up to 7 times, with a geometric mean of 25.9%. This paper describes the methodology used to build the machine-learning model and the practical low-overhead heuristic that can be used in industrial-strength compilers.
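The paper's heuristic is not reproduced in the abstract; the C sketch below only shows what "grid geometry selection" means in practice: choosing a thread-block size and block count for a kernel from its loop trip count. The first function mimics the kind of fixed heuristic the paper improves on; the second shows the shape of a feature-based choice. All constants and thresholds are illustrative assumptions.

```c
/* Naive fixed heuristic of the kind the paper replaces: one constant
 * block size regardless of any kernel feature. */
void fixed_geometry(long trip_count, int *blocks, int *threads) {
    *threads = 128;                          /* arbitrary constant */
    *blocks  = (int)((trip_count + *threads - 1) / *threads);
}

/* Shape of a feature-based choice: small iteration spaces get smaller
 * blocks so enough blocks exist to occupy every streaming multiprocessor.
 * The thresholds are illustrative, not the paper's learned values. */
void feature_geometry(long trip_count, int num_sms,
                      int *blocks, int *threads) {
    if (trip_count < 64L * num_sms)
        *threads = 32;       /* keep all SMs busy on tiny iteration spaces */
    else
        *threads = 256;
    *blocks = (int)((trip_count + *threads - 1) / *threads);
}
```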
Citations: 6
High-Performance Ensembles of Online Sequential Extreme Learning Machine for Regression and Time Series Forecasting
Luis Fernando L. Grim, A. Gradvohl
Ensembles of the Online Sequential Extreme Learning Machine (OS-ELM) algorithm are suitable for forecasting data streams with concept drifts. Nevertheless, forecasting over data streams requires high-performance implementations due to the high rate of incoming samples. In this work, we propose to tune up three ensembles that operate with the Online Sequential Extreme Learning Machine, using high-performance techniques. We reimplemented them in the C programming language with the Intel MKL and MPI libraries. Intel MKL provides functions that exploit the multithreading features of multicore CPUs, which extends the parallelism to multiprocessor architectures. MPI allows us to parallelize tasks with distributed memory across several processes, which can be allocated within a single computational node or spread over several nodes. In summary, our proposal consists of a two-level parallelization: we allocate each ensemble model to an MPI process, and we parallelize the internal functions of each model over a set of threads through Intel MKL. The objective of this work is thus to verify whether our proposals provide a significant improvement in execution time compared to the respective conventional serial approaches. For the experiments, we used a synthetic and a real dataset. Experimental results showed that, in general, the high-performance ensembles improve on the execution time of their serial versions, running up to 10-fold faster.
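A minimal sketch of the two-level scheme described above, assuming one MPI process per ensemble member with multithreaded MKL inside each process. The matrix sizes and the single GEMM stand in for the OS-ELM internal updates, which the abstract does not detail; the thread count is likewise illustrative.

```c
#include <stdio.h>
#include <mpi.h>
#include <mkl.h>    /* cblas_dgemm and mkl_set_num_threads */

#define N 512

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* one rank = one model */
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    mkl_set_num_threads(4);                  /* illustrative threads per rank */

    static double A[N*N], B[N*N], C[N*N];
    /* ... each rank would load its own model state into A and B ... */

    /* Level 2: MKL runs this GEMM on its own thread team inside the rank. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, A, N, B, N, 0.0, C, N);

    /* Level 1: combine the per-model forecasts across ranks into a mean. */
    double local = C[0], ensemble;
    MPI_Allreduce(&local, &ensemble, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) printf("ensemble forecast: %f\n", ensemble / nranks);

    MPI_Finalize();
    return 0;
}
```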
Citations: 0
Frequency Selection Approach for Energy Aware Cloud Database
Chaopeng Guo, J. Pierson
Many cloud systems have been adopted in industry and academia to cope with the explosion of data volume and the arrival of the big data era. Meanwhile, energy efficiency and energy saving have become major concerns for the data centers where massive cloud systems are deployed. However, energy waste is quite common due to resource over-provisioning. In this paper, using Dynamic Voltage and Frequency Scaling (DVFS), a frequency selection approach is introduced to improve the energy efficiency of cloud systems in the presence of resource over-provisioning. The approach comprises two algorithms: a Genetic Algorithm (GA) and a Monte Carlo Tree Search (MCTS) algorithm. A cloud database system is taken as an example to evaluate the approach. The experimental results show that the algorithms scale well, applying to a 120-node case with high accuracy relative to optimal solutions (up to 99.9% and 99.6% for GA and MCTS, respectively). According to an optimality bound analysis, up to 21% of the energy can be saved using our frequency selection approach.
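The GA and MCTS are search procedures over per-node frequencies; the mechanism they drive is standard DVFS. As a minimal sketch, assuming a Linux host with the `userspace` cpufreq governor enabled and root privileges, a selected frequency can be applied as follows. The sysfs path is the stock kernel interface, not code from the paper.

```c
#include <stdio.h>

/* Apply a frequency (in kHz) to one core via the Linux cpufreq
 * "userspace" governor. Requires root and that governor to be active. */
int set_cpu_khz(int cpu, long khz) {
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;                 /* no permission or wrong governor */
    fprintf(f, "%ld\n", khz);
    return fclose(f);
}
```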
Citations: 1
Performance Prediction of GPU-Based Deep Learning Applications
E. Gianniti, Li Zhang, D. Ardagna
Recent years have seen increasing success in the application of deep learning methods across various domains, tackling problems ranging from image recognition and classification to text processing and speech recognition. In this paper we propose and validate an approach to model the execution time of training convolutional neural networks (CNNs) deployed on GPGPUs. We demonstrate that our approach is generally applicable to a variety of CNN models and different types of GPGPUs with high accuracy, targeting the preliminary design phases of system sizing.
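The abstract does not state the model's form; as a hedged sketch of the general idea behind such predictors, a per-layer analytical model sums an estimated time for each layer from its operation count and a fitted effective throughput. The structure and the roofline-style max are illustrative assumptions, not the paper's model.

```c
typedef struct {
    double flops;    /* operations in the layer's forward+backward pass */
    double bytes;    /* data moved by the layer */
} layer_t;

/* Predicted time: per layer, take the larger of the compute and memory
 * phases, then sum over layers. peak_flops and peak_bw would be fitted
 * per GPU from calibration runs rather than taken from datasheets. */
double predict_time(const layer_t *layers, int n,
                    double peak_flops, double peak_bw) {
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        double t_compute = layers[i].flops / peak_flops;
        double t_memory  = layers[i].bytes / peak_bw;
        total += t_compute > t_memory ? t_compute : t_memory;
    }
    return total;
}
```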
Citations: 19
Improving the Performance of Fog Computing Through the Use of Data Locality
L. Steffenel
Fog computing extends the cloud computing paradigm to the edge of the network, developing a decentralized infrastructure in which services are distributed to the locations that best meet the needs of applications, such as low communication latency, data caching, or confidentiality. P2P-based platforms are good candidates to host fog computing, but they usually lack important capabilities such as controlling where data is stored and who handles the computing tasks. As a consequence, controlling where data is stored becomes as important as controlling who handles it. In this paper we propose different techniques to reinforce data locality in P2P-based middleware and study how these techniques can be implemented. Experimental results demonstrate the benefit of data locality for data access performance.
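The abstract does not name its techniques; one common way to obtain data locality in a DHT-style P2P overlay, shown here purely as a hedged sketch, is to constrain the key hash so that data published at a site falls into a region of the ID space mapped onto that site's nodes. The hash choice and the site-tag scheme are illustrative, not necessarily what the paper implements.

```c
#include <stdint.h>

/* FNV-1a, used here only as a stand-in for the overlay's hash function. */
static uint64_t fnv1a(const char *s) {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (; *s; s++) { h ^= (uint8_t)*s; h *= 0x100000001b3ULL; }
    return h;
}

/* Locality-aware key: the high 16 bits come from the site tag, so every
 * key published at one site lands in one contiguous region of the ID
 * space, which the overlay can assign to nodes located at that site. */
uint64_t locality_key(const char *site, const char *name) {
    uint64_t region = fnv1a(site) >> 48;          /* 16-bit site prefix */
    return (region << 48) | (fnv1a(name) & 0xFFFFFFFFFFFFULL);
}
```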
Citations: 11
HyperSpace: Distributed Bayesian Hyperparameter Optimization
M. T. Young, Jacob Hinkle, A. Ramanathan, R. Kannan
As machine learning models continue to increase in complexity, so does the potential number of free model parameters, commonly known as hyperparameters. While there has been considerable progress toward finding optimal configurations of these hyperparameters, many optimization procedures are treated as black boxes. We believe optimization methods should not only return a set of optimized hyperparameters, but also give insight into the effects of model hyperparameter settings. To this end, we present HyperSpace, a parallel implementation of Bayesian sequential model-based optimization. HyperSpace leverages high-performance computing (HPC) resources to better understand unknown, potentially non-convex hyperparameter search spaces. We show that it is possible to learn the dependencies between model hyperparameters through the optimization process. By partitioning large search spaces and running many optimization procedures in parallel, we also show that it is possible to discover families of good hyperparameter settings over a variety of models, including unsupervised clustering, regression, and classification tasks.
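The core idea stated above, partitioning a large search space and running one optimization per partition in parallel, can be sketched in C with MPI as follows. The two-parameter space, the disjoint 2x2 split (kept disjoint here for simplicity), and `run_bayes_opt` are illustrative placeholders rather than HyperSpace's actual interface.

```c
#include <mpi.h>

typedef struct { double lo, hi; } interval_t;

/* Placeholder for a sequential model-based optimizer over one block. */
extern void run_bayes_opt(interval_t lr, interval_t reg);

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Two hyperparameters, each range split in half: 2x2 = 4 disjoint
     * blocks, one per MPI rank (run with exactly 4 ranks). */
    interval_t lr_full = {1e-5, 1e-1}, reg_full = {0.0, 1.0};
    int i = rank / 2, j = rank % 2;
    double lr_mid  = 0.5 * (lr_full.lo  + lr_full.hi);
    double reg_mid = 0.5 * (reg_full.lo + reg_full.hi);
    interval_t lr  = { i ? lr_mid  : lr_full.lo,  i ? lr_full.hi  : lr_mid  };
    interval_t reg = { j ? reg_mid : reg_full.lo, j ? reg_full.hi : reg_mid };

    run_bayes_opt(lr, reg);   /* each rank searches only its own block */

    MPI_Finalize();
    return 0;
}
```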
Citations: 16
Towards a Single-Host Many-GPU System
Ming-Hung Chen, I. Chung, B. Abali, P. Crumley
As computation-intensive tasks such as deep learning and big data analysis take advantage of GPU-based accelerators, the interconnection links may become a bottleneck. In this paper, we investigate the upcoming performance bottleneck of multi-accelerator systems as the number of accelerators attached to a single host grows. We instrumented the host PCIe fabric to measure the data transfers and compared the results with measurements from software tools. The results show how peer-to-peer (P2P) data transfer helps avoid the bottleneck on the interconnection links, but also that multi-GPU performance does not scale up as expected because of control messages. We quantify the impact of host control messages and offer suggestions to remedy the scalability bottlenecks. We also implement the proposed strategy on LULESH to validate the concept. The results show that our strategy can save 59.86% of the kernel execution time and 13.32% of the PCIe host-to-device (H2D) payload.
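For reference, the P2P transfers discussed above are exposed by the standard CUDA runtime peer-access API; the C sketch below shows the usual pattern for a direct GPU-to-GPU copy that bypasses host memory. Error handling is trimmed, and the helper itself is illustrative rather than the paper's instrumentation code.

```c
#include <cuda_runtime.h>

/* Copy `bytes` from a buffer on GPU src_dev to one on GPU dst_dev
 * directly over the PCIe/NVLink fabric, without staging in host RAM. */
int p2p_copy(void *dst, int dst_dev, const void *src, int src_dev,
             size_t bytes) {
    int can = 0;
    cudaDeviceCanAccessPeer(&can, dst_dev, src_dev);
    if (can) {
        cudaSetDevice(dst_dev);
        cudaDeviceEnablePeerAccess(src_dev, 0);  /* once per device pair */
    }
    /* cudaMemcpyPeer falls back to a host-staged copy if peer access
     * is unavailable, which is exactly the slow path measured above. */
    return (int)cudaMemcpyPeer(dst, dst_dev, src, src_dev, bytes);
}
```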
Citations: 1