Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916215
Distributed Direction-Optimizing Label Propagation for Community Detection
Xu T. Liu, J. Firoz, Marcin Zalewski, M. Halappanavar, K. Barker, A. Lumsdaine, A. Gebremedhin
Designing a scalable algorithm for community detection is challenging due to the simultaneous need for both high performance and high quality of solution. We propose a new distributed algorithm for community detection based on a novel Label Propagation algorithm. The algorithm is inspired by the direction-optimization technique in graph traversal algorithms, relies on the use of frontiers, and alternates between abstractions called label push and label pull. This organization creates flexibility and affords us opportunities for balancing performance and quality of solution. We implement our algorithm in distributed memory with the active-message-based asynchronous many-task runtime AM++. We experiment with two strategies for the initial seeding stage, namely random seeding and degree seeding. With the Graph Challenge dataset, our distributed implementation, in conjunction with the runtime support, detects communities in graphs with 20 million vertices in less than one second while achieving reasonably high quality of solution.
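The abstract describes the push/pull alternation only at a high level. As a rough illustration, the following is a minimal single-process sketch of frontier-based label propagation that switches between a label-pull and a label-push step; the switching threshold, tie-breaking, and all names are ours, and the paper's distributed AM++ implementation and seeding strategies are not reproduced here.

    from collections import Counter

    def direction_optimizing_lp(adj, switch_ratio=0.05, max_iters=100):
        # adj: dict mapping each vertex to a list of neighbors (undirected graph).
        # Every vertex starts in its own community; the frontier holds vertices
        # whose labels changed in the previous round.
        labels = {v: v for v in adj}
        frontier = set(adj)
        for _ in range(max_iters):
            if not frontier:
                break
            if len(frontier) > switch_ratio * len(adj):
                # "pull": the frontier is dense, so every vertex scans its
                # neighborhood and adopts the most frequent neighbor label.
                new_labels = {}
                for v, nbrs in adj.items():
                    counts = Counter(labels[u] for u in nbrs)
                    new_labels[v] = counts.most_common(1)[0][0] if counts else labels[v]
            else:
                # "push": the frontier is sparse, so only active vertices send
                # their label; receivers adopt the most frequently received label.
                received = {}
                for v in frontier:
                    for u in adj[v]:
                        received.setdefault(u, Counter())[labels[v]] += 1
                new_labels = dict(labels)
                for u, counts in received.items():
                    new_labels[u] = counts.most_common(1)[0][0]
            frontier = {v for v in adj if new_labels[v] != labels[v]}
            labels = new_labels
        return labels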
{"title":"Distributed Direction-Optimizing Label Propagation for Community Detection","authors":"Xu T. Liu, J. Firoz, Marcin Zalewski, M. Halappanavar, K. Barker, A. Lumsdaine, A. Gebremedhin","doi":"10.1109/HPEC.2019.8916215","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916215","url":null,"abstract":"Designing a scalable algorithm for community detection is challenging due to the simultaneous need for both high performance and quality of solution. We propose a new distributed algorithm for community detection based on a novel Label Propagation algorithm. The algorithm is inspired by the direction optimization technique in graph traversal algorithms, relies on the use of frontiers, and alternates between abstractions called label push and label pull. This organization creates flexibility and affords us with opportunities for balancing performance and quality of solution. We implement our algorithm in distributed memory with the active-message based asynchronous many-task runtime AM++. We experiment with two seeding strategies for the initial seeding stage, namely, random seeding and degree seeding. With the Graph Challenge dataset, our distributed implementation, in conjunction with the runtime support, detects the communities in graphs having 20 million vertices in less than one second while achieving reasonably high quality of solution.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132792541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916328
Design and Implementation of Knowledge Base for Runtime Management of Software Defined Hardware
Hongkuan Zhou, Ajitesh Srivastava, R. Kannan, V. Prasanna
PageRank is a fundamental graph algorithm to evaluate the importance of vertices in a graph. In this paper, we present an efficient parallel PageRank design based on an edge-centric scatter-gather model. To overcome the poor locality of PageRank and optimize the memory performance, we develop a fast and efficient partitioning technique. We first partition all the vertices into non-overlapping vertex sets such that the data of each vertex set can fit in the cache; then we sort the outgoing edges of each vertex set based on the destination vertices to minimize random memory writes. The partitioning technique significantly reduces random accesses to main memory and improves the sustained memory bandwidth by 3×. It also enables efficient parallel execution on multicore platforms; we use distinct cores to execute the computations of distinct vertex sets in parallel to achieve speedup. We implement our design on a 16-core Intel Xeon processor and use various large-scale real-life and synthetic datasets for evaluation. Compared with the PageRank Pipeline Benchmark, our design achieves 12× to 19× speedup for all the datasets.
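As a rough illustration of the edge-centric scatter-gather pattern summarized above, the sketch below partitions vertices into fixed-size ranges (a stand-in for "fits in the cache") and orders edges by destination partition before the gather; the partition size, the simplified treatment of dangling vertices, and all names are ours rather than details taken from the paper.

    import numpy as np

    def partitioned_pagerank(src, dst, n, part_size=1 << 16, d=0.85, iters=20):
        # src, dst: integer arrays of edge endpoints; n: number of vertices.
        out_deg = np.bincount(src, minlength=n).astype(float)
        out_deg[out_deg == 0] = 1.0          # dangling vertices handled crudely for brevity
        # Group edges by destination partition, then by destination vertex; in this
        # vectorized sketch the ordering does not change the result, it only mimics
        # the locality-friendly edge layout described in the abstract.
        order = np.lexsort((dst, dst // part_size))
        src, dst = src[order], dst[order]
        rank = np.full(n, 1.0 / n)
        for _ in range(iters):
            contrib = rank / out_deg                     # scatter: per-source contribution
            new_rank = np.full(n, (1.0 - d) / n)
            np.add.at(new_rank, dst, d * contrib[src])   # gather: accumulate per destination
            rank = new_rank
        return rank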
Runtime-reconfigurable software coupled with reconfigurable hardware is highly desirable as a means of maximizing runtime efficiency without compromising programmability. Compilers for such software systems are extremely difficult to design because they must exploit different types of hardware at runtime. To address the need for static and dynamic compiler optimization of workflows matched to dynamically reconfigurable hardware, we propose a novel design for the central component of a dynamic software compiler for software-defined hardware. Our comprehensive design focuses not only on static knowledge but also on semi-supervised extraction of knowledge from program executions and the development of their performance models. Specifically, our new dynamic and extensible knowledge base 1) continuously gathers knowledge during workflow execution and 2) identifies the best implementation of a workflow on the best (available) hardware configuration. It plays a central role in storing information from, and providing information to, the other components of the compiler as well as human analysts. Through a rich tripartite-graph representation, the knowledge base captures and learns extensive information about decomposing and mapping code steps to kernels, and mapping kernels to available hardware configurations. The knowledge base is implemented using the C++ Boost library and can rapidly process offline and online queries and updates. We show that our knowledge base can answer queries within 1 ms regardless of the number of workflows it stores. To the best of our knowledge, this is the first design of a dynamic and extensible knowledge base that supports compiling high-level languages to exploit arbitrary reconfigurable platforms.
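A minimal sketch of what a tripartite knowledge-base query could look like, assuming a mapping from workflow steps to candidate kernels and from (kernel, hardware) pairs to measured runtimes; all names and numbers are illustrative and not taken from the paper (which implements the knowledge base in C++ with Boost).

    # (kernel, hardware) -> observed runtime in ms; all entries are made up.
    step_to_kernels = {"matmul": ["matmul_cpu", "matmul_fpga"]}
    kernel_perf = {
        ("matmul_cpu", "xeon"): 12.0,
        ("matmul_fpga", "fpga_cfg_a"): 3.5,
    }

    def best_implementation(step, available_hw):
        # Return the (kernel, hardware) pair with the lowest recorded runtime
        # among the hardware configurations currently available.
        candidates = [((k, hw), t) for (k, hw), t in kernel_perf.items()
                      if k in step_to_kernels.get(step, []) and hw in available_hw]
        return min(candidates, key=lambda kv: kv[1])[0] if candidates else None

    print(best_implementation("matmul", {"xeon", "fpga_cfg_a"}))  # ('matmul_fpga', 'fpga_cfg_a')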
{"title":"Design and Implementation of Knowledge Base for Runtime Management of Software Defined Hardware","authors":"Hongkuan Zhou, Ajitesh Srivastava, R. Kannan, V. Prasanna","doi":"10.1109/HPEC.2019.8916328","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916328","url":null,"abstract":"PageRank is a fundamental graph algorithm to evaluate the importance of vertices in a graph. In this paper, we present an efficient parallel PageRank design based on an edge-centric scatter-gather model. To overcome the poor locality of PageRank and optimize the memory performance, we develop a fast and efficient partitioning technique. We first partition all the vertices into non-overlapping vertex sets such that the data of each vertex set can fit in the cache; then we sort the outgoing edges of each vertex set based on the destination vertices to minimize random memory writes. The partitioning technique significantly reduces random accesses to main memory and improves the sustained memory bandwidth by 3×. It also enables efficient parallel execution on multicore platforms; we use distinct cores to execute the computations of distinct vertex sets in parallel to achieve speedup. We implement our design on a 16-core Intel Xeon processor and use various large-scale real-life and synthetic datasets for evaluation. Compared with the PageRank Pipeline Benchmark, our design achieves 12× to 19× speedup for all the datasets.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131738217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916301
A Parallel Simulation Approach to ACAS X Development
A. Gjersvik, Robert J. Moss
With a rapidly growing and evolving National Airspace System (NAS), ACAS X is intended to be the next-generation airborne collision avoidance system that can meet the demands its predecessor could not. The ACAS X algorithms are developed in the Julia programming language and are exercised in simulation environments tailored to test different characteristics of the system. Massive parallelization of these simulation environments has been implemented on the Lincoln Laboratory Supercomputing Center cluster in order to expedite the design and performance optimization of the system. This work outlines the approach to parallelizing one of our simulation tools and presents the resulting simulation speedups, along with a discussion of how it will enhance system characterization and design. Parallelization has made our simulation environment 33 times faster, which has greatly sped up the development process of ACAS X.
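The ACAS X simulations themselves are written in Julia and dispatched across the LLSC cluster; the following generic Python sketch only illustrates the embarrassingly parallel pattern of farming out independent simulation runs, with a stand-in toy "encounter" whose contents are entirely ours.

    import random
    from multiprocessing import Pool

    def run_encounter(seed):
        # Placeholder for one independent encounter simulation; on the cluster,
        # a scheduler dispatches many such runs to different nodes.
        rng = random.Random(seed)
        return {"seed": seed, "min_separation_ft": rng.uniform(100, 5000)}

    if __name__ == "__main__":
        with Pool() as pool:                                   # one worker per local core
            results = pool.map(run_encounter, range(1000))     # independent runs in parallel
        print(min(r["min_separation_ft"] for r in results))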
{"title":"A Parallel Simulation Approach to ACAS X Development","authors":"A. Gjersvik, Robert J. Moss","doi":"10.1109/HPEC.2019.8916301","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916301","url":null,"abstract":"With a rapidly growing and evolving National Airspace System (NAS), ACAS X is intended to be the nextgeneration airborne collision avoidance system that can meet the demands its predecessor could not. The ACAS X algorithms are developed in the Julia programming language and are exercised in simulation environments tailored to test different characteristics of the system. Massive parallelization of these simulation environments has been implemented on the Lincoln Laboratory Supercomputing Center cluster in order to expedite the design and performance optimization of the system. This work outlines the approach to parallelization of one of our simulation tools and presents the resulting simulation speedups as well as a discussion on how it will enhance system characterization and design. Parallelization has made our simulation environment 33 times faster, which has greatly sped up the development process of ACAS X.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133462714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916443
Singularity for Machine Learning Applications - Analysis of Performance Impact
B. R. Jordan, David Barrett, David Burke, Patrick Jardin, Amelia Littrell, P. Monticciolo, Michael Newey, J. Piou, Kara Warner
Software deployments in general, and deep learning applications in particular, suffer from difficulty in reproducing results. The use of containers to mitigate these issues is becoming common practice. Singularity is a container technology that targets the unique issues present in High Performance Computing (HPC) centers. This paper characterizes the performance impact of using Singularity for both training and inference in deep learning applications.
{"title":"Singularity for Machine Learning Applications - Analysis of Performance Impact","authors":"B. R. Jordan, David Barrett, David Burke, Patrick Jardin, Amelia Littrell, P. Monticciolo, Michael Newey, J. Piou, Kara Warner","doi":"10.1109/HPEC.2019.8916443","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916443","url":null,"abstract":"Software deployments in general, and deep learning applications in particular, suffer from difficulty in reproducible results. The use of containers to mitigate these issues is becoming a common practice. Singularity is a container technology which targets the unique issues present in High Performance Computing (HPC) Centers. This paper characterizes the impact of using Singularity for both Training and Inference on deep learning applications.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"209 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115552403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916307
Skip the Intersection: Quickly Counting Common Neighbors on Shared-Memory Systems
Xiaojing An, Kasimir Gabert, James Fox, Oded Green, David A. Bader
Counting common neighbors between all vertex pairs in a graph is a fundamental operation, with uses in similarity measures, link prediction, graph compression, community detection, and more. Current shared-memory approaches either rely on set intersections or are not readily parallelizable. We introduce a new, efficient, and parallelizable algorithm to count common neighbors: starting at a wedge endpoint, we iterate through all wedges in the graph and increment the common neighbor count for each endpoint pair. This exactly counts the common neighbors between all pairs without using set intersections, and as such attains an asymptotic improvement in runtime. Furthermore, our algorithm is simple to implement, and only slight modifications are required for existing implementations to use our results. We provide an OpenMP implementation and evaluate it on real-world and synthetic graphs, demonstrating no loss of scalability and an asymptotic improvement. We show intersections are neither necessary nor helpful for computing all-pairs common-neighbor counts.
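A serial sketch of the wedge-based counting idea described above (the paper's OpenMP-parallel, endpoint-oriented enumeration is not reproduced): each wedge centered at a vertex c witnesses one common neighbor for the pair of its endpoints.

    from collections import defaultdict
    from itertools import combinations

    def common_neighbor_counts(adj):
        # adj: dict vertex -> iterable of neighbors (undirected graph).
        # Every pair (a, b) of neighbors of a center c forms a wedge (a, c, b),
        # and c is one common neighbor of a and b.
        counts = defaultdict(int)
        for c, neighbors in adj.items():
            for a, b in combinations(sorted(neighbors), 2):
                counts[(a, b)] += 1
        return counts

    adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
    print(common_neighbor_counts(adj))   # e.g. (1, 2) -> 1 via center 3, (2, 3) -> 1 via center 1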
{"title":"Skip the Intersection: Quickly Counting Common Neighbors on Shared-Memory Systems","authors":"Xiaojing An, Kasimir Gabert, James Fox, Oded Green, David A. Bader","doi":"10.1109/HPEC.2019.8916307","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916307","url":null,"abstract":"Counting common neighbors between all vertex pairs in a graph is a fundamental operation, with uses in similarity measures, link prediction, graph compression, community detection, and more. Current shared-memory approaches either rely on set intersections or are not readily parallelizable. We introduce a new efficient and parallelizable algorithm to count common neighbors: starting at a wedge endpoint, we iterate through all wedges in the graph, and increment the common neighbor count for each endpoint pair. This exactly counts the common neighbors between all pairs without using set intersections, and as such attains an asymptotic improvement in runtime. Furthermore, our algorithm is simple to implement and only slight modifications are required for existing implementations to use our results. We provide an OpenMP implementation and evaluate it on real-world and synthetic graphs, demonstrating no loss of scalability and an asymptotic improvement. We show intersections are neither necessary nor helpful for computing all pairs common neighbor counts.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126588492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916397
ECG Feature Processing Performance Acceleration on SLURM Compute Systems
Michael Nolan, Mark Hernandez, Philip Fremont-Smith, A. Swiston, K. Claypool
Electrocardiogram (ECG) signal features (e.g., heart rate, intrapeak interval times) are commonly used in physiological assessment. Commercial off-the-shelf (COTS) software solutions for ECG data processing are available, but they are often developed for serialized data processing, which scales poorly to large datasets. To address this issue, we have developed a Matlab code library for parallelized ECG feature generation. This library uses the pMatlab and MatMPI interfaces to distribute computing tasks over supercomputing clusters using the Simple Linux Utility for Resource Management (SLURM). To profile its performance as a function of parallelization scale, the ECG processing code was executed on a non-human primate dataset on the Lincoln Laboratory Supercomputing TXGreen cluster. Feature processing jobs were deployed over a range of processor counts and processor types to assess the overall reduction in job computation time. We show that individual process times decrease according to a 1/n relationship with the number of processors used, while total computation times, which account for deployment and data aggregation, yield diminishing returns as processor count grows. A maximum mean reduction in overall file processing time of 99% is shown.
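A simple model consistent with the reported scaling (the notation is ours, not the paper's): if T_w is the total per-file feature-extraction work and T_o the roughly fixed deployment and aggregation overhead, then

    T(n) ≈ T_o + T_w / n,        speedup(n) = (T_o + T_w) / (T_o + T_w / n),

so each worker's share of the work falls as 1/n while the overall speedup saturates near (T_o + T_w) / T_o as n grows, which is the diminishing-returns behavior described above.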
{"title":"ECG Feature Processing Performance Acceleration on SLURM Compute Systems","authors":"Michael Nolan, Mark Hernandez, Philip Fremont-Smith, A. Swiston, K. Claypool","doi":"10.1109/HPEC.2019.8916397","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916397","url":null,"abstract":"Electrocardiogram (ECG) signal features (e.g. Heart rate, intrapeak interval times) are data commonly used in physiological assessment. Commercial off-the-shelf (COTS) software solutions for ECG data processing are available, but are often developed for serialized data processing which scale poorly for large datasets. To address this issue, we’ve developed a Matlab code library for parallelized ECG feature generation. This library uses the pMatlab and MatMPI interfaces to distribute computing tasks over supercomputing clusters using the Simple Linux Utility for Resource Management (SLURM). To profile its performance as a function of parallelization scale, the ECG processing code was executed on a non-human primate dataset on the Lincoln Laboratory Supercomputing TXGreen cluster. Feature processing jobs were deployed over a range of processor counts and processor types to assess the overall reduction in job computation time. We show that individual process times decrease according to a 1/n relationship to the number of processors used, while total computation times accounting for deployment and data aggregation impose diminishing returns of time against processor count. A maximum mean reduction in overall file processing time of 99% is shown.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124857800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916560
Introducing DyMonDS-as-a-Service (DyMaaS) for Internet of Things
M. Ilić, Rupamathi Jaddivada
With recent trends in computation and communication architecture, it is becoming possible to simulate complex networked dynamical systems using high-fidelity models. The inherent spatial and temporal complexity of these systems, however, still acts as a roadblock. It is thus desirable to have an adaptive platform design that facilitates zooming in and out of the models to emulate the time evolution of processes at a desired spatial and temporal granularity. In this paper, we propose new computing and networking abstractions that can embrace physical dynamics and computations in a unified manner by taking advantage of this inherent structure. We further design multi-rate numerical methods that can be implemented by computing architectures to facilitate adaptive zooming in and out of models spanning multiple spatial and temporal layers. These methods are all embedded in a platform called Dynamic Monitoring and Decision Systems (DyMonDS). We introduce a new cloud-computing service model called DyMonDS-as-a-Service (DyMaaS), for use by operators at various spatial granularities to efficiently emulate the interconnection of IoT devices. The usage of this platform is described in the context of an electric microgrid system emulation.
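As a generic illustration of multi-rate integration (the paper's specific methods and interfaces are not given in the abstract), the sketch below advances a fast subsystem with a small step inside each macro step of a slow subsystem, exchanging values only at the slow rate; the model, step sizes, and names are ours.

    def multirate_euler(f_fast, f_slow, x_fast, x_slow, h_slow, substeps, t_end):
        # Forward-Euler multi-rate sketch: the fast state is integrated with step
        # h_slow/substeps while the slow state advances once per macro step,
        # holding the other subsystem's value fixed across the exchange interval.
        t, h_fast = 0.0, h_slow / substeps
        while t < t_end:
            x_slow_held = x_slow                       # value exchanged at the slow rate
            for _ in range(substeps):                  # zoomed-in integration of fast dynamics
                x_fast += h_fast * f_fast(x_fast, x_slow_held)
            x_slow += h_slow * f_slow(x_fast, x_slow)  # zoomed-out update of slow dynamics
            t += h_slow
        return x_fast, x_slow

    # Example: a fast decay toward a slowly varying source.
    xf, xs = multirate_euler(lambda xf, xs: -50.0 * (xf - xs),
                             lambda xf, xs: -0.1 * xs,
                             x_fast=0.0, x_slow=1.0,
                             h_slow=0.1, substeps=100, t_end=1.0)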
{"title":"Introducing DyMonDS-as-a-Service (DyMaaS) for Internet of Things","authors":"M. Ilić, Rupamathi Jaddivada","doi":"10.1109/HPEC.2019.8916560","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916560","url":null,"abstract":"With recent trends in computation and communication architecture, it is becoming possible to simulate complex networked dynamical systems by employing high-fidelity models. The inherent spatial and temporal complexity of these systems, however, still acts as a roadblock. It is thus desirable to have adaptive platform design facilitating zooming-in and out of the models to emulate time-evolution of processes at a desired spatial and temporal granularity. In this paper, we propose new computing and networking abstractions, that can embrace physical dynamics and computations in a unified manner, by taking advantage of the inherent structure. We further design multi-rate numerical methods that can be implemented by computing architectures to facilitate adaptive zooming-in and out of the models spanning multiple spatial and temporal layers. These methods are all embedded in a platform called Dynamic Monitoring and Decision Systems (DyMonDS). We introduce a new service model of cloud computing called DyMonDS-as-a-Service (DyMaas), for use by operators at various spatial granularities to efficiently emulate the interconnection of IoT devices. The usage of this platform is described in the context of an electric microgrid system emulation.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131278892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916542
Fast Stochastic Block Partitioning via Sampling
Frank Wanye, Vitaliy Gleyzer, Wu-chun Feng
Community detection in graphs, also known as graph partitioning, is a well-studied NP-hard problem. Various heuristic approaches have been adopted to tackle this problem in polynomial time. One such approach, as outlined in the IEEE HPEC Graph Challenge, is Bayesian statistics-based stochastic block partitioning. This method delivers high-quality partitions in sub-quadratic runtime, but it fails to scale to very large graphs. In this paper, we present sampling as an avenue for speeding up the algorithm on large graphs. We first show that existing sampling techniques can preserve a graph’s community structure. We then show that sampling for stochastic block partitioning can be used to produce a speedup of between 2.18× and 7.26× for graph sizes between 5,000 and 50,000 vertices without a significant loss in the accuracy of community detection.
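One way such sampling can be organized (a sketch under our own assumptions, not the paper's specific pipeline): run the expensive partitioner on a sampled induced subgraph, then place the remaining vertices by a simple neighbor-majority rule.

    import random
    from collections import Counter

    def sample_then_partition(adj, partition_fn, sample_frac=0.2, seed=0):
        # adj: dict vertex -> list of neighbors; partition_fn maps an adjacency
        # dict to a {vertex: block} assignment (e.g., stochastic block partitioning).
        rng = random.Random(seed)
        sample = set(rng.sample(sorted(adj), max(1, int(sample_frac * len(adj)))))
        sub_adj = {v: [u for u in adj[v] if u in sample] for v in sample}
        blocks = partition_fn(sub_adj)                 # expensive step runs on the small graph
        for v in adj:
            if v not in blocks:                        # unsampled vertex: vote among labeled neighbors
                votes = Counter(blocks[u] for u in adj[v] if u in blocks)
                blocks[v] = votes.most_common(1)[0][0] if votes else v
        return blocks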
{"title":"Fast Stochastic Block Partitioning via Sampling","authors":"Frank Wanye, Vitaliy Gleyzer, Wu-chun Feng","doi":"10.1109/HPEC.2019.8916542","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916542","url":null,"abstract":"Community detection in graphs, also known as graph partitioning, is a well-studied NP-hard problem. Various heuristic approaches have been adopted to tackle this problem in polynomial time. One such approach, as outlined in the IEEE HPEC Graph Challenge, is Bayesian statistics-based stochastic block partitioning. This method delivers high-quality partitions in sub-quadratic runtime, but it fails to scale to very large graphs. In this paper, we present sampling as an avenue for speeding up the algorithm on large graphs. We first show that existing sampling techniques can preserve a graph’s community structure. We then show that sampling for stochastic block partitioning can be used to produce a speedup of between $2.18 times$ and $7.26 times$ for graph sizes between 5,000 and 50,000 vertices without a significant loss in the accuracy of community detection.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130442985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916403
Target-based Resource Allocation for Deep Learning Applications in a Multi-tenancy System
Wenjia Zheng, Yun Song, Zihao Guo, Yongcheng Cui, Suwen Gu, Ying Mao, Long Cheng
Neural-network-based deep learning is the key technology behind many powerful applications, including self-driving vehicles, computer vision, and natural language processing. Although different algorithms focus on different problems, they generally follow an iteration-by-iteration process of training and evaluation. Each iteration seeks a parameter set that minimizes a loss function defined by the learning model, and when the training process completes, a set of optimized parameters that minimizes the loss is obtained. At this stage, deep learning applications can be shipped with the trained model to provide services. While deep learning applications are reshaping our daily lives, obtaining a good learning model is expensive: training deep learning models is usually time-consuming and requires substantial resources, e.g., CPUs and GPUs. In a multi-tenancy system, moreover, limited resources are shared by multiple clients, which leads to severe resource contention. A carefully designed resource management scheme is therefore required to improve overall performance. In this project, we propose a target-based scheduling scheme named TRADL. In TRADL, developers can specify a two-tier target: if the accuracy of the model reaches a target, the model can be delivered to clients while training continues to improve its quality. The experiments show that TRADL reduces the time to reach the target by as much as 48.2%.
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916286
Towards Improving Rate-Distortion Performance of Transform-Based Lossy Compression for HPC Datasets
Jialing Zhang, Aekyeung Moon, Xiaoyan Zhuo, S. Son
As the size and amount of data produced by high-performance computing (HPC) applications grow exponentially, effective data reduction techniques are becoming critical to mitigating the time and space burden. Lossy compression techniques, which have been widely used in image and video compression, hold promise for fulfilling this data reduction need. However, they are seldom adopted for HPC datasets because of the difficulty of quantifying the amount of information loss and data reduction. In this paper, we explore a lossy compression strategy by revisiting the energy compaction properties of discrete transforms on HPC datasets. Specifically, we apply block-based transforms to HPC datasets, obtain the minimum number of coefficients containing the maximum energy (or information) compaction rate, and quantize the remaining non-dominant coefficients using a binning mechanism to minimize the information loss expressed in a distortion measure. We implement the proposed approach and evaluate it using six real-world HPC datasets. Our experimental results show that, on average, only 6.67 bits are required to preserve an optimal energy compaction rate on our evaluated datasets. Moreover, our knee-detection algorithm improves the distortion in terms of peak signal-to-noise ratio by 2.46 dB on average.
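A small sketch of the block-transform-plus-binning idea on a single 1-D block, using a DCT and uniform bins; the choice of transform, block length, number of kept coefficients, and bin count are illustrative assumptions, not the paper's tuned settings.

    import numpy as np
    from scipy.fft import dct, idct

    def compress_block(block, keep, n_bins=256):
        # Keep the 'keep' largest-energy DCT coefficients exactly and quantize the
        # remaining coefficients into n_bins uniform bins, then reconstruct.
        coeffs = dct(block, norm="ortho")
        order = np.argsort(np.abs(coeffs))[::-1]        # indices by descending energy
        kept_idx, rest_idx = order[:keep], order[keep:]
        rest = coeffs[rest_idx]
        lo, hi = rest.min(), rest.max()
        scale = (hi - lo) / (n_bins - 1) if hi > lo else 1.0
        quantized = np.round((rest - lo) / scale)       # bin indices for non-dominant coeffs
        # Reconstruction (what a decompressor would do):
        approx = np.zeros_like(coeffs)
        approx[kept_idx] = coeffs[kept_idx]
        approx[rest_idx] = quantized * scale + lo
        return idct(approx, norm="ortho")

    x = np.sin(np.linspace(0, 8 * np.pi, 1024)) + 0.01 * np.random.randn(1024)
    x_hat = compress_block(x, keep=64)
    print(float(np.max(np.abs(x - x_hat))))             # small reconstruction error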
{"title":"Towards Improving Rate-Distortion Performance of Transform-Based Lossy Compression for HPC Datasets","authors":"Jialing Zhang, Aekyeung Moon, Xiaoyan Zhuo, S. Son","doi":"10.1109/HPEC.2019.8916286","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916286","url":null,"abstract":"As the size and amount of data produced by high-performance computing (HPC) applications grow exponentially, an effective data reduction technique is becoming critical to mitigating time and space burden. Lossy compression techniques, which have been widely used in image and video compression, hold promise to fulfill such data reduction need. However, they are seldom adopted in HPC datasets because of their difficulty in quantifying the amount of information loss and data reduction. In this paper, we explore a lossy compression strategy by revisiting the energy compaction properties of discrete transforms on HPC datasets. Specifically, we apply block-based transforms to HPC datasets, obtain the minimum number of coefficients containing the maximum energy (or information) compaction rate, and quantize remaining non-dominant coefficients using a binning mechanism to minimize information loss expressed in a distortion measure. We implement the proposed approach and evaluate it using six real-world HPC datasets. Our experimental results show that, on average, only 6.67 bits are required to preserve an optimal energy compaction rate on our evaluated datasets. Moreover, our knee detection algorithm improves the distortion in terms of peak signal-to-noise ratio by 2.46 dB on average.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133765375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}