Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916215
Distributed Direction-Optimizing Label Propagation for Community Detection
Xu T. Liu, J. Firoz, Marcin Zalewski, M. Halappanavar, K. Barker, A. Lumsdaine, A. Gebremedhin
Designing a scalable algorithm for community detection is challenging due to the simultaneous need for both high performance and high quality of solution. We propose a new distributed algorithm for community detection based on a novel Label Propagation algorithm. The algorithm is inspired by the direction-optimization technique in graph traversal algorithms, relies on the use of frontiers, and alternates between abstractions called label push and label pull. This organization creates flexibility and affords us opportunities for balancing performance and quality of solution. We implement our algorithm in distributed memory with the active-message-based asynchronous many-task runtime AM++. We experiment with two strategies for the initial seeding stage, namely random seeding and degree seeding. With the Graph Challenge dataset, our distributed implementation, in conjunction with the runtime support, detects communities in graphs with 20 million vertices in less than one second while achieving reasonably high quality of solution.
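The abstract describes the push/pull alternation only at a high level. As a rough illustration, the following is a minimal single-process sketch of frontier-based label propagation that switches between a label-pull and a label-push step; the switching threshold, tie-breaking, and all names are ours, and the paper's distributed AM++ implementation and seeding strategies are not reproduced here.

    from collections import Counter

    def direction_optimizing_lp(adj, switch_ratio=0.05, max_iters=100):
        # adj: dict mapping each vertex to a list of neighbors (undirected graph).
        # Every vertex starts in its own community; the frontier holds vertices
        # whose labels changed in the previous round.
        labels = {v: v for v in adj}
        frontier = set(adj)
        for _ in range(max_iters):
            if not frontier:
                break
            if len(frontier) > switch_ratio * len(adj):
                # "pull": the frontier is dense, so every vertex scans its
                # neighborhood and adopts the most frequent neighbor label.
                new_labels = {}
                for v, nbrs in adj.items():
                    counts = Counter(labels[u] for u in nbrs)
                    new_labels[v] = counts.most_common(1)[0][0] if counts else labels[v]
            else:
                # "push": the frontier is sparse, so only active vertices send
                # their label; receivers adopt the most frequently received label.
                received = {}
                for v in frontier:
                    for u in adj[v]:
                        received.setdefault(u, Counter())[labels[v]] += 1
                new_labels = dict(labels)
                for u, counts in received.items():
                    new_labels[u] = counts.most_common(1)[0][0]
            frontier = {v for v in adj if new_labels[v] != labels[v]}
            labels = new_labels
        return labels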
{"title":"Distributed Direction-Optimizing Label Propagation for Community Detection","authors":"Xu T. Liu, J. Firoz, Marcin Zalewski, M. Halappanavar, K. Barker, A. Lumsdaine, A. Gebremedhin","doi":"10.1109/HPEC.2019.8916215","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916215","url":null,"abstract":"Designing a scalable algorithm for community detection is challenging due to the simultaneous need for both high performance and quality of solution. We propose a new distributed algorithm for community detection based on a novel Label Propagation algorithm. The algorithm is inspired by the direction optimization technique in graph traversal algorithms, relies on the use of frontiers, and alternates between abstractions called label push and label pull. This organization creates flexibility and affords us with opportunities for balancing performance and quality of solution. We implement our algorithm in distributed memory with the active-message based asynchronous many-task runtime AM++. We experiment with two seeding strategies for the initial seeding stage, namely, random seeding and degree seeding. With the Graph Challenge dataset, our distributed implementation, in conjunction with the runtime support, detects the communities in graphs having 20 million vertices in less than one second while achieving reasonably high quality of solution.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132792541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916328
Design and Implementation of Knowledge Base for Runtime Management of Software Defined Hardware
Hongkuan Zhou, Ajitesh Srivastava, R. Kannan, V. Prasanna
PageRank is a fundamental graph algorithm to evaluate the importance of vertices in a graph. In this paper, we present an efficient parallel PageRank design based on an edge-centric scatter-gather model. To overcome the poor locality of PageRank and optimize the memory performance, we develop a fast and efficient partitioning technique. We first partition all the vertices into non-overlapping vertex sets such that the data of each vertex set can fit in the cache; then we sort the outgoing edges of each vertex set based on the destination vertices to minimize random memory writes. The partitioning technique significantly reduces random accesses to main memory and improves the sustained memory bandwidth by 3×. It also enables efficient parallel execution on multicore platforms; we use distinct cores to execute the computations of distinct vertex sets in parallel to achieve speedup. We implement our design on a 16-core Intel Xeon processor and use various large-scale real-life and synthetic datasets for evaluation. Compared with the PageRank Pipeline Benchmark, our design achieves 12× to 19× speedup for all the datasets.
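As a rough illustration of the edge-centric scatter-gather pattern summarized above, the sketch below partitions vertices into fixed-size ranges (a stand-in for "fits in the cache") and orders edges by destination partition before the gather; the partition size, the simplified treatment of dangling vertices, and all names are ours rather than details taken from the paper.

    import numpy as np

    def partitioned_pagerank(src, dst, n, part_size=1 << 16, d=0.85, iters=20):
        # src, dst: integer arrays of edge endpoints; n: number of vertices.
        out_deg = np.bincount(src, minlength=n).astype(float)
        out_deg[out_deg == 0] = 1.0          # dangling vertices handled crudely for brevity
        # Group edges by destination partition, then by destination vertex; in this
        # vectorized sketch the ordering does not change the result, it only mimics
        # the locality-friendly edge layout described in the abstract.
        order = np.lexsort((dst, dst // part_size))
        src, dst = src[order], dst[order]
        rank = np.full(n, 1.0 / n)
        for _ in range(iters):
            contrib = rank / out_deg                     # scatter: per-source contribution
            new_rank = np.full(n, (1.0 - d) / n)
            np.add.at(new_rank, dst, d * contrib[src])   # gather: accumulate per destination
            rank = new_rank
        return rank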
Runtime-reconfigurable software coupled with reconfigurable hardware is highly desirable as a means of maximizing runtime efficiency without compromising programmability. Compilers for such software systems are extremely difficult to design because they must exploit different types of hardware at runtime. To address the need for static and dynamic compiler optimization of workflows matched to dynamically reconfigurable hardware, we propose a novel design for the central component of a dynamic software compiler for software-defined hardware. Our comprehensive design focuses not only on static knowledge but also on semi-supervised extraction of knowledge from program executions and the development of their performance models. Specifically, our new dynamic and extensible knowledge base 1) continuously gathers knowledge during workflow execution and 2) identifies the best implementation of a workflow on the best (available) hardware configuration. It plays a central role in storing information from, and providing information to, the other components of the compiler as well as human analysts. Through a rich tripartite-graph representation, the knowledge base captures and learns extensive information about decomposing and mapping code steps to kernels, and mapping kernels to available hardware configurations. The knowledge base is implemented using the C++ Boost library and can rapidly process offline and online queries and updates. We show that our knowledge base can answer queries within 1 ms regardless of the number of workflows it stores. To the best of our knowledge, this is the first design of a dynamic and extensible knowledge base that supports compiling high-level languages to exploit arbitrary reconfigurable platforms.
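A minimal sketch of what a tripartite knowledge-base query could look like, assuming a mapping from workflow steps to candidate kernels and from (kernel, hardware) pairs to measured runtimes; all names and numbers are illustrative and not taken from the paper (which implements the knowledge base in C++ with Boost).

    # (kernel, hardware) -> observed runtime in ms; all entries are made up.
    step_to_kernels = {"matmul": ["matmul_cpu", "matmul_fpga"]}
    kernel_perf = {
        ("matmul_cpu", "xeon"): 12.0,
        ("matmul_fpga", "fpga_cfg_a"): 3.5,
    }

    def best_implementation(step, available_hw):
        # Return the (kernel, hardware) pair with the lowest recorded runtime
        # among the hardware configurations currently available.
        candidates = [((k, hw), t) for (k, hw), t in kernel_perf.items()
                      if k in step_to_kernels.get(step, []) and hw in available_hw]
        return min(candidates, key=lambda kv: kv[1])[0] if candidates else None

    print(best_implementation("matmul", {"xeon", "fpga_cfg_a"}))  # ('matmul_fpga', 'fpga_cfg_a')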
{"title":"Design and Implementation of Knowledge Base for Runtime Management of Software Defined Hardware","authors":"Hongkuan Zhou, Ajitesh Srivastava, R. Kannan, V. Prasanna","doi":"10.1109/HPEC.2019.8916328","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916328","url":null,"abstract":"PageRank is a fundamental graph algorithm to evaluate the importance of vertices in a graph. In this paper, we present an efficient parallel PageRank design based on an edge-centric scatter-gather model. To overcome the poor locality of PageRank and optimize the memory performance, we develop a fast and efficient partitioning technique. We first partition all the vertices into non-overlapping vertex sets such that the data of each vertex set can fit in the cache; then we sort the outgoing edges of each vertex set based on the destination vertices to minimize random memory writes. The partitioning technique significantly reduces random accesses to main memory and improves the sustained memory bandwidth by 3×. It also enables efficient parallel execution on multicore platforms; we use distinct cores to execute the computations of distinct vertex sets in parallel to achieve speedup. We implement our design on a 16-core Intel Xeon processor and use various large-scale real-life and synthetic datasets for evaluation. Compared with the PageRank Pipeline Benchmark, our design achieves 12× to 19× speedup for all the datasets.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131738217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916301
A Parallel Simulation Approach to ACAS X Development
A. Gjersvik, Robert J. Moss
With a rapidly growing and evolving National Airspace System (NAS), ACAS X is intended to be the next-generation airborne collision avoidance system that can meet the demands its predecessor could not. The ACAS X algorithms are developed in the Julia programming language and are exercised in simulation environments tailored to test different characteristics of the system. Massive parallelization of these simulation environments has been implemented on the Lincoln Laboratory Supercomputing Center cluster in order to expedite the design and performance optimization of the system. This work outlines the approach to parallelizing one of our simulation tools and presents the resulting simulation speedups, along with a discussion of how it will enhance system characterization and design. Parallelization has made our simulation environment 33 times faster, which has greatly sped up the development process of ACAS X.
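The ACAS X simulations themselves are written in Julia and dispatched across the LLSC cluster; the following generic Python sketch only illustrates the embarrassingly parallel pattern of farming out independent simulation runs, with a stand-in toy "encounter" whose contents are entirely ours.

    import random
    from multiprocessing import Pool

    def run_encounter(seed):
        # Placeholder for one independent encounter simulation; on the cluster,
        # a scheduler dispatches many such runs to different nodes.
        rng = random.Random(seed)
        return {"seed": seed, "min_separation_ft": rng.uniform(100, 5000)}

    if __name__ == "__main__":
        with Pool() as pool:                                   # one worker per local core
            results = pool.map(run_encounter, range(1000))     # independent runs in parallel
        print(min(r["min_separation_ft"] for r in results))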
{"title":"A Parallel Simulation Approach to ACAS X Development","authors":"A. Gjersvik, Robert J. Moss","doi":"10.1109/HPEC.2019.8916301","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916301","url":null,"abstract":"With a rapidly growing and evolving National Airspace System (NAS), ACAS X is intended to be the nextgeneration airborne collision avoidance system that can meet the demands its predecessor could not. The ACAS X algorithms are developed in the Julia programming language and are exercised in simulation environments tailored to test different characteristics of the system. Massive parallelization of these simulation environments has been implemented on the Lincoln Laboratory Supercomputing Center cluster in order to expedite the design and performance optimization of the system. This work outlines the approach to parallelization of one of our simulation tools and presents the resulting simulation speedups as well as a discussion on how it will enhance system characterization and design. Parallelization has made our simulation environment 33 times faster, which has greatly sped up the development process of ACAS X.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133462714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916443
Singularity for Machine Learning Applications - Analysis of Performance Impact
B. R. Jordan, David Barrett, David Burke, Patrick Jardin, Amelia Littrell, P. Monticciolo, Michael Newey, J. Piou, Kara Warner
Software deployments in general, and deep learning applications in particular, suffer from difficulty in reproducing results. The use of containers to mitigate these issues is becoming common practice. Singularity is a container technology that targets the unique issues present in High Performance Computing (HPC) centers. This paper characterizes the performance impact of using Singularity for both training and inference in deep learning applications.
{"title":"Singularity for Machine Learning Applications - Analysis of Performance Impact","authors":"B. R. Jordan, David Barrett, David Burke, Patrick Jardin, Amelia Littrell, P. Monticciolo, Michael Newey, J. Piou, Kara Warner","doi":"10.1109/HPEC.2019.8916443","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916443","url":null,"abstract":"Software deployments in general, and deep learning applications in particular, suffer from difficulty in reproducible results. The use of containers to mitigate these issues is becoming a common practice. Singularity is a container technology which targets the unique issues present in High Performance Computing (HPC) Centers. This paper characterizes the impact of using Singularity for both Training and Inference on deep learning applications.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"209 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115552403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916307
Skip the Intersection: Quickly Counting Common Neighbors on Shared-Memory Systems
Xiaojing An, Kasimir Gabert, James Fox, Oded Green, David A. Bader
Counting common neighbors between all vertex pairs in a graph is a fundamental operation, with uses in similarity measures, link prediction, graph compression, community detection, and more. Current shared-memory approaches either rely on set intersections or are not readily parallelizable. We introduce a new, efficient, and parallelizable algorithm to count common neighbors: starting at a wedge endpoint, we iterate through all wedges in the graph and increment the common neighbor count for each endpoint pair. This exactly counts the common neighbors between all pairs without using set intersections, and as such attains an asymptotic improvement in runtime. Furthermore, our algorithm is simple to implement, and only slight modifications are required for existing implementations to use our results. We provide an OpenMP implementation and evaluate it on real-world and synthetic graphs, demonstrating no loss of scalability and an asymptotic improvement. We show intersections are neither necessary nor helpful for computing all-pairs common-neighbor counts.
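A serial sketch of the wedge-based counting idea described above (the paper's OpenMP-parallel, endpoint-oriented enumeration is not reproduced): each wedge centered at a vertex c witnesses one common neighbor for the pair of its endpoints.

    from collections import defaultdict
    from itertools import combinations

    def common_neighbor_counts(adj):
        # adj: dict vertex -> iterable of neighbors (undirected graph).
        # Every pair (a, b) of neighbors of a center c forms a wedge (a, c, b),
        # and c is one common neighbor of a and b.
        counts = defaultdict(int)
        for c, neighbors in adj.items():
            for a, b in combinations(sorted(neighbors), 2):
                counts[(a, b)] += 1
        return counts

    adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
    print(common_neighbor_counts(adj))   # e.g. (1, 2) -> 1 via center 3, (2, 3) -> 1 via center 1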
{"title":"Skip the Intersection: Quickly Counting Common Neighbors on Shared-Memory Systems","authors":"Xiaojing An, Kasimir Gabert, James Fox, Oded Green, David A. Bader","doi":"10.1109/HPEC.2019.8916307","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916307","url":null,"abstract":"Counting common neighbors between all vertex pairs in a graph is a fundamental operation, with uses in similarity measures, link prediction, graph compression, community detection, and more. Current shared-memory approaches either rely on set intersections or are not readily parallelizable. We introduce a new efficient and parallelizable algorithm to count common neighbors: starting at a wedge endpoint, we iterate through all wedges in the graph, and increment the common neighbor count for each endpoint pair. This exactly counts the common neighbors between all pairs without using set intersections, and as such attains an asymptotic improvement in runtime. Furthermore, our algorithm is simple to implement and only slight modifications are required for existing implementations to use our results. We provide an OpenMP implementation and evaluate it on real-world and synthetic graphs, demonstrating no loss of scalability and an asymptotic improvement. We show intersections are neither necessary nor helpful for computing all pairs common neighbor counts.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126588492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916397
ECG Feature Processing Performance Acceleration on SLURM Compute Systems
Michael Nolan, Mark Hernandez, Philip Fremont-Smith, A. Swiston, K. Claypool
Electrocardiogram (ECG) signal features (e.g., heart rate, intrapeak interval times) are commonly used in physiological assessment. Commercial off-the-shelf (COTS) software solutions for ECG data processing are available, but they are often developed for serialized data processing, which scales poorly to large datasets. To address this issue, we have developed a Matlab code library for parallelized ECG feature generation. This library uses the pMatlab and MatMPI interfaces to distribute computing tasks over supercomputing clusters using the Simple Linux Utility for Resource Management (SLURM). To profile its performance as a function of parallelization scale, the ECG processing code was executed on a non-human primate dataset on the Lincoln Laboratory Supercomputing TXGreen cluster. Feature processing jobs were deployed over a range of processor counts and processor types to assess the overall reduction in job computation time. We show that individual process times decrease according to a 1/n relationship with the number of processors used, while total computation times, which account for deployment and data aggregation, yield diminishing returns as processor count grows. A maximum mean reduction in overall file processing time of 99% is shown.
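A simple model consistent with the reported scaling (the notation is ours, not the paper's): if T_w is the total per-file feature-extraction work and T_o the roughly fixed deployment and aggregation overhead, then

    T(n) ≈ T_o + T_w / n,        speedup(n) = (T_o + T_w) / (T_o + T_w / n),

so each worker's share of the work falls as 1/n while the overall speedup saturates near (T_o + T_w) / T_o as n grows, which is the diminishing-returns behavior described above.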
{"title":"ECG Feature Processing Performance Acceleration on SLURM Compute Systems","authors":"Michael Nolan, Mark Hernandez, Philip Fremont-Smith, A. Swiston, K. Claypool","doi":"10.1109/HPEC.2019.8916397","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916397","url":null,"abstract":"Electrocardiogram (ECG) signal features (e.g. Heart rate, intrapeak interval times) are data commonly used in physiological assessment. Commercial off-the-shelf (COTS) software solutions for ECG data processing are available, but are often developed for serialized data processing which scale poorly for large datasets. To address this issue, we’ve developed a Matlab code library for parallelized ECG feature generation. This library uses the pMatlab and MatMPI interfaces to distribute computing tasks over supercomputing clusters using the Simple Linux Utility for Resource Management (SLURM). To profile its performance as a function of parallelization scale, the ECG processing code was executed on a non-human primate dataset on the Lincoln Laboratory Supercomputing TXGreen cluster. Feature processing jobs were deployed over a range of processor counts and processor types to assess the overall reduction in job computation time. We show that individual process times decrease according to a 1/n relationship to the number of processors used, while total computation times accounting for deployment and data aggregation impose diminishing returns of time against processor count. A maximum mean reduction in overall file processing time of 99% is shown.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124857800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916560
Introducing DyMonDS-as-a-Service (DyMaaS) for Internet of Things
M. Ilić, Rupamathi Jaddivada
With recent trends in computation and communication architecture, it is becoming possible to simulate complex networked dynamical systems using high-fidelity models. The inherent spatial and temporal complexity of these systems, however, still acts as a roadblock. It is thus desirable to have an adaptive platform design that facilitates zooming in and out of the models to emulate the time evolution of processes at a desired spatial and temporal granularity. In this paper, we propose new computing and networking abstractions that can embrace physical dynamics and computations in a unified manner by taking advantage of this inherent structure. We further design multi-rate numerical methods that can be implemented by computing architectures to facilitate adaptive zooming in and out of models spanning multiple spatial and temporal layers. These methods are all embedded in a platform called Dynamic Monitoring and Decision Systems (DyMonDS). We introduce a new cloud-computing service model called DyMonDS-as-a-Service (DyMaaS), for use by operators at various spatial granularities to efficiently emulate the interconnection of IoT devices. The usage of this platform is described in the context of an electric microgrid system emulation.
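As a generic illustration of multi-rate integration (the paper's specific methods and interfaces are not given in the abstract), the sketch below advances a fast subsystem with a small step inside each macro step of a slow subsystem, exchanging values only at the slow rate; the model, step sizes, and names are ours.

    def multirate_euler(f_fast, f_slow, x_fast, x_slow, h_slow, substeps, t_end):
        # Forward-Euler multi-rate sketch: the fast state is integrated with step
        # h_slow/substeps while the slow state advances once per macro step,
        # holding the other subsystem's value fixed across the exchange interval.
        t, h_fast = 0.0, h_slow / substeps
        while t < t_end:
            x_slow_held = x_slow                       # value exchanged at the slow rate
            for _ in range(substeps):                  # zoomed-in integration of fast dynamics
                x_fast += h_fast * f_fast(x_fast, x_slow_held)
            x_slow += h_slow * f_slow(x_fast, x_slow)  # zoomed-out update of slow dynamics
            t += h_slow
        return x_fast, x_slow

    # Example: a fast decay toward a slowly varying source.
    xf, xs = multirate_euler(lambda xf, xs: -50.0 * (xf - xs),
                             lambda xf, xs: -0.1 * xs,
                             x_fast=0.0, x_slow=1.0,
                             h_slow=0.1, substeps=100, t_end=1.0)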
{"title":"Introducing DyMonDS-as-a-Service (DyMaaS) for Internet of Things","authors":"M. Ilić, Rupamathi Jaddivada","doi":"10.1109/HPEC.2019.8916560","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916560","url":null,"abstract":"With recent trends in computation and communication architecture, it is becoming possible to simulate complex networked dynamical systems by employing high-fidelity models. The inherent spatial and temporal complexity of these systems, however, still acts as a roadblock. It is thus desirable to have adaptive platform design facilitating zooming-in and out of the models to emulate time-evolution of processes at a desired spatial and temporal granularity. In this paper, we propose new computing and networking abstractions, that can embrace physical dynamics and computations in a unified manner, by taking advantage of the inherent structure. We further design multi-rate numerical methods that can be implemented by computing architectures to facilitate adaptive zooming-in and out of the models spanning multiple spatial and temporal layers. These methods are all embedded in a platform called Dynamic Monitoring and Decision Systems (DyMonDS). We introduce a new service model of cloud computing called DyMonDS-as-a-Service (DyMaas), for use by operators at various spatial granularities to efficiently emulate the interconnection of IoT devices. The usage of this platform is described in the context of an electric microgrid system emulation.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131278892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916542
Fast Stochastic Block Partitioning via Sampling
Frank Wanye, Vitaliy Gleyzer, Wu-chun Feng
Community detection in graphs, also known as graph partitioning, is a well-studied NP-hard problem. Various heuristic approaches have been adopted to tackle this problem in polynomial time. One such approach, as outlined in the IEEE HPEC Graph Challenge, is Bayesian statistics-based stochastic block partitioning. This method delivers high-quality partitions in sub-quadratic runtime, but it fails to scale to very large graphs. In this paper, we present sampling as an avenue for speeding up the algorithm on large graphs. We first show that existing sampling techniques can preserve a graph’s community structure. We then show that sampling for stochastic block partitioning can be used to produce a speedup of between 2.18× and 7.26× for graph sizes between 5,000 and 50,000 vertices without a significant loss in the accuracy of community detection.
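One way such sampling can be organized (a sketch under our own assumptions, not the paper's specific pipeline): run the expensive partitioner on a sampled induced subgraph, then place the remaining vertices by a simple neighbor-majority rule.

    import random
    from collections import Counter

    def sample_then_partition(adj, partition_fn, sample_frac=0.2, seed=0):
        # adj: dict vertex -> list of neighbors; partition_fn maps an adjacency
        # dict to a {vertex: block} assignment (e.g., stochastic block partitioning).
        rng = random.Random(seed)
        sample = set(rng.sample(sorted(adj), max(1, int(sample_frac * len(adj)))))
        sub_adj = {v: [u for u in adj[v] if u in sample] for v in sample}
        blocks = partition_fn(sub_adj)                 # expensive step runs on the small graph
        for v in adj:
            if v not in blocks:                        # unsampled vertex: vote among labeled neighbors
                votes = Counter(blocks[u] for u in adj[v] if u in blocks)
                blocks[v] = votes.most_common(1)[0][0] if votes else v
        return blocks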
{"title":"Fast Stochastic Block Partitioning via Sampling","authors":"Frank Wanye, Vitaliy Gleyzer, Wu-chun Feng","doi":"10.1109/HPEC.2019.8916542","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916542","url":null,"abstract":"Community detection in graphs, also known as graph partitioning, is a well-studied NP-hard problem. Various heuristic approaches have been adopted to tackle this problem in polynomial time. One such approach, as outlined in the IEEE HPEC Graph Challenge, is Bayesian statistics-based stochastic block partitioning. This method delivers high-quality partitions in sub-quadratic runtime, but it fails to scale to very large graphs. In this paper, we present sampling as an avenue for speeding up the algorithm on large graphs. We first show that existing sampling techniques can preserve a graph’s community structure. We then show that sampling for stochastic block partitioning can be used to produce a speedup of between $2.18 times$ and $7.26 times$ for graph sizes between 5,000 and 50,000 vertices without a significant loss in the accuracy of community detection.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130442985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916403
Target-based Resource Allocation for Deep Learning Applications in a Multi-tenancy System
Wenjia Zheng, Yun Song, Zihao Guo, Yongcheng Cui, Suwen Gu, Ying Mao, Long Cheng
Neural-network-based deep learning is the key technology behind many powerful applications, including self-driving vehicles, computer vision, and natural language processing. Although different algorithms focus on different problems, they generally follow an iteration-by-iteration process of training and evaluation. Each iteration seeks a parameter set that minimizes a loss function defined by the learning model, and when the training process completes, a set of optimized parameters that minimizes the loss is obtained. At this stage, deep learning applications can be shipped with the trained model to provide services. While deep learning applications are reshaping our daily lives, obtaining a good learning model is expensive: training deep learning models is usually time-consuming and requires substantial resources, e.g., CPUs and GPUs. In a multi-tenancy system, moreover, limited resources are shared by multiple clients, which leads to severe resource contention. A carefully designed resource management scheme is therefore required to improve overall performance. In this project, we propose a target-based scheduling scheme named TRADL. In TRADL, developers can specify a two-tier target: if the accuracy of the model reaches a target, the model can be delivered to clients while training continues to improve its quality. The experiments show that TRADL reduces the time to reach the target by as much as 48.2%.
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916286
Towards Improving Rate-Distortion Performance of Transform-Based Lossy Compression for HPC Datasets
Jialing Zhang, Aekyeung Moon, Xiaoyan Zhuo, S. Son
As the size and amount of data produced by high-performance computing (HPC) applications grow exponentially, effective data reduction techniques are becoming critical to mitigating the time and space burden. Lossy compression techniques, which have been widely used in image and video compression, hold promise for fulfilling this data reduction need. However, they are seldom adopted for HPC datasets because of the difficulty of quantifying the amount of information loss and data reduction. In this paper, we explore a lossy compression strategy by revisiting the energy compaction properties of discrete transforms on HPC datasets. Specifically, we apply block-based transforms to HPC datasets, obtain the minimum number of coefficients containing the maximum energy (or information) compaction rate, and quantize the remaining non-dominant coefficients using a binning mechanism to minimize the information loss expressed in a distortion measure. We implement the proposed approach and evaluate it using six real-world HPC datasets. Our experimental results show that, on average, only 6.67 bits are required to preserve an optimal energy compaction rate on our evaluated datasets. Moreover, our knee-detection algorithm improves the distortion in terms of peak signal-to-noise ratio by 2.46 dB on average.
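A small sketch of the block-transform-plus-binning idea on a single 1-D block, using a DCT and uniform bins; the choice of transform, block length, number of kept coefficients, and bin count are illustrative assumptions, not the paper's tuned settings.

    import numpy as np
    from scipy.fft import dct, idct

    def compress_block(block, keep, n_bins=256):
        # Keep the 'keep' largest-energy DCT coefficients exactly and quantize the
        # remaining coefficients into n_bins uniform bins, then reconstruct.
        coeffs = dct(block, norm="ortho")
        order = np.argsort(np.abs(coeffs))[::-1]        # indices by descending energy
        kept_idx, rest_idx = order[:keep], order[keep:]
        rest = coeffs[rest_idx]
        lo, hi = rest.min(), rest.max()
        scale = (hi - lo) / (n_bins - 1) if hi > lo else 1.0
        quantized = np.round((rest - lo) / scale)       # bin indices for non-dominant coeffs
        # Reconstruction (what a decompressor would do):
        approx = np.zeros_like(coeffs)
        approx[kept_idx] = coeffs[kept_idx]
        approx[rest_idx] = quantized * scale + lo
        return idct(approx, norm="ortho")

    x = np.sin(np.linspace(0, 8 * np.pi, 1024)) + 0.01 * np.random.randn(1024)
    x_hat = compress_block(x, keep=64)
    print(float(np.max(np.abs(x - x_hat))))             # small reconstruction error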
{"title":"Towards Improving Rate-Distortion Performance of Transform-Based Lossy Compression for HPC Datasets","authors":"Jialing Zhang, Aekyeung Moon, Xiaoyan Zhuo, S. Son","doi":"10.1109/HPEC.2019.8916286","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916286","url":null,"abstract":"As the size and amount of data produced by high-performance computing (HPC) applications grow exponentially, an effective data reduction technique is becoming critical to mitigating time and space burden. Lossy compression techniques, which have been widely used in image and video compression, hold promise to fulfill such data reduction need. However, they are seldom adopted in HPC datasets because of their difficulty in quantifying the amount of information loss and data reduction. In this paper, we explore a lossy compression strategy by revisiting the energy compaction properties of discrete transforms on HPC datasets. Specifically, we apply block-based transforms to HPC datasets, obtain the minimum number of coefficients containing the maximum energy (or information) compaction rate, and quantize remaining non-dominant coefficients using a binning mechanism to minimize information loss expressed in a distortion measure. We implement the proposed approach and evaluate it using six real-world HPC datasets. Our experimental results show that, on average, only 6.67 bits are required to preserve an optimal energy compaction rate on our evaluated datasets. Moreover, our knee detection algorithm improves the distortion in terms of peak signal-to-noise ratio by 2.46 dB on average.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133765375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}