
Latest Publications from the 2019 IEEE High Performance Extreme Computing Conference (HPEC)

A Novel Design of Adaptive and Hierarchical Convolutional Neural Networks using Partial Reconfiguration on FPGA
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916237
Mohammad Farhadi, Mehdi Ghasemi, Yezhou Yang
Nowadays most research in visual recognition using Convolutional Neural Networks (CNNs) follows the “deeper model with deeper confidence” belief to gain higher recognition accuracy. At the same time, a deeper model brings heavier computation. On the other hand, for a large chunk of recognition challenges, a system can classify images correctly using simple models, or so-called shallow networks. Moreover, the implementation of CNNs faces size, weight, and energy constraints on embedded devices. In this paper, we implement adaptive switching between shallow and deep networks to reach the highest throughput on a resource-constrained MPSoC with a CPU and an FPGA. To this end, we develop and present a novel architecture for CNNs in which a gate decides whether using the deeper model is beneficial. Due to resource limitations on the FPGA, partial reconfiguration is used to accommodate deep CNNs within the FPGA resources. We report experimental results on the CIFAR-10, CIFAR-100, and SVHN datasets to validate our approach. Using a confidence metric as the decision-making factor, only 69.8%, 71.8%, and 43.8% of the computation in the deepest network is performed for CIFAR-10, CIFAR-100, and SVHN, respectively, while the desired accuracy is maintained with a throughput of around 400 images per second on the SVHN dataset. https://github.com/mfarhadi/AHCNN.
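The gating idea lends itself to a compact illustration. Below is a minimal PyTorch-style sketch of confidence-gated inference for a single input; `shallow_net`, `deep_net`, and `threshold` are illustrative names rather than the authors' API, and the paper's actual gate and its partial-reconfiguration handling on the FPGA are more involved.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def gated_classify(shallow_net, deep_net, x, threshold=0.9):
    """Classify one input cheaply when the shallow model is confident,
    falling back to the deep model otherwise."""
    probs = F.softmax(shallow_net(x), dim=1)
    confidence, label = probs.max(dim=1)
    if confidence.item() >= threshold:
        return label.item(), "shallow"                   # cheap path taken
    return deep_net(x).argmax(dim=1).item(), "deep"      # expensive fallback
```

Raising `threshold` pushes more inputs to the deep path, trading throughput for accuracy; the percentages reported in the abstract reflect where that trade-off lands per dataset.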
Citations: 22
IdPrism: Rapid Analysis of Forensic DNA Samples Using MPS SNP Profiles
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916521
D. Ricke, James Watkins, Philip Fremont-Smith, Adam Michaleas
Massively parallel sequencing (MPS) of large single nucleotide polymorphism (SNP) panels enables identification, analysis of complex DNA mixture samples, and extended kinship predictions. Computational challenges related to SNP allele calling, probability of random man not excluded calculations, and comparisons of both reference and complex mixture samples against tens of millions of reference profiles were encountered and resolved when scaling up from thousands to tens of thousands of SNP loci. An MPS SNP analysis pipeline is described for the rapid analysis of forensic deoxyribonucleic acid (DNA) samples, covering thousands to tens of thousands of SNP loci, against tens of millions of reference profiles. This pipeline is part of the MIT Lincoln Laboratory (MITLL) IdPrism advanced DNA forensic system.
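To make the scale of the comparison step concrete, here is a toy sketch that screens a mixture's detected alleles against reference profiles by set overlap. All names are hypothetical, and the IdPrism pipeline relies on statistical measures such as random man not excluded probabilities, not this simple score.

```python
# Profiles are modeled as sets of detected SNP alleles (locus:allele strings).

def overlap_score(mixture_alleles: set, reference_alleles: set) -> float:
    """Fraction of a reference profile's alleles observed in the mixture."""
    if not reference_alleles:
        return 0.0
    return len(mixture_alleles & reference_alleles) / len(reference_alleles)

def screen_references(mixture_alleles, references, threshold=0.95):
    """Return IDs of references almost fully contained in the mixture
    (candidate contributors); real pipelines scan tens of millions."""
    return [ref_id for ref_id, alleles in references.items()
            if overlap_score(mixture_alleles, alleles) >= threshold]

refs = {"R1": {"rs1:A", "rs2:G"}, "R2": {"rs1:T", "rs3:C"}}
mix = {"rs1:A", "rs2:G", "rs3:C"}
print(screen_references(mix, refs))  # ['R1']: R2's rs1:T allele is absent
```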
Citations: 4
[HPEC 2019 Copyright notice]
Pub Date : 2019-09-01 DOI: 10.1109/hpec.2019.8916557
{"title":"[HPEC 2019 Copyright notice]","authors":"","doi":"10.1109/hpec.2019.8916557","DOIUrl":"https://doi.org/10.1109/hpec.2019.8916557","url":null,"abstract":"","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114012282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Singularity for Machine Learning Applications - Analysis of Performance Impact
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916443
B. R. Jordan, David Barrett, David Burke, Patrick Jardin, Amelia Littrell, P. Monticciolo, Michael Newey, J. Piou, Kara Warner
Software deployments in general, and deep learning applications in particular, suffer from difficulty in reproducing results. The use of containers to mitigate these issues is becoming common practice. Singularity is a container technology that targets the unique issues present in High Performance Computing (HPC) centers. This paper characterizes the impact of using Singularity on both training and inference in deep learning applications.
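As a rough illustration of the kind of measurement the paper performs, the sketch below times the same hypothetical training script natively and inside a Singularity image. The image name and script are placeholders; `--nv` is Singularity's flag for exposing the host's NVIDIA GPU driver inside the container.

```python
import subprocess
import time

def timed(cmd):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

# Containerized run (assumes a "pytorch.sif" image built beforehand).
container_s = timed(["singularity", "exec", "--nv", "pytorch.sif",
                     "python", "train.py"])
native_s = timed(["python", "train.py"])     # bare-metal baseline
print(f"container: {container_s:.1f}s  native: {native_s:.1f}s  "
      f"overhead: {100 * (container_s / native_s - 1):.1f}%")
```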
Citations: 2
Skip the Intersection: Quickly Counting Common Neighbors on Shared-Memory Systems
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916307
Xiaojing An, Kasimir Gabert, James Fox, Oded Green, David A. Bader
Counting common neighbors between all vertex pairs in a graph is a fundamental operation, with uses in similarity measures, link prediction, graph compression, community detection, and more. Current shared-memory approaches either rely on set intersections or are not readily parallelizable. We introduce a new efficient and parallelizable algorithm to count common neighbors: starting at a wedge endpoint, we iterate through all wedges in the graph, and increment the common neighbor count for each endpoint pair. This exactly counts the common neighbors between all pairs without using set intersections, and as such attains an asymptotic improvement in runtime. Furthermore, our algorithm is simple to implement and only slight modifications are required for existing implementations to use our results. We provide an OpenMP implementation and evaluate it on real-world and synthetic graphs, demonstrating no loss of scalability and an asymptotic improvement. We show intersections are neither necessary nor helpful for computing all pairs common neighbor counts.
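The core loop is simple enough to sketch directly. Below is a serial Python rendering of the wedge-iteration idea on an adjacency-list graph: every pair of neighbors (u, w) of a center vertex c forms a wedge u-c-w, so c contributes one common neighbor to the pair. Names are illustrative, and the paper's implementation parallelizes this over wedge endpoints with OpenMP rather than running serially.

```python
from collections import defaultdict
from itertools import combinations

def common_neighbor_counts(adj):
    """Count common neighbors for all vertex pairs without set intersections."""
    counts = defaultdict(int)   # (u, w) with u < w -> number of common neighbors
    for center, neighbors in adj.items():
        for u, w in combinations(sorted(neighbors), 2):
            counts[(u, w)] += 1  # 'center' is a common neighbor of u and w
    return counts

# Undirected toy graph: edges 0-1, 0-2, 0-3, 1-2, 2-3.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(common_neighbor_counts(adj)[(1, 3)])  # vertices 1 and 3 share {0, 2} -> 2
```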
Citations: 2
ECG Feature Processing Performance Acceleration on SLURM Compute Systems
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916397
Michael Nolan, Mark Hernandez, Philip Fremont-Smith, A. Swiston, K. Claypool
Electrocardiogram (ECG) signal features (e.g., heart rate, intrapeak interval times) are data commonly used in physiological assessment. Commercial off-the-shelf (COTS) software solutions for ECG data processing are available, but they are often developed for serialized data processing, which scales poorly for large datasets. To address this issue, we have developed a Matlab code library for parallelized ECG feature generation. This library uses the pMatlab and MatMPI interfaces to distribute computing tasks over supercomputing clusters using the Simple Linux Utility for Resource Management (SLURM). To profile its performance as a function of parallelization scale, the ECG processing code was executed on a non-human primate dataset on the Lincoln Laboratory Supercomputing TXGreen cluster. Feature processing jobs were deployed over a range of processor counts and processor types to assess the overall reduction in job computation time. We show that individual process times decrease according to a 1/n relationship with the number of processors used, while total computation times, which account for deployment and data aggregation, show diminishing returns as processor count grows. A maximum mean reduction in overall file processing time of 99% is shown.
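The reported scaling behavior can be captured in a small model: per-file feature work divides as 1/n across processors while deployment and aggregation add a fixed serial cost. The sketch below uses illustrative constants, not measurements from the paper, to show why speedup flattens at high processor counts.

```python
def total_time(work_s, n_procs, overhead_s):
    """Parallel compute time (work / n) plus serial deployment/aggregation."""
    return work_s / n_procs + overhead_s

work, overhead = 3600.0, 60.0   # hypothetical: 1 h of feature work, 1 min overhead
baseline = total_time(work, 1, overhead)
for n in (1, 8, 64, 256):
    t = total_time(work, n, overhead)
    print(f"n={n:4d}  time={t:8.1f} s  speedup={baseline / t:5.1f}x")
# Speedup saturates toward work/overhead as n grows: the diminishing returns
# the abstract attributes to deployment and data aggregation.
```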
Citations: 0
Introducing DyMonDS-as-a-Service (DyMaaS) for Internet of Things
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916560
M. Ilić, Rupamathi Jaddivada
With recent trends in computation and communication architecture, it is becoming possible to simulate complex networked dynamical systems by employing high-fidelity models. The inherent spatial and temporal complexity of these systems, however, still acts as a roadblock. It is thus desirable to have an adaptive platform design that facilitates zooming in and out of the models to emulate the time evolution of processes at a desired spatial and temporal granularity. In this paper, we propose new computing and networking abstractions that can embrace physical dynamics and computations in a unified manner by taking advantage of the inherent structure. We further design multi-rate numerical methods that can be implemented by computing architectures to facilitate adaptive zooming in and out of models spanning multiple spatial and temporal layers. These methods are all embedded in a platform called Dynamic Monitoring and Decision Systems (DyMonDS). We introduce a new service model of cloud computing called DyMonDS-as-a-Service (DyMaaS), for use by operators at various spatial granularities to efficiently emulate the interconnection of IoT devices. The usage of this platform is described in the context of an electric microgrid system emulation.
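To give a flavor of what a multi-rate numerical method does, here is a toy sketch in which a fast subsystem is stepped several times inside each slow step. The dynamics, rates, and coupling below are purely illustrative and are not DyMonDS code.

```python
def multirate_euler(f_slow, f_fast, x_slow, x_fast, dt_slow, substeps, n_steps):
    """Forward-Euler integration with two time scales: the fast state takes
    'substeps' fine steps for every coarse step of the slow state."""
    for _ in range(n_steps):                     # zoomed-out layer: coarse steps
        dt_fast = dt_slow / substeps
        for _ in range(substeps):                # zoomed-in layer: fine steps
            x_fast += dt_fast * f_fast(x_fast, x_slow)
        x_slow += dt_slow * f_slow(x_slow, x_fast)
    return x_slow, x_fast

# Example: a slow grid-level state coupled to a fast device-level state.
slow, fast = multirate_euler(f_slow=lambda s, f: -0.1 * s + 0.01 * f,
                             f_fast=lambda f, s: -5.0 * f + s,
                             x_slow=1.0, x_fast=0.0,
                             dt_slow=0.1, substeps=10, n_steps=10)
print(slow, fast)
```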
Citations: 3
Fast Stochastic Block Partitioning via Sampling
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916542
Frank Wanye, Vitaliy Gleyzer, Wu-chun Feng
Community detection in graphs, also known as graph partitioning, is a well-studied NP-hard problem. Various heuristic approaches have been adopted to tackle this problem in polynomial time. One such approach, as outlined in the IEEE HPEC Graph Challenge, is Bayesian statistics-based stochastic block partitioning. This method delivers high-quality partitions in sub-quadratic runtime, but it fails to scale to very large graphs. In this paper, we present sampling as an avenue for speeding up the algorithm on large graphs. We first show that existing sampling techniques can preserve a graph’s community structure. We then show that sampling for stochastic block partitioning can be used to produce a speedup of between 2.18× and 7.26× for graph sizes between 5,000 and 50,000 vertices without a significant loss in the accuracy of community detection.
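The sample-then-partition idea can be sketched in a few lines: partition a uniformly sampled subgraph, then infer block labels for unsampled vertices from their labeled neighbors. All names are illustrative; the paper evaluates several samplers and uses stochastic block partitioning on the sample, where the placeholder `sample_labels` below would come from.

```python
import random
from collections import Counter

def sample_vertices(adj, fraction=0.2, seed=0):
    """Uniformly sample a fraction of the vertices of an adjacency-list graph."""
    rng = random.Random(seed)
    k = max(1, int(fraction * len(adj)))
    return set(rng.sample(sorted(adj), k))

def extend_partition(adj, sample_labels):
    """Give each unsampled vertex the majority block among labeled neighbors;
    vertices with no labeled neighbors fall back to block 0."""
    labels = dict(sample_labels)   # blocks found on the sampled subgraph
    for v in adj:
        if v not in labels:
            votes = Counter(labels[u] for u in adj[v] if u in labels)
            labels[v] = votes.most_common(1)[0][0] if votes else 0
    return labels
```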
Citations: 5
Target-based Resource Allocation for Deep Learning Applications in a Multi-tenancy System
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916403
Wenjia Zheng, Yun Song, Zihao Guo, Yongcheng Cui, Suwen Gu, Ying Mao, Long Cheng
Neural-network-based deep learning is the key technology that enables many powerful applications, including self-driving vehicles, computer vision, and natural language processing. Although various algorithms focus on different directions, they generally employ an iteration-by-iteration training and evaluation process. Each iteration aims to find a parameter set that minimizes a loss function defined by the learning model. When the training process completes, the global minimum is achieved with a set of optimized parameters. At this stage, deep learning applications can be shipped with a trained model to provide services. While deep learning applications are reshaping our daily life, obtaining a good learning model is an expensive task. Training deep learning models is usually time-consuming and requires substantial resources, e.g., CPUs and GPUs. In a multi-tenancy system, however, limited resources are shared by multiple clients, which leads to severe resource contention. Therefore, a carefully designed resource management scheme is required to improve overall performance. In this project, we propose a target-based scheduling scheme named TRADL. In TRADL, developers have the option to specify a two-tier target. If the accuracy of the model reaches a target, it can be delivered to clients while training continues to improve its quality. The experiments show that TRADL is able to reduce the time needed to reach the target by as much as 48.2%.
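A schematic of the two-tier target idea follows: once validation accuracy crosses the first tier, a usable checkpoint is released to clients while training keeps running toward the final tier. The functions and thresholds here are toy stand-ins, not the TRADL implementation.

```python
import random

def train_with_two_tier_target(train_epoch, evaluate, deliver,
                               tier1=0.90, tier2=0.95, max_epochs=100):
    """Release an early model at tier1 accuracy; keep training until tier2."""
    delivered = False
    for epoch in range(max_epochs):
        train_epoch()
        acc = evaluate()
        if not delivered and acc >= tier1:
            deliver(epoch, acc)      # ship a good-enough model early
            delivered = True
        if acc >= tier2:
            return epoch, acc        # final quality target reached
    return max_epochs - 1, acc

# Toy stand-ins: accuracy climbs noisily toward 1.0 as training proceeds.
rng, state = random.Random(0), {"acc": 0.5}

def train_epoch():
    state["acc"] = min(1.0, state["acc"] + rng.uniform(0.0, 0.05))

print("done at:", train_with_two_tier_target(
    train_epoch,
    evaluate=lambda: state["acc"],
    deliver=lambda e, a: print(f"tier-1 delivery at epoch {e}: acc={a:.3f}")))
```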
Citations: 20
Towards Improving Rate-Distortion Performance of Transform-Based Lossy Compression for HPC Datasets
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916286
Jialing Zhang, Aekyeung Moon, Xiaoyan Zhuo, S. Son
As the size and amount of data produced by high-performance computing (HPC) applications grow exponentially, effective data reduction techniques are becoming critical to mitigating the time and space burden. Lossy compression techniques, which have been widely used in image and video compression, hold promise to fulfill this data reduction need. However, they are seldom adopted for HPC datasets because of the difficulty of quantifying the amount of information loss and data reduction. In this paper, we explore a lossy compression strategy by revisiting the energy compaction properties of discrete transforms on HPC datasets. Specifically, we apply block-based transforms to HPC datasets, obtain the minimum number of coefficients containing the maximum energy (or information) compaction rate, and quantize the remaining non-dominant coefficients using a binning mechanism to minimize information loss as expressed by a distortion measure. We implement the proposed approach and evaluate it using six real-world HPC datasets. Our experimental results show that, on average, only 6.67 bits are required to preserve an optimal energy compaction rate on our evaluated datasets. Moreover, our knee detection algorithm improves the distortion in terms of peak signal-to-noise ratio by 2.46 dB on average.
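A minimal sketch of block-based energy compaction appears below: apply an orthonormal DCT to each block, keep the k highest-energy coefficients, and measure how much signal energy they retain. The block size and k are illustrative; the paper additionally quantizes the remaining coefficients with a binning mechanism and selects k via knee detection.

```python
import numpy as np
from scipy.fftpack import dct

def compaction_rate(signal, block=64, k=8):
    """Fraction of total energy captured by the k largest DCT coefficients
    per block (the orthonormal DCT preserves energy, so ratios are exact)."""
    n_blocks = len(signal) // block
    blocks = signal[:n_blocks * block].reshape(n_blocks, block)
    coeffs = dct(blocks, norm="ortho", axis=1)   # energy-preserving transform
    energy = coeffs ** 2
    top_k = np.sort(energy, axis=1)[:, -k:]      # k dominant coefficients/block
    return top_k.sum() / energy.sum()

# A smooth-ish synthetic field (random walk) as a stand-in for HPC data.
x = np.cumsum(np.random.default_rng(0).normal(size=4096))
print(f"{compaction_rate(x):.4f}")  # close to 1.0: few coefficients dominate
```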
Citations: 8