
Proceedings of the 16th ACM International Conference on Computing Frontiers: Latest Publications

Fixed point exploitation via compiler analyses and transformations: POSTER
Pub Date : 2019-04-30 DOI: 10.1145/3310273.3323424
Daniele Cattaneo, Antonio Di Bello, M. Chiari, Stefano Cherubin, G. Agosta
Fixed point computation represents a key feature in the design process of embedded applications. It is also exploited as a means of data size tuning for HPC tasks [2]. Since the conversion from floating point to fixed point is generally performed manually, it is time-consuming and error-prone. However, full automation of this task is currently unfeasible, as existing open source tools are not mature enough for industry adoption. To bridge this gap, we introduce our Tuning Assistant for Floating point to Fixed point Optimization (TAFFO). TAFFO is a toolset of LLVM compiler plugins that automatically converts computations from floating point to fixed point. TAFFO leverages programmer hints to understand the characteristics of the input data, and then performs the code conversion using the most appropriate data types. TAFFO allows programmers to apply fine-grained precision tuning uniformly across a wide range of programming languages, whereas most current competitors are limited to C. Moreover, it is easily applicable to most embedded [1] and high performance applications [10, 11], and it is easy to maintain and extend.
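As a rough illustration of the transformation TAFFO automates, the Python sketch below converts a floating-point multiply to fixed point given a programmer-supplied value range. This is an assumption-laden sketch, not TAFFO's actual API: the function names and the bit-split heuristic are made up for illustration.

```python
import math

# Illustrative sketch only: pick a Q-format from a programmer hint about
# the value range, then run a multiply in fixed point. Names and the
# bit-split heuristic are assumptions, not TAFFO's actual API.

def frac_bits_for_range(lo: float, hi: float, word_bits: int = 32) -> int:
    """Choose fractional bits so that values in [lo, hi] fit the integer part."""
    magnitude = max(abs(lo), abs(hi), 1.0)
    int_bits = math.ceil(math.log2(magnitude)) + 1   # +1 for the sign bit
    return word_bits - int_bits

def to_fixed(x: float, frac: int) -> int:
    return round(x * (1 << frac))

def from_fixed(x: int, frac: int) -> float:
    return x / (1 << frac)

def fixed_mul(a: int, b: int, frac: int) -> int:
    # The raw product carries 2*frac fractional bits; shift back to frac.
    return (a * b) >> frac

frac = frac_bits_for_range(-4.0, 4.0)                 # hint: values stay in [-4, 4]
a, b = to_fixed(3.14159, frac), to_fixed(0.5, frac)
print(from_fixed(fixed_mul(a, b, frac), frac))        # ~1.5708
```

Tracking this format bookkeeping (the shift after every multiply, the bit split per variable) across a whole program is exactly the tedious, error-prone work the abstract says manual conversion entails.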
Citations: 0
Abstracting parallel program specification: a case study on k-means clustering
Pub Date : 2019-04-30 DOI: 10.1145/3310273.3322828
A. Hommelberg, K. Rietveld, H. Wijshoff
The Forelem framework was first introduced to optimize database queries using compiler techniques. Since its introduction, Forelem has proven to be more versatile and applicable beyond database applications. In this paper we show that Forelem can be used to specify parallel programs at an abstract level whilst still guaranteeing efficient parallel execution. This is achieved by a sequence of transformations that can be directly implemented as an optimizing compiler toolchain. To demonstrate this, a case study is described, k-Means clustering, for which four implementations are mechanically generated that improve on standard MPI C/C++ implementations and outperform state-of-the-art Hadoop implementations.
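To make the idea concrete, here is a hedged sketch of k-means written as whole-collection operations; it is not the paper's Forelem specification language, which is not reproduced here. Both steps are independent across points and clusters, which is what lets a transformation chain mechanically target backends such as MPI or Hadoop.

```python
import numpy as np

def kmeans(points: np.ndarray, k: int, iters: int = 20) -> np.ndarray:
    centroids = points[np.random.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest centroid per point (independent per point).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: per-cluster mean, i.e. a reduction over each cluster.
        centroids = np.array([
            points[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
    return centroids

print(kmeans(np.random.randn(500, 2), k=4))
```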
Citations: 2
Toward a graph-based dependence analysis framework for high level design verification
Pub Date : 2019-04-30 DOI: 10.1145/3310273.3323433
John D. Leidel, Frank Conlon
Recent efforts to deploy FPGAs and application-specific accelerator devices in scalable data center environments have led to a resurgence in research on high level synthesis and design verification. The goal of this research has been to accelerate the initial design, verification and deployment process for abstract accelerator platforms. While research on high level synthesis flows has provided significant gains in design acceleration, research on verifying these designs has largely been based upon augmenting traditional methodologies. This work introduces the CoreGen high level design verification infrastructure. Its goal is to provide rapid, high level design verification for complex, heterogeneous hardware architectures. Unlike traditional high-level verification strategies, CoreGen utilizes an intermediate representation (IR) of the target design constructed as a directed acyclic graph (DAG). CoreGen then applies classic compiler dependence analysis techniques using a multitude of graph inference and combinatorial logic solvers. Applying traditional compiler dependence analysis over directed acyclic graphs makes it possible to optimize the performance of the high level verification pipeline regardless of the target design complexity. We highlight this capability by demonstrating how verification performance scales on a complex, heterogeneous design input. Our results indicate performance competitive with traditional optimizing compilers.
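As a hedged sketch of the classic dependence analysis the abstract refers to, the fragment below builds a dependence DAG over hypothetical design components and uses Kahn's algorithm both to detect cycles (illegal dependences) and to derive a processing order. CoreGen's actual IR and solvers are richer than this; the component names are illustrative.

```python
from collections import deque

def dependence_order(deps: dict[str, list[str]]) -> list[str]:
    """deps maps each component to the components it depends on."""
    nodes = set(deps) | {d for ds in deps.values() for d in ds}
    indegree = {n: 0 for n in nodes}
    users: dict[str, list[str]] = {n: [] for n in nodes}
    for comp, ds in deps.items():
        for d in ds:
            indegree[comp] += 1      # comp waits on d
            users[d].append(comp)    # d is used by comp
    ready = deque(n for n in nodes if indegree[n] == 0)
    order: list[str] = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for u in users[n]:
            indegree[u] -= 1
            if indegree[u] == 0:
                ready.append(u)
    if len(order) != len(nodes):
        raise ValueError("cyclic dependence found; design is not a DAG")
    return order

print(dependence_order({"alu": ["regfile"], "regfile": ["clock"], "clock": []}))
# ['clock', 'regfile', 'alu']
```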
Citations: 0
PGAS for graph analytics: can one sided communications break the scalability barrier?
Pub Date : 2019-04-30 DOI: 10.1145/3310273.3324293
J. Langguth
As the world becomes increasingly interconnected and systems increasingly complex, technologies that can analyze connected systems and their dynamic characteristics become indispensable. Consequently, the last decade has seen increasing interest in graph analytics, which allows obtaining insights from such connected data. Parallel graph analytics can reveal the workings of intricate systems and networks at massive scales, found in areas as diverse as social networks, economic transactions, and protein interactions. While sequential graph algorithms have been studied for decades, the recent availability of massive datasets has given rise to the need for parallel graph processing, which poses unique challenges.
Citations: 0
NeuPow: artificial neural networks for power and behavioral modeling of arithmetic components in 45nm ASICs technology
Pub Date : 2019-04-30 DOI: 10.1145/3310273.3322820
Y. Nasser, Carlo Sau, Jean-Christophe Prévotet, Tiziana Fanni, F. Palumbo, M. Hélard, L. Raffo
In this paper, we present a flexible, simple and accurate power modeling technique that can be used to estimate the power consumption of modern technology devices. We exploit Artificial Neural Networks for power and behavioral estimation in Application Specific Integrated Circuits. Our method, called NeuPow, relies on propagating the predictors between connected neural models to estimate the dynamic power consumption of the individual components. As a first proof of concept, to study the effectiveness of NeuPow, we ran both component-level and system-level tests on the Open GPDK 45 nm technology from Cadence, achieving errors below 1.5% and 9% at the component and system level respectively. In addition, NeuPow demonstrated a speedup factor of 2490X.
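The composition idea can be sketched as follows, with simple linear stand-ins (made-up coefficients) in place of the paper's trained neural networks: each component model maps its input predictors, here a single switching-activity figure, to a power estimate plus the predictors of its outputs, which feed the next model downstream.

```python
# Linear stand-ins (made-up coefficients) for the paper's trained neural
# models: each maps input predictors (here, a switching-activity figure)
# to a power estimate and the predictors of its outputs.

def adder_model(activity: float) -> tuple[float, float]:
    power_mw = 0.80 * activity + 0.05     # illustrative fit, not real data
    out_activity = 0.90 * activity        # propagated predictor
    return power_mw, out_activity

def multiplier_model(activity: float) -> tuple[float, float]:
    power_mw = 2.10 * activity + 0.12
    out_activity = 0.85 * activity
    return power_mw, out_activity

# System estimate for an adder feeding a multiplier: propagate the
# predictor along the connection and sum the per-component powers.
p_add, act = adder_model(0.5)
p_mul, _ = multiplier_model(act)
print(f"estimated dynamic power: {p_add + p_mul:.3f} mW")
```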
Citations: 5
Accelerating tile low-rank GEMM on sunway architecture: POSTER
Pub Date : 2019-04-30 DOI: 10.1145/3310273.3323425
Qingchang Han, Hailong Yang, Zhongzhi Luan, D. Qian
Tile Low-Rank (TLR) GEMM can significantly reduce the amount of computation and the memory footprint of matrix multiplication while preserving the same level of accuracy [1]. TLR-GEMM is based on the TLR data format, an efficient method to store large-scale sparse matrices. The large matrix is divided into several blocks, also known as tiles, and each non-diagonal tile is compressed into the product of two tall-and-skinny matrices (a low-rank data format). TLR-GEMM multiplies TLR matrices A and B to obtain matrix C. TLR-GEMM can be implemented in batch mode: multiple threads are started, and each thread applies the operations onto its corresponding tiles, including dense GEMM, SVD and QR decomposition. One research challenge in the field of TLR-GEMM is that modern high-performance processors often use diverse architectures, which requires adapting to their unique architectural features to achieve good performance.
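The cost saving behind this format can be sketched in a few lines of NumPy (illustrative only, not the Sunway kernel): two low-rank tiles multiply through a small k-by-k core, so the product costs O(n*k^2) instead of the dense O(n^3), and the result stays in low-rank form.

```python
import numpy as np

# Sketch of the key saving in TLR-GEMM: a non-diagonal tile is stored as
# U @ V.T with tall-and-skinny n-by-k factors, and two such tiles multiply
# through the small k-by-k core (V1.T @ U2).

n, k = 1024, 16
U1, V1 = np.random.randn(n, k), np.random.randn(n, k)   # tile A = U1 @ V1.T
U2, V2 = np.random.randn(n, k), np.random.randn(n, k)   # tile B = U2 @ V2.T

core = V1.T @ U2            # k x k, the only cross term
U3, V3 = U1 @ core, V2      # product tile A @ B = U3 @ V3.T, still rank <= k

# Verify against the dense product of the two tiles.
dense = (U1 @ V1.T) @ (U2 @ V2.T)
print(np.allclose(dense, U3 @ V3.T))   # True
```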
Citations: 1
Data and model convergence: a case for software defined architectures
Pub Date : 2019-04-30 DOI: 10.1145/3310273.3323438
Antonino Tumeo
High Performance Computing, data analytics, and machine learning are often considered three separate and different approaches. Applications, software and now hardware stacks are typically designed to address only one of these areas at a time. This creates a false distinction across the three different areas. In reality, domain scientists need to exercise all three approaches in an integrated way. For example, large scale simulations generate enormous amounts of data, to which Big Data Analytics techniques can be applied. Or, as scientists seek to use data analytics as well as simulation for discovery, machine learning can play an important role in making sense of the disparate sources' information. Pacific Northwest National Laboratory is launching a new Laboratory Directed Research and Development (LDRD) Initiative, the Data-Model Convergence (DMC) Initiative, to investigate the integration of the three techniques at all levels of the high-performance computing stack. The DMC Initiative aims to increase scientist productivity by enabling purpose-built software and hardware and domain-aware ML techniques. In this talk, I will present the objectives of PNNL's DMC Initiative, highlighting the research that will be performed to enable the integration of vastly different programming paradigms and mental models. I will then make the case for how reconfigurable architectures could represent a great opportunity to address the challenges of DMC. In principle, the possibility of dynamically modifying the architecture at runtime could provide a way to address the requirements of workloads whose behavior differs significantly across phases, without losing too much flexibility or programmer productivity with respect to highly heterogeneous architectures composed of a sea of fixed application-specific accelerators. Reconfigurable architectures have been explored for a long time, and arguably new software breakthroughs are required to make them successful. I will thus present the efforts that the DMC initiative is launching to design a productive toolchain for upcoming novel reconfigurable systems.
Citations: 0
Parallel edge-based sampling for static and dynamic graphs
Pub Date : 2019-04-30 DOI: 10.1145/3310273.3323052
Kartik Lakhotia, R. Kannan, Aditya Gaur, Ajitesh Srivastava, V. Prasanna
Graph sampling is an important tool to obtain small and manageable subgraphs from large real-world graphs. Prior research has shown that Induced Edge Sampling (IES) outperforms other sampling methods in terms of the quality of the subgraph obtained. Even though fast sampling is crucial for several workflows, there has been little past work on parallel sampling algorithms. In this paper, we present parIES, a framework for parallel Induced Edge Sampling on shared-memory parallel machines. parIES, equipped with optimized load balancing and synchronization-avoiding strategies, can sample both static and streaming dynamic graphs, while achieving high scalability and parallel efficiency. We develop a lightweight concurrent hash table coupled with a space-efficient dynamic graph data structure to overcome the challenges and memory constraints of sampling streaming dynamic graphs. We evaluate parIES on a 16-core (32-thread) Intel server using 7 large synthetic and real-world networks. From a static graph, parIES can sample a subgraph with > 1.4B edges in < 2.5s and achieve up to 15.5X parallel speedup. For dynamic streaming graphs, parIES can process up to 86.7M edges per second, achieving 15X parallel speedup.
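For reference, here is a sequential sketch of induced edge sampling, the algorithm parIES parallelizes: edges are sampled uniformly to pick a vertex set, then every edge internal to that set is retained. parIES's concurrent hash table and dynamic-graph structure are omitted, and the names below are illustrative.

```python
import random

def induced_edge_sample(edges: list[tuple[int, int]], target_vertices: int):
    """TIES-style sampling; target_vertices must not exceed the number of
    distinct vertices in the graph, or the loop will not terminate."""
    sampled: set[int] = set()
    while len(sampled) < target_vertices:
        u, v = random.choice(edges)          # edge-based vertex selection
        sampled.update((u, v))
    # Induction step: keep every original edge internal to the sampled set.
    return [(u, v) for (u, v) in edges if u in sampled and v in sampled]

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print(induced_edge_sample(edges, target_vertices=3))
```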
Citations: 2
Go green radio astronomy: Approximate Computing Perspective: Opportunities and Challenges: POSTER
Pub Date : 2019-04-30 DOI: 10.1145/3310273.3323427
G. Gillani, A. Kokkeler
Modern radio telescopes require highly energy- and power-efficient computing systems. The signal processing pipelines of such radio telescopes are dominated by accumulation-based iterative processes. As the input signal received at a radio telescope is regarded as Gaussian noise, employing approximate computing looks promising. Therefore, we present the opportunities and challenges offered by the approximate computing paradigm in reaching the required efficiency targets.
Citations: 0
High performance, power efficient hardware accelerators: emerging devices, circuits and architecture co-design
Pub Date : 2019-04-30 DOI: 10.1145/3310273.3324055
Catherine E. Graves
General-purpose digital systems have long benefited from favorable scaling, but performance improvements have slowed dramatically in the last decade. Computing is therefore returning to custom and specialized systems, frequently using heterogeneous accelerators. Driven in particular by the data-centric workloads of machine learning and deep learning, intense development is currently underway not only of conventional accelerators (GPUs, FPGAs, CMOS ASICs) but also of unconventional accelerators using novel circuits and devices beyond CMOS. In this talk, I will discuss some common characteristics of high-performance and power-efficient accelerators in this diverse space and the ecosystem development (such as new interconnects) needed for them to thrive. To illustrate accelerator characteristics and their potential, I will describe our group's efforts to co-design from algorithms and architectures down to novel devices for gains in speed and power. We have developed architectures leveraging the analog and non-volatile nature of memristors (tunable resistance switches) assembled in crossbar arrays to accelerate machine learning, image and signal processing. We have also developed new circuits and assembled architectures to accelerate Finite Automata, enabling the rapid pattern matching used in applications from security to genomics. Significant improvements over CPUs, GPUs, and custom digital ASICs are forecast for both such systems, highlighting the potential for unconventional accelerators in future high-performance computing systems.
Citations: 0