
Latest publications: 2017 IEEE High Performance Extreme Computing Conference (HPEC)

Model-based compute orchestration for resource-constrained repeating flows
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091089
Nazario Irizarry
Designing controllers to orchestrate repetitive compute flows in both embedded and multi-node heterogeneous compute systems can be a tedious activity that gets increasingly difficult as more constraints are placed on compute elements and the system and as internal connections get more complex. It becomes difficult to manually analyze the timing characteristics and resource utilization profiles for the most beneficial flow solutions when there are multiple busses, networks, data buffers, and processor choices. The controller design must consider the sequencing of the operations, the movement of data, the utilization of limited resources, and the mechanics of controlling the system while satisfying system limitations. This paper presents a model for expressing resources, constraints, and flows, then automatically finding a flow solution and generating a controller. Automation frees the engineer to analyze timing profiles and to implement generic interfaces that the generated controller can use to interact with and command the system automatically.
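The paper's model is not reproduced in the abstract; as a minimal illustrative sketch (all task and resource names below are hypothetical), a repeating flow can be expressed as steps with dependencies and resource requirements, from which one feasible execution order is derived automatically:

```python
from graphlib import TopologicalSorter

# Hypothetical toy flow: each step names its upstream dependencies and
# the exclusive resource it needs (e.g. a bus or a processor).
flow = {
    "capture": {"deps": [],          "resource": "adc"},
    "filter":  {"deps": ["capture"], "resource": "dsp"},
    "fft":     {"deps": ["filter"],  "resource": "gpu"},
    "publish": {"deps": ["fft"],     "resource": "bus0"},
}

def schedule(flow):
    """Return one feasible execution order that respects dependencies."""
    ts = TopologicalSorter({k: v["deps"] for k, v in flow.items()})
    return list(ts.static_order())

order = schedule(flow)
print(order)  # ['capture', 'filter', 'fft', 'publish']
```

A real controller generator would additionally account for resource contention and timing, which this sketch omits.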
Citations: 0
Exploiting half precision arithmetic in Nvidia GPUs
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091072
Nhut-Minh Ho, W. Wong
With the growing importance of deep learning and energy-saving approximate computing, half precision floating point arithmetic (FP16) is fast gaining popularity. Nvidia's recent Pascal architecture was the first GPU that offered FP16 support. However, when actual products were shipped, programmers soon realized that a naïve replacement of single precision (FP32) code with half precision led to disappointing performance results, even if they were willing to tolerate the increase in error that precision reduction brings. In this paper, we developed an automated conversion framework to help users migrate their CUDA code to better exploit Pascal's half precision capability. Using our tools and techniques, we successfully converted many benchmarks from single precision arithmetic to their half precision equivalents, and achieved significant speedups in many cases. In the best case, a 3× speedup over the FP32 version was achieved. We shall also discuss some new issues and opportunities that the Pascal GPUs brought.
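The conversion framework itself is not shown in the abstract; the precision trade-off it navigates can be illustrated in plain Python, which can round-trip a value through IEEE 754 binary16 via the struct module's 'e' format:

```python
import struct

def to_fp16(x: float) -> float:
    """Round a double to IEEE 754 half precision (binary16) and back."""
    return struct.unpack('e', struct.pack('e', x))[0]

# FP16 keeps a 10-bit significand (~3 decimal digits), so values such
# as 0.1 pick up rounding error that FP32/FP64 code would not show.
print(to_fp16(1.0))      # 1.0 (exactly representable)
print(to_fp16(0.1))      # 0.0999755859375
print(to_fp16(65504.0))  # 65504.0, the largest finite FP16 value
```

Values above 65504 cannot be packed at all, which is one reason a naïve FP32-to-FP16 substitution can fail outright rather than merely lose accuracy.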
Citations: 47
TriX: Triangle counting at extreme scale
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091036
Yang Hu, P. Kumar, Guy Swope, Huimin Huang
Triangle counting is widely used in many applications including spam detection, link recommendation, and social network analysis. The DARPA Graph Challenge seeks a scalable solution for triangle counting on big graphs. In this paper we present TriX, a scalable triangle counting framework that comprises a 2-D graph partition strategy and a binary-search-based intersection algorithm designed for GPUs. The 2-D partition provides balanced work division among multiple GPUs, while binary-search-based intersection achieves fine-grained parallelism on GPUs via intra-warp scheduling and coalesced memory access. TriX is able to scale to a large number of GPUs and count triangles on a billion-node graph (2 billion nodes, 64 billion edges) within 35 minutes, achieving over 16 million traversed edges per second (TEPS).
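TriX's GPU kernels are not reproduced here; the binary-search flavor of sorted-set intersection it builds on can be sketched on a CPU with sorted adjacency lists:

```python
from bisect import bisect_left

def count_triangles(adj):
    """adj: dict vertex -> sorted list of neighbors (undirected graph).
    For each edge (u, v) with u < v, count common neighbors w > v via
    binary search, so every triangle is counted exactly once."""
    total = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            if v <= u:
                continue
            # probe candidates from the shorter list in the longer one
            a, b = adj[u], adj[v]
            if len(b) < len(a):
                a, b = b, a
            for w in a:
                if w <= v:
                    continue
                i = bisect_left(b, w)
                if i < len(b) and b[i] == w:
                    total += 1
    return total

# A 4-clique contains 4 triangles.
adj = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
print(count_triangles(adj))  # 4
```

On a GPU the inner binary searches map naturally onto threads of a warp, which is the fine-grained parallelism the abstract refers to.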
Citations: 22
A top-down scheme of descriptive time series data analysis for healthy life: Introducing a fuzzy amended interaction network
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091065
R. Rajaei, B. Shafai, A. Ramezani
Not only are networks ubiquitous in the real world, but networked dynamics also provide a more precise framework for better understanding surrounding phenomena and data. This network-centric approach can be applied to analyze time series data of any type. The abundance of time series observations demands inference of causality in addition to accurate prediction. In this paper, a fuzzy-amended interaction network based on generalized Lotka-Volterra dynamics is introduced and referred to as FuzzIN. FuzzIN offers a top-down method to predict and describe potential connectivity information embedded in time series. Using FuzzIN, this paper studies the effects of healthcare systems on population health across 21 OECD countries between 1999 and 2012 via OECD Health Data. It is shown that FuzzIN performs well due to its capability of handling nonlinearities, complex interconnectivities, and uncertainties in the observed data, and outperforms comparable statistical methods. Hence, the relationships are inferred and healthcare systems' performance is discussed in terms of FuzzIN parameters and rules. These estimates can be used to highlight health indicators and problems and to raise awareness for the development and implementation of effective, targeted public health policies and activities.
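For reference, the standard form of the generalized Lotka-Volterra dynamics that FuzzIN builds on (the paper's fuzzy amendments are not reproduced here) is:

```latex
\frac{dx_i}{dt} = x_i \Bigl( r_i + \sum_{j=1}^{N} a_{ij}\, x_j \Bigr), \qquad i = 1, \dots, N
```

where \(x_i\) is the \(i\)-th state variable (here, a health indicator), \(r_i\) its intrinsic growth rate, and \(a_{ij}\) the interaction-network weight that the method infers from the observed time series.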
Citations: 0
Parallel triangle counting and k-truss identification using graph-centric methods
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091037
C. Voegele, Yi-Shan Lu, Sreepathi Pai, K. Pingali
We describe CPU and GPU implementations of parallel triangle-counting and k-truss identification in the Galois and IrGL systems. Both systems are based on a graph-centric abstraction called the operator formulation of algorithms. Depending on the input graph, our implementations are two to three orders of magnitude faster than the reference implementations provided by the IEEE HPEC static graph challenge.
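The Galois/IrGL implementations are not shown in the abstract; the k-truss itself has a standard definition (the maximal subgraph in which every edge participates in at least k − 2 triangles), and a simple reference peeling loop, not the paper's operator formulation, looks like this:

```python
def k_truss(edges, k):
    """Return the edge set of the k-truss: repeatedly remove edges whose
    triangle support (number of common neighbors of the endpoints) is
    below k - 2, until none remain."""
    E = {tuple(sorted(e)) for e in edges}
    while True:
        adj = {}
        for u, v in E:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        weak = {(u, v) for u, v in E
                if len(adj[u] & adj[v]) < k - 2}
        if not weak:
            return E
        E -= weak

# A 4-clique plus a pendant edge: the 4-truss is exactly the clique.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
print(sorted(k_truss(edges, 4)))
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

Parallel versions such as the paper's recompute support incrementally instead of rebuilding the adjacency structure on every pass.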
Citations: 38
Autonomous, independent management of dynamic graphs on GPUs
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091058
Martin Winter, Rhaleb Zayer, M. Steinberger
In this paper, we present a new, dynamic graph data structure built to deliver high update rates while keeping a low memory footprint, using autonomous memory management directly on the GPU. By transferring memory management to the GPU, efficient updating of the graph structure and fast initialization are enabled, as no additional memory allocation or reallocation calls are necessary: they are handled directly on the device. In comparison to previous work, this optimized approach allows for significantly lower initialization times (up to 300× faster) and much higher update rates for significant changes to the graph structure, with equal rates for small changes. The framework provides different update implementations tailored to different graph properties, enabling over 100 million updates per second and keeping tens of millions of vertices and hundreds of millions of edges in memory without transferring data back and forth between device and host.
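The paper's GPU allocator is not described in detail in the abstract; the general idea of growing per-vertex edge storage in fixed-size pages from a preallocated pool, rather than issuing a fresh allocation per update, can be sketched as follows (illustrative only, not the paper's data structure):

```python
class DynGraph:
    """Toy dynamic adjacency store: each vertex owns fixed-size pages
    that are chained as its edge list grows, so an edge insertion never
    reallocates existing storage."""
    PAGE = 4  # edges per page

    def __init__(self, n):
        self.pages = [[[]] for _ in range(n)]  # per-vertex page chain

    def add_edge(self, u, v):
        last = self.pages[u][-1]
        if len(last) == self.PAGE:   # page full: chain a new one
            last = []
            self.pages[u].append(last)
        last.append(v)

    def neighbors(self, u):
        return [v for page in self.pages[u] for v in page]

g = DynGraph(3)
for v in range(6):
    g.add_edge(0, v)
print(g.neighbors(0))   # [0, 1, 2, 3, 4, 5]
print(len(g.pages[0]))  # 2 pages after 6 inserts with PAGE = 4
```

On a GPU the page pool would live in device memory and be carved up by device-side code, which is what removes the host round-trips the abstract mentions.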
Citations: 29
Triangle counting via vectorized set intersection
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091053
Shahir Mowlaei
In this paper we propose a vectorized sorted set intersection approach for the task of counting the exact number of triangles of a graph on CPU cores. The computation is factorized into reordering and counting kernels where the reordering kernel builds upon the Reverse Cuthill-McKee heuristic.
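The Reverse Cuthill-McKee heuristic that the reordering kernel builds on is standard: BFS from a minimum-degree seed, visiting neighbors in order of increasing degree, then reversing the order to reduce bandwidth. A minimal sketch for a connected graph (not the paper's vectorized kernel):

```python
from collections import deque

def reverse_cuthill_mckee(adj):
    """Reverse Cuthill-McKee ordering for a connected undirected graph
    given as {vertex: set of neighbors}."""
    seed = min(adj, key=lambda v: len(adj[v]))
    order, seen = [], {seed}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in sorted(adj[u] - seen, key=lambda v: len(adj[v])):
            seen.add(v)
            queue.append(v)
    return order[::-1]

# Path graph 0-1-2-3: RCM walks the path and reverses it.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(reverse_cuthill_mckee(adj))  # [3, 2, 1, 0]
```

The payoff for triangle counting is locality: after reordering, neighbor lists of adjacent vertices overlap in nearby index ranges, which keeps the vectorized intersections in cache.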
Citations: 8
Optimal data layout for block-level random accesses to scratchpad
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091088
Shreyas G. Singapura, R. Kannan, V. Prasanna
3D memory is becoming an increasingly popular technology for overcoming the performance gap between memory and processors. It has led to the development of new architectures with scratchpad memory, which offer high bandwidth and user-controlled access. Ideally, this scratchpad memory would deliver peak bandwidth for any random block access. However, 3D memories constrain the "ideal" access patterns for which high bandwidth is guaranteed; actual bandwidth is significantly lower for other access patterns. In this paper, we address the challenge of achieving high bandwidth for random block accesses to 3D memory. We present an optimal data layout that achieves maximum bandwidth for each vault irrespective of the block accessed in a vault. Our data layout, expressed as a mapping function determined by the architecture parameters, exploits inter-layer pipelining to map the elements of each block among the layers of a vault in a specific pattern. By doing so, our data layout can absorb the latency of accesses to banks in the same layer and, more importantly, hide the latency of accesses to different rows in the same bank irrespective of the block being accessed. We compare the performance of our proposed data layout with an existing data layout using PARSEC 2.0 benchmarks. Our experimental results demonstrate up to a 56% improvement in access time compared with the existing data layout across various workloads.
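The paper's mapping function is derived from the actual 3D-memory architecture parameters and is not given in the abstract; as a hypothetical sketch of the idea, consecutive elements of a block can be spread round-robin across the layers of a vault, offset by the block's base, so that successive accesses pipeline across layers:

```python
def layer_of(element_idx, block_base, num_layers):
    """Hypothetical layer-mapping function: round-robin placement of a
    block's elements across vault layers, offset by the block base
    (illustrative only; not the paper's derived mapping)."""
    return (block_base + element_idx) % num_layers

# An 8-element block starting at base 5 in a 4-layer vault:
print([layer_of(i, 5, 4) for i in range(8)])  # [1, 2, 3, 0, 1, 2, 3, 0]
```

Because no two consecutive elements land in the same layer, a streaming access to the block never waits on a busy layer, regardless of where the block starts.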
Citations: 1
Mixed data layout kernels for vectorized complex arithmetic
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091024
Doru-Thom Popovici, F. Franchetti, Tze Meng Low
Implementing complex arithmetic routines with Single Instruction Multiple Data (SIMD) instructions requires the use of instructions that are usually not found in their real arithmetic counterparts. These instructions, such as shuffles and addsub, are often bottlenecks for many complex arithmetic kernels, as modern architectures can usually perform more real arithmetic operations than they can execute instructions for complex arithmetic. In this work, we focus on using a variety of data layouts (mixed format) for storing complex numbers at different stages of the computation so as to limit the use of these instructions. Using complex matrix multiplication and Fast Fourier Transforms (FFTs) as our examples, we demonstrate that performance improvements of up to 2× can be attained with mixed format within the computational routines. We also describe how existing algorithms can be easily modified to implement the mixed-format complex layout.
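The layout distinction can be shown without SIMD intrinsics. In interleaved format ([re0, im0, re1, im1, ...]) a vectorized complex multiply must shuffle real and imaginary parts into matching lanes; in split format (separate real and imaginary arrays) the same multiply is four plain elementwise products, the kind of shuffle-free kernel a mixed-format scheme can use internally:

```python
def cmul_split(ar, ai, br, bi):
    """(ar + i*ai) * (br + i*bi), elementwise, on split-format arrays:
    only elementwise multiplies, adds, and subtracts -- no lane shuffles."""
    cr = [x * y - u * v for x, y, u, v in zip(ar, br, ai, bi)]
    ci = [x * v + u * y for x, y, u, v in zip(ar, br, ai, bi)]
    return cr, ci

# (1+2i)*(3+4i) = -5+10i and (2+0i)*(0+1i) = 0+2i
print(cmul_split([1, 2], [2, 0], [3, 0], [4, 1]))  # ([-5, 0], [10, 2])
```

A mixed-format kernel converts between interleaved and split layouts only at stage boundaries, paying the shuffle cost once instead of inside every multiply.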
Citations: 10
First look: Linear algebra-based triangle counting without matrix multiplication
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091046
Tze Meng Low, Varun Nagaraj Rao, Matthew Kay Fei Lee, Doru-Thom Popovici, F. Franchetti, Scott McMillan
Linear algebra-based approaches to exact triangle counting often require sparse matrix multiplication as a primitive operation. Non-linear-algebra approaches to the same problem often assume that the adjacency matrix of the graph is not available. In this paper, we show that both approaches can be unified into a single approach that separates the data format from the algorithm design. By not casting the triangle counting algorithm into matrix multiplication, a different algorithm that counts each triangle exactly once can be identified. In addition, by choosing the appropriate sparse matrix format, we show that the same algorithm is equivalent to the compact-forward algorithm obtained when the adjacency matrix is assumed unavailable. We show that our approach yields an initial implementation that is between 69 and more than 2000 times faster than the reference implementation. We also show that the initial implementation can be easily parallelized on shared memory systems.
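A standard formulation of the compact-forward algorithm the abstract refers to (this sketch is a common textbook rendering, not the paper's implementation): rank vertices by degree, direct every edge from lower to higher rank, and intersect out-neighborhoods, so each triangle is counted exactly once:

```python
def compact_forward(adj):
    """Count each triangle once: order vertices by degree (descending,
    ties broken by id), keep only edges pointing to higher-ranked
    vertices, and intersect the resulting out-neighborhoods."""
    rank = {v: i for i, v in enumerate(
        sorted(adj, key=lambda v: (-len(adj[v]), v)))}
    out = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}
    return sum(len(out[u] & out[v]) for u in adj for v in out[u])

# One triangle (0-1-2) plus a pendant vertex 3.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(compact_forward(adj))  # 1
```

Note that no matrix multiplication appears anywhere: the same set intersections that a sparse masked matrix product would perform are done directly on the adjacency sets.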
Citations: 20