
Latest publications: 2014 21st International Conference on High Performance Computing (HiPC)

Parallel AMG solver for three dimensional unstructured grids using GPU
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116899
K. Tej, N. Sivadasan, Vatsalya Sharma, R. Banerjee
Graphics Processing Units (GPUs) have evolved over the years from graphics accelerators into scalable coprocessors. We implement an algebraic multigrid (AMG) solver for three-dimensional unstructured grids on the GPU. Such a solver has extensive applications in Computational Fluid Dynamics (CFD). Using a combination of vertex coloring, optimized memory representations, multigrid, and improved coarsening techniques, we obtain considerable speedup in our parallel implementation. Our solver provides significant acceleration for solving pressure Poisson equations, the most time-consuming part of solving the Navier-Stokes equations. In our experimental study, we solve pressure Poisson equations for flow over a lid-driven cavity and for laminar flow past a square cylinder. Compared to serial non-multigrid implementations, our implementation achieves a 915× speedup for the lid-driven cavity problem on a grid of size 2.6 million and a 1020× speedup for the laminar flow past a square cylinder on a grid of size 1.7 million. Our implementation uses NVIDIA's CUDA programming model.
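The abstract does not include code; the cycle at the heart of a multigrid solver can still be illustrated on a small model problem. The sketch below is a minimal geometric two-grid cycle for the 1D Poisson equation (weighted-Jacobi smoothing, full-weighting restriction, direct coarse solve) — it only conveys the multigrid idea and is not the authors' AMG/GPU implementation.

```python
import numpy as np

def jacobi(u, f, h, iters, omega=2.0 / 3.0):
    # Weighted-Jacobi smoother for -u'' = f with zero Dirichlet boundaries.
    for _ in range(iters):
        u[1:-1] = (1 - omega) * u[1:-1] + omega * 0.5 * (u[:-2] + u[2:] + h * h * f[1:-1])
    return u

def residual(u, f, h):
    # r = f - A u for the standard second-order finite-difference Laplacian.
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] + (u[:-2] - 2.0 * u[1:-1] + u[2:]) / (h * h)
    return r

def two_grid_cycle(u, f, h):
    # Pre-smooth, solve the coarse-grid residual equation directly,
    # interpolate the correction back to the fine grid, post-smooth.
    u = jacobi(u, f, h, 3)
    r = residual(u, f, h)
    m = (len(u) - 1) // 2                  # coarse grid has m + 1 points
    H = 2.0 * h
    # Full-weighting restriction of the residual onto interior coarse points.
    rc = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    A = (np.diag(2.0 * np.ones(m - 1))
         - np.diag(np.ones(m - 2), 1)
         - np.diag(np.ones(m - 2), -1)) / (H * H)
    ec = np.zeros(m + 1)
    ec[1:-1] = np.linalg.solve(A, rc)      # direct coarse solve
    # Linear-interpolation prolongation of the coarse correction.
    e = np.zeros_like(u)
    e[::2] = ec
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])
    return jacobi(u + e, f, h, 3)
```

A real AMG solver builds the coarse operators algebraically from the matrix (the coarsening step the paper improves), but the smooth-restrict-solve-prolong structure is the same.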
{"title":"Parallel AMG solver for three dimensional unstructured grids using GPU","authors":"K. Tej, N. Sivadasan, Vatsalya Sharma, R. Banerjee","doi":"10.1109/HiPC.2014.7116899","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116899","url":null,"abstract":"Graphics Processing Units (GPUs) have evolved over the years from being graphics accelerator to scalable coprocessor. We implement an algebraic multigrid solver for three dimensional unstructured grids using GPU. Such a solver has extensive applications in Computational Fluid Dynamics (CFD). Using a combination of vertex coloring, optimized memory representations, multi-grid and improved coarsening techniques, we obtain considerable speedup in our parallel implementation. Our solver provides significant acceleration for solving pressure Poisson equations, which is the most time consuming part while solving Navier-Stokes equations. In our experimental study, we solve pressure Poisson equations for flow over lid driven cavity and for laminar flow past square cylinder. Our implementation achieves 915 times speed up for the lid driven cavity problem on a grid of size 2.6 million and a speed up of 1020 times for the laminar flow past square cylinder problem on a grid of size 1.7 million, compared to serial non-multigrid implementations. For our implementation, we used NVIDIA's CUDA programming model.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123613325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Optimizing shared data accesses in distributed-memory X10 systems
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116889
Jeeva Paudel, O. Tardieu, J. N. Amaral
Prior studies have established the performance impact of coherence protocols optimized for specific patterns of shared-data accesses in Non-Uniform-Memory-Architecture (NUMA) systems. First, this work incorporates a directory-based protocol into the runtime system of X10 - a Partitioned-Global-Address-Space (PGAS) programming language - to manage read-mostly, producer-consumer, stencil, and migratory variables. This protocol complements the existing X10Protocol, which keeps a unique copy of each shared variable and relies on message transfers for all remote accesses. The X10Protocol is effective for managing accumulator, write-mostly, and general read-write variables. Then, it introduces a new shared-variable access-pattern profiler that a new coherence-policy manager uses to decide which protocol should be used for each shared variable. The profiler can run in both offline and online modes. An evaluation on a 128-core distributed-memory machine reveals that coordination between these protocols does not degrade performance on any of the applications studied, and achieves speedups in the range of 15% to 40% over the X10Protocol. The performance is also comparable to carefully hand-written versions of the applications.
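As an illustration of what an access-pattern profiler might compute, the sketch below classifies a shared variable's access trace using simple counting heuristics. The pattern names follow the abstract, but the trace format, thresholds, and tie-breaking rules are invented for illustration; the paper's profiler is more sophisticated.

```python
def classify_accesses(trace):
    """Classify a shared variable's access trace into the access patterns
    the paper's protocols target. trace: list of (place_id, op) events
    with op in {'r', 'w'}. Thresholds here are purely illustrative."""
    if not trace:
        return 'read-mostly'
    reads = sum(1 for _, op in trace if op == 'r')
    writes = len(trace) - reads
    readers = {p for p, op in trace if op == 'r'}
    writers = {p for p, op in trace if op == 'w'}
    if writes == 0 or reads / len(trace) > 0.95:
        return 'read-mostly'
    if len(writers) == 1 and readers - writers:
        return 'producer-consumer'          # one writer, remote readers
    # Migratory: every participant both reads and writes, and the trace
    # is a sequence of per-place runs (read-modify-write, then hand off).
    runs, cur = 0, None
    for p, _ in trace:
        if p != cur:
            runs, cur = runs + 1, p
    if readers == writers and runs == len(readers | writers):
        return 'migratory'
    return 'general-read-write'
```

An online profiler would maintain these counters incrementally and let the policy manager re-evaluate the chosen protocol as the counts evolve.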
{"title":"Optimizing shared data accesses in distributed-memory X10 systems","authors":"Jeeva Paudel, O. Tardieu, J. N. Amaral","doi":"10.1109/HiPC.2014.7116889","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116889","url":null,"abstract":"Prior studies have established the performance impact of coherence protocols optimized for specific patterns of shared-data accesses in Non-Uniform-Memory-Architecture (NUMA) systems. First, this work incorporates a directory-based protocol into the runtime system of X10 - a Partitioned-Global-Address-Space (PGAS) programming language - to manage read-mostly, producer-consumer, stencil, and migratory variables. This protocol complements the existing X10Protocol, which keeps a unique copy of a shared variable and relies on message transfers for all remote accesses. The X10Protocol is effective to manage accumulator, write-mostly and general read-write variables. Then, it introduces a new shared-variable access-pattern profiler that is used by a new coherence-policy manager to decide which protocol should be used for each shared variable. The profiler can be run in both offline and online modes. An evaluation on a 128-core distributed-memory machine reveals that coordination between these protocols does not degrade performance on any of the applications studied, and achieves speedup in the range of 15% to 40% over X10Protocol. 
The performance is also comparable to carefully hand-written versions of the applications.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121692878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Fine-grained GPU parallelization of pairwise local sequence alignment
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116912
Chirag Jain, Subodh Kumar
The Smith-Waterman algorithm is used in bioinformatics to perform pairwise local alignment between a query sequence and a subject sequence. We present a GPU-based parallel version of this algorithm that performs pairwise alignment faster than previous algorithms. In particular, it parallelizes each individual alignment, rather than relying on parallelism across multiple pair alignments as many other proposed GPU algorithms do. As a result, it scales better. We further extend our algorithm to work efficiently on a cluster of GPUs. At a high level, our approach subdivides the iterative computation of matrix elements among blocks of processors such that each block can simply recompute the data it needs instead of waiting for other processors to compute it. Sometimes, however, this may lead to excessive recomputation. We evaluate these cases and employ a hybrid approach, recomputing only limited data and communicating the rest. Our algorithm is also extended to produce not only the best but all `best K' alignments. Our results on the SSCA#1 benchmark show that our method is up to 5-24 times faster than the previous method.
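For reference, the serial Smith-Waterman recurrence that the paper parallelizes looks like this (linear gap penalty; the scoring parameters are illustrative defaults, not the paper's). Each cell depends only on cells on the previous two anti-diagonals, which is the dependency structure a fine-grained GPU version exploits.

```python
def smith_waterman(query, subject, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score with a linear gap penalty.
    Cell (i, j) depends on (i-1, j-1), (i-1, j), and (i, j-1), so all
    cells on one anti-diagonal can be computed in parallel."""
    m, n = len(query), len(subject)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if query[i - 1] == subject[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # match / mismatch
                          H[i - 1][j] + gap,     # gap in subject
                          H[i][j - 1] + gap)     # gap in query
            best = max(best, H[i][j])
    return best
```

Recovering the `best K` alignments additionally requires keeping the K highest-scoring cells and tracing back from each, which this score-only sketch omits.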
{"title":"Fine-grained GPU parallelization of pairwise local sequence alignment","authors":"Chirag Jain, Subodh Kumar","doi":"10.1109/HiPC.2014.7116912","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116912","url":null,"abstract":"The Smith-Waterman algorithm is used in Bio-informatics to perform pairwise local alignment between a query sequence and a subject sequence. We present a GPU based parallel version of this algorithm that is able to perform pair-wise alignment faster than previous algorithms. In particular, it parallelizes each alignment, rather than relying on parallelism across multiple pair alignments, which many other proposed GPU algorithms do. As a result it scales better. We further extend our algorithm to work efficiently on a cluster of GPUs. At a high level, our approach subdivides the iterative computation of elements of a matrix among blocks of processors such that each block can simply recompute the data it needs instead of waiting for other processors to compute them. Sometimes this may lead to excessive recomputation, however. We evaluate these cases and employ a hybrid approach, recomputing only limited data and communicating the rest. Our algorithm is also extended to produce not only the best but all `best K' alignments. Our results on SSCA#1 benchmark show that our method is upto 5-24 times faster than previous method.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130273057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Reducing elimination tree height for parallel LU factorization of sparse unsymmetric matrices
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116880
Enver Kayaaslan, B. Uçar
The elimination tree for unsymmetric matrices is a recent model that plays an important role in sparse LU factorization. This tree captures the dependencies between the tasks of some well-known variants of sparse LU factorization. The height of the elimination tree therefore corresponds to the critical path length of the task dependency graph in the corresponding parallel LU factorization methods. We investigate the problem of finding minimum-height elimination trees to expose a maximum degree of parallelism by minimizing the critical path length. This problem has recently been shown to be NP-complete. We therefore propose heuristics that generalize the most successful approaches used for symmetric matrices to unsymmetric ones. We test the proposed heuristics on a large set of real-world matrices and report a 28% reduction in elimination tree height with respect to a common method that exploits the state-of-the-art tools used in Cholesky factorization.
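To make the height/parallelism connection concrete: in the classical symmetric case the elimination tree can be computed with Liu's path-compression algorithm, and its height is the critical path of the factorization task graph. The sketch below implements that symmetric construction; note the paper targets the *unsymmetric* elimination tree, which is a different (and harder) construction.

```python
def etree(n, upper_entries):
    """Elimination tree of an n x n symmetric sparse matrix via Liu's
    path-compression algorithm. upper_entries: (i, j) pairs with i < j
    marking nonzeros of the strict upper triangle. Roots get parent -1."""
    parent = [-1] * n
    ancestor = [-1] * n                  # path-compressed ancestor links
    cols = [[] for _ in range(n)]
    for i, j in upper_entries:
        cols[j].append(i)
    for j in range(n):
        for i in cols[j]:
            r = i
            while ancestor[r] != -1 and ancestor[r] != j:
                nxt = ancestor[r]
                ancestor[r] = j          # compress the path toward j
                r = nxt
            if ancestor[r] == -1:
                ancestor[r] = j
                parent[r] = j
    return parent

def tree_height(parent):
    # Height (in nodes) of the forest encoded by parent[]; this equals
    # the critical path length of the column-elimination task graph.
    depth = [0] * len(parent)
    def d(v):
        if depth[v] == 0:
            depth[v] = 1 if parent[v] == -1 else 1 + d(parent[v])
        return depth[v]
    return max((d(v) for v in range(len(parent))), default=0)
```

A tridiagonal matrix yields a path-shaped tree (height n, no parallelism), while an arrow matrix yields a star (height 2, all but one column eliminable in parallel) — exactly the contrast the height-reduction heuristics exploit.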
{"title":"Reducing elimination tree height for parallel LU factorization of sparse unsymmetric matrices","authors":"Enver Kayaaslan, B. Uçar","doi":"10.1109/HiPC.2014.7116880","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116880","url":null,"abstract":"The elimination tree for unsymmetric matrices is a recent model playing important roles in sparse LU factorization. This tree captures the dependencies between the tasks of some well-known variants of sparse LU factorization. Therefore, the height of the elimination tree corresponds to the critical path length of the task dependency graph in the corresponding parallel LU factorization methods. We investigate the problem of finding minimum height elimination trees to expose a maximum degree of parallelism by minimizing the critical path length. This problem has recently been shown to be NP-complete. Therefore, we propose heuristics, which generalize the most successful approaches used for symmetric matrices to unsymmetric ones. We test the proposed heuristics on a large set of real world matrices and report 28% reduction in the elimination tree heights with respect to a common method, which exploits the state of the art tools used in Cholesky factorization.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128305752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Distance threshold similarity searches on spatiotemporal trajectories using GPGPU
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116913
M. Gowanlock, H. Casanova
The processing of moving object trajectories arises in many application domains. We focus on a trajectory similarity search, the distance threshold search, which finds all trajectories within a given distance of a query trajectory over a time interval. A multithreaded CPU implementation that makes use of an in-memory R-tree index can achieve high parallel efficiency. We propose a GPGPU implementation that avoids index-trees altogether and instead features a GPU-friendly indexing scheme. We show that our GPU implementation compares well to the CPU implementation. One interesting question is that of creating efficient query batches (so as to reduce both memory pressure and computation cost on the GPU). We design algorithms for creating such batches, and we find that using fixed-size batches is sufficient in practice. We develop an empirical response time model that can be used to pick a good batch size.
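A brute-force version of the distance threshold search is straightforward; the contribution of the paper lies in indexing and batching, which this sketch omits. The trajectory representation (a dict mapping timestamp to point) and matching only at shared timestamps are simplifying assumptions — a full solution interpolates between samples and uses a temporal index.

```python
import math

def distance_threshold_search(query, trajectories, d):
    """Return ids of trajectories that come within distance d of the
    query trajectory at some shared timestamp. Each trajectory is a
    dict mapping timestamp -> (x, y)."""
    result = []
    for tid, traj in trajectories.items():
        for t, (x, y) in traj.items():
            if t in query:
                qx, qy = query[t]
                if math.hypot(x - qx, y - qy) <= d:
                    result.append(tid)
                    break                # one close point suffices
    return result
```

Batching several query trajectories against the same candidate set amortizes data transfer — the batch-sizing question the paper's empirical response-time model addresses.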
{"title":"Distance threshold similarity searches on spatiotemporal trajectories using GPGPU","authors":"M. Gowanlock, H. Casanova","doi":"10.1109/HiPC.2014.7116913","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116913","url":null,"abstract":"The processing of moving object trajectories arises in many application domains. We focus on a trajectory similarity search, the distance threshold search, which finds all trajectories within a given distance of a query trajectory over a time interval. A multithreaded CPU implementation that makes use of an in-memory R-tree index can achieve high parallel efficiency. We propose a GPGPU implementation that avoids index-trees altogether and instead features a GPU-friendly indexing scheme. We show that our GPU implementation compares well to the CPU implementation. One interesting question is that of creating efficient query batches (so as to reduce both memory pressure and computation cost on the GPU). We design algorithms for creating such batches, and we find that using fixed-size batches is sufficient in practice. We develop an empirical response time model that can be used to pick a good batch size.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130307671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Simple parallel biconnectivity algorithms for multicore platforms
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116914
George M. Slota, Kamesh Madduri
We present two new algorithms for finding the biconnected components of a large undirected sparse graph. The first algorithm is based on identifying articulation points and labeling edges using multiple connectivity queries, and the second approach uses the color propagation technique to decompose the graph. Both methods use a breadth-first spanning tree and some auxiliary information computed during Breadth-First Search (BFS). These methods are simpler than the Tarjan-Vishkin PRAM algorithm for biconnectivity and do not require Euler tour computation or any auxiliary graph construction. We identify steps in these algorithms that can be parallelized in a shared-memory environment and develop tuned OpenMP implementations. Using a collection of large-scale real-world graph instances, we show that these methods outperform the state-of-the-art Cong-Bader biconnected components implementation, which is based on the Tarjan-Vishkin algorithm. We achieve up to 7.1× and 4.2× parallel speedup over the serial Hopcroft-Tarjan and parallel Cong-Bader algorithms, respectively, on a 16-core Intel Sandy Bridge system. For some graph instances, due to the fast BFS-based preprocessing step, the single-threaded implementation of our first algorithm is faster than the serial Hopcroft-Tarjan algorithm.
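The serial Hopcroft-Tarjan baseline mentioned in the abstract finds articulation points from DFS discovery times and low-points; a compact sketch (recursive, so suited only to small graphs — the paper's BFS-based methods avoid exactly this inherently sequential DFS):

```python
def articulation_points(adj):
    """Articulation points of an undirected graph via the classic DFS
    low-point method. adj: dict mapping vertex -> list of neighbours."""
    disc, low, aps = {}, {}, set()
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])    # back edge
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # no back edge from v's subtree climbs above u
                if parent is not None and low[v] >= disc[u]:
                    aps.add(u)
        if parent is None and children > 1:
            aps.add(u)                            # root with >1 DFS child

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return aps
```

Biconnected components follow by grouping edges between successive articulation points (e.g. on an edge stack), which this sketch leaves out.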
{"title":"Simple parallel biconnectivity algorithms for multicore platforms","authors":"George M. Slota, Kamesh Madduri","doi":"10.1109/HiPC.2014.7116914","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116914","url":null,"abstract":"We present two new algorithms for finding the biconnected components of a large undirected sparse graph. The first algorithm is based on identifying articulation points and labeling edges using multiple connectivity queries, and the second approach uses the color propagation technique to decompose the graph. Both methods use a breadth-first spanning tree and some auxiliary information computed during Breadth-First Search (BFS). These methods are simpler than the Tarjan-Vishkin PRAM algorithm for biconnectivity and do not require Euler tour computation or any auxiliary graph construction. We identify steps in these algorithms that can be parallelized in a shared-memory environment and develop tuned OpenMP implementations. Using a collection of large-scale real-world graph instances, we show that these methods outperform the state-of-the-art Cong-Bader biconnected components implementation, which is based on the Tarjan-Vishkin algorithm. We achieve up to 7.1× and 4.2× parallel speedup over the serial Hopcroft-Tarjan and parallel Cong-Bader algorithms, respectively, on a 16-core Intel Sandy Bridge system. 
For some graph instances, due to the fast BFS-based preprocessing step, the single-threaded implementation of our first algorithm is faster than the serial Hopcroft-Tarjan algorithm.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"311 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128625123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
Scaling graph community detection on the Tilera many-core architecture
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116708
D. Chavarría-Miranda, M. Halappanavar, A. Kalyanaraman
In an era when power constraints and data movement are proving to be significant barriers for the application of high-end computing, the Tilera many-core architecture offers a low-power platform exhibiting many important characteristics of future systems, including a large number of simple cores, a sophisticated network-on-chip, and fine-grained control over memory and caching policies. While this emerging architecture has previously been studied for structured compute-intensive kernels, benchmarking the platform for data-bound, irregular applications presents significant challenges that have remained unexplored. Community detection is an advanced prototypical graph-theoretic operation with applications in numerous scientific domains including life sciences, cyber security, and power systems. In this work, we explore multiple design strategies toward developing a scalable tool for community detection on the Tilera platform. Using several memory layout and work scheduling techniques, we demonstrate speedups of up to 47× on 36 cores of the Tilera TileGX36 platform over the best serial implementation, and also show results that have comparable quality and performance to mainstream x86 platforms. To the best of our knowledge this is the first work addressing graph algorithms on the Tilera platform. This study demonstrates that through careful design space exploration, low-power many-core platforms like Tilera can be effectively exploited for graph algorithms that embody all the essential characteristics of an irregular application.
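The abstract does not name its community detection algorithm; to show the kind of irregular, neighbor-driven computation involved, here is a minimal label-propagation sketch (a standard community detection heuristic, used here purely as an illustration and not claimed to be the paper's method). It is made deterministic with a fixed vertex order and largest-label tie-breaking.

```python
def label_propagation(adj, max_iters=20):
    """Asynchronous label propagation: each vertex repeatedly adopts the
    most frequent label among its neighbours until no label changes.
    adj: dict mapping vertex -> list of neighbours."""
    labels = {v: v for v in adj}
    for _ in range(max_iters):
        changed = False
        for v in sorted(adj):
            counts = {}
            for u in adj[v]:
                counts[labels[u]] = counts.get(labels[u], 0) + 1
            # Most frequent neighbour label; break ties toward the largest.
            best = max(counts, key=lambda l: (counts[l], l))
            if best != labels[v]:
                labels[v], changed = best, True
        if not changed:
            break
    return labels
```

The access pattern — scattered reads of neighbours' labels with data-dependent writes — is exactly what makes memory layout and work scheduling the dominant concerns on a many-core platform.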
{"title":"Scaling graph community detection on the Tilera many-core architecture","authors":"D. Chavarría-Miranda, M. Halappanavar, A. Kalyanaraman","doi":"10.1109/HiPC.2014.7116708","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116708","url":null,"abstract":"In an era when power constraints and data movement are proving to be significant barriers for the application of high-end computing, the Tilera many-core architecture offers a low-power platform exhibiting many important characteristics of future systems, including a large number of simple cores, a sophisticated network-on-chip, and fine-grained control over memory and caching policies. While this emerging architecture has been previously studied for structured compute-intensive kernels, benchmarking the platform for data-bound, irregular applications present significant challenges that have remained unexplored. Community detection is an advanced prototypical graph-theoretic operation with applications in numerous scientific domains including life sciences, cyber security, and power systems. In this work, we explore multiple design strategies toward developing a scalable tool for community detection on the Tilera platform. Using several memory layout and work scheduling techniques we demonstrate speedups of up to 47× on 36 cores of the Tilera TileGX36 platform over the best serial implementation, and also show results that have comparable quality and performance to mainstream x86 platforms. To the best of our knowledge this is the first work addressing graph algorithms on the Tilera platform. 
This study demonstrates that through careful design space exploration, low-power many-core platforms like Tilera can be effectively exploited for graph algorithms that embody all the essential characteristics of an irregular application.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"352 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114025816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
Queueing-based storage performance modeling and placement in OpenStack environments
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116887
Yang Song, Rakesh Jain, R. Routray
In enterprise data centers, reliable performance models of storage devices are desirable for efficient storage management and optimization. However, many cloud environments consist of heterogeneous storage devices, e.g., a mixture of commodity disks, for which accurate performance models are particularly challenging to attain. In this paper, we propose a lightweight queueing-based storage performance modeling framework that can infer the maximum IO load a storage device can sustain, as well as its IO load vs. response time performance curve. Our inference framework treats the underlying storage resources as black boxes and utilizes only historical measurements of IO and response time on the devices. In an OpenStack environment, we also develop a new storage volume placement algorithm using our performance inference and modeling framework. Experimental results show that our solution can provide up to an 80% increase in IO throughput, together with a 40% reduction in average response time, compared to the performance of the default OpenStack policy.
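The abstract does not specify the queueing model. As an illustration, under an M/M/1 assumption the response-time curve is R(λ) = 1/(μ − λ), so every (load, response time) measurement yields an estimate of the device's service rate μ, and μ itself bounds the sustainable IO load. This is a toy stand-in for the paper's framework, not its actual model.

```python
def infer_service_rate(samples):
    """Infer a device's service rate mu from (load, response_time)
    samples, assuming the M/M/1 relation R = 1 / (mu - lam). Each
    sample implies mu = lam + 1/R; average the per-sample estimates."""
    estimates = [lam + 1.0 / r for lam, r in samples]
    return sum(estimates) / len(estimates)

def predicted_response(mu, lam):
    # Predicted response time at offered load lam (IOPS).
    if lam >= mu:
        return float('inf')    # beyond the maximum sustainable IO load
    return 1.0 / (mu - lam)
```

A placement algorithm can then evaluate `predicted_response` for each candidate device at its post-placement load and pick the device with the most headroom.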
{"title":"Queueing-based storage performance modeling and placement in OpenStack environments","authors":"Yang Song, Rakesh Jain, R. Routray","doi":"10.1109/HiPC.2014.7116887","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116887","url":null,"abstract":"In enterprise data centers, reliable performance models on storage devices are desirable for efficient storage management and optimization. However, many cloud environments consist of heterogeneous storage devices, e.g., a mixture of commodity disks, where accurate performance models are of particular challenge to attain. In this paper, we propose a lightweight queueing-based storage performance modeling framework, which is able to infer the maximum IO load that a storage device can sustain, as well as its IO load v.s. response time performance curve. Our inference framework views the underlying storage resources as blackboxes and only utilizes historical measurements of the IO and response time on the devices. In an OpenStack environment, we also develop a new storage volume placement algorithm using our performance inference and modeling framework. Experimental results show that our solution can provide up to 80% increase of the IO throughput, in tandem with a 40% reduction of the average response time, compared to the performance provided by the default OpenStack policy.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116071268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
RADIR: Lock-free and wait-free bandwidth allocation models for solid state drives
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116908
Pooja Aggarwal, G. Yasa, S. Sarangi
Novel applications such as micro-blogging and algorithmic trading typically place a very high load on the underlying storage system. They are characterized by a stream of very short requests, and thus they require very high I/O throughput. The traditional solution for supporting such applications is to use an array of hard disks. With the advent of solid state drives (SSDs), storage vendors increasingly prefer them because their I/O throughput can scale up to a million IOPS (I/O operations per second). In this paper, we design a family of algorithms, RADIR, to schedule requests for such systems. Our algorithms are lock-free/wait-free and linearizable, and they take the characteristics of requests into account, such as deadlines, request sizes, dependencies, and the amount of available redundancy in RAID configurations. We perform simulations with workloads derived from traces provided by Microsoft and demonstrate a scheduling throughput of 900K IOPS on a 64-thread Intel server. Our algorithms are 2-3 orders of magnitude faster than versions that use locks. We show detailed results for the effect of deadlines, request sizes, and RAID levels on the quality of the schedule.
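Lock-free allocation typically revolves around a compare-and-swap (CAS) retry loop: read the shared state, compute an updated copy, and atomically install it, retrying on conflict. The sketch below reserves free bandwidth slots out of a 64-bit occupancy word. It is an illustration of the general technique, not RADIR's actual data structure, and the CAS is simulated with a Python lock standing in for the atomic hardware instruction.

```python
import threading

class AtomicBitmap:
    """Simulates a hardware CAS word; the internal lock only stands in
    for the atomicity a real compare-and-swap instruction provides."""
    def __init__(self):
        self.word = 0
        self._lock = threading.Lock()

    def load(self):
        return self.word

    def cas(self, expected, new):
        with self._lock:
            if self.word == expected:
                self.word = new
                return True
            return False

def reserve_slots(bitmap, nslots, word_bits=64):
    """Lock-free-style reservation of nslots free bandwidth slots:
    snapshot the occupancy word, pick free bits, CAS, retry on conflict."""
    while True:
        cur = bitmap.load()
        free = [b for b in range(word_bits) if not (cur >> b) & 1][:nslots]
        if len(free) < nslots:
            return None                  # not enough bandwidth available
        new = cur
        for b in free:
            new |= 1 << b
        if bitmap.cas(cur, new):
            return free                  # reservation committed atomically
```

Because a failed CAS means some other thread made progress, the retry loop is lock-free; wait-freedom additionally requires bounding each thread's retries, e.g. via helping, which this sketch does not attempt.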
{"title":"RADIR: Lock-free and wait-free bandwidth allocation models for solid state drives","authors":"Pooja Aggarwal, G. Yasa, S. Sarangi","doi":"10.1109/HiPC.2014.7116908","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116908","url":null,"abstract":"Novel applications such as micro-blogging and algorithmic trading typically place a very high load on the underlying storage system. They are characterized by a stream of very short requests, and thus they require a very high I/O throughput. The traditional solution for supporting such applications is to use an array of hard disks. With the advent of solid state drives (SSDs), storage vendors are increasingly preferring them because their I/O throughput can scale up to a million IOPS (I/O operations per second). In this paper, we design a family of algorithms, RADIR, to schedule requests for such systems. Our algorithms are lock-free/wait-free, lineariz-able, and take the characteristics of requests into account such as the deadlines, request sizes, dependences, and the amount of available redundancy in RAID configurations. We perform simulations with workloads derived from traces provided by Microsoft and demonstrate a scheduling throughput of 900K IOPS on a 64 thread Intel server. Our algorithms are 2-3 orders of magnitude faster than the versions that use locks. 
We show detailed results for the effect of deadlines, request sizes, and the effect of RAID levels on the quality of the schedule.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124689335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
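The paper's lock-free/wait-free schedulers are not reproduced here; as a minimal illustration of the deadline-aware objective the abstract describes, the sketch below serves requests sequentially in earliest-deadline-first (EDF) order. The function name, the `(deadline, size)` request encoding, and the bytes-per-tick service model are all assumptions for illustration, not the paper's API:

```python
import heapq

def edf_schedule(requests, rate):
    """Serve I/O requests one at a time in earliest-deadline-first order.

    requests: list of (deadline, size_bytes) tuples.
    rate: bytes serviced per time unit by the device.
    Returns the set of request indices whose deadlines were met.
    """
    heap = [(deadline, i, size) for i, (deadline, size) in enumerate(requests)]
    heapq.heapify(heap)              # min-heap keyed on the deadline
    now, met = 0.0, set()
    while heap:
        deadline, i, size = heapq.heappop(heap)
        now += size / rate           # time to stream `size` bytes at `rate`
        if now <= deadline:
            met.add(i)
    return met

# A large request with a loose deadline and a small one with a tight deadline:
# EDF serves the tight one first and meets both, where FIFO would miss one.
reqs = [(2.0, 100), (1.0, 50)]       # (deadline, size_bytes)
print(edf_schedule(reqs, rate=100))
```

Note the tie-breaking index `i` in each heap tuple: it keeps the ordering total without ever comparing the payloads.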
Citations: 1
Algorithms for power-aware resource activation
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116891
Sonika Arora, Archita Agarwal, Venkatesan T. Chakaravarthy, Yogish Sabharwal
We study the problem of minimally activating a resource that is shared by multiple jobs. In a power-aware computing environment, the resource needs to be activated (powered up) so that it can service the jobs. Each job specifies an interval during which it needs the services of the resource and the duration (time length) for which it requires the resource to be active. Our goal is to activate the resource for a minimum amount of time, while satisfying all the jobs. We study two variants of this problem, the contiguous and the non-contiguous cases. In the contiguous case, each job requires that its demand for the resource be serviced with a set of contiguous timeslots, whereas in the non-contiguous case, the demand of a job may be serviced with a set of non-contiguous timeslots. For the contiguous case, we present an optimal polynomial time algorithm; this improves the best known result, which is an approximation algorithm having a ratio of 2. For the non-contiguous case, we present efficient algorithms for finding optimal and approximate solutions.
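As a small, hedged illustration of the problem the abstract defines (not the paper's polynomial-time algorithm), the brute-force sketch below enumerates activation sets of increasing size. It assumes a discrete timeline and that the resource, while active in a timeslot, simultaneously serves every job whose window covers that slot; all names and the `(release, deadline, demand)` job encoding are invented for this example:

```python
from itertools import combinations

def job_ok(active, job, contiguous=False):
    """job = (r, d, p): needs p active timeslots within the window [r, d)."""
    r, d, p = job
    if not contiguous:
        return sum(1 for t in range(r, d) if t in active) >= p
    # Contiguous case: some run of p consecutive slots inside [r, d),
    # all of which are active.
    return any(all(t in active for t in range(s, s + p))
               for s in range(r, d - p + 1))

def min_activation(jobs, horizon, contiguous=False):
    """Smallest set of active timeslots in [0, horizon) satisfying every job,
    found by exhaustive search; None if the instance is infeasible."""
    for k in range(horizon + 1):
        for cand in combinations(range(horizon), k):
            active = set(cand)
            if all(job_ok(active, j, contiguous) for j in jobs):
                return active
    return None

# Two overlapping jobs, each demanding 2 slots: slots {2, 3} lie in both
# windows, so two activated slots suffice.
jobs = [(0, 4, 2), (2, 6, 2)]
print(min_activation(jobs, horizon=6))
```

The exhaustive search is exponential and only meant to pin down the problem statement; the point of the paper is that the contiguous case in fact admits an optimal polynomial-time algorithm.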
Citations: 0
Journal
2014 21st International Conference on High Performance Computing (HiPC)