"Highly Scalable Algorithms for the Sparse Grid Combination Technique" by P. Strazdins, Md. Mohsin Ali, B. Harding. DOI: https://doi.org/10.1109/IPDPSW.2015.76

Many petascale and exascale scientific simulations involve the time evolution of systems modelled as Partial Differential Equations (PDEs). The sparse grid combination technique (SGCT) is a cost-effective method for solving time-evolving PDEs, especially for higher-dimensional problems. It consists of evolving the PDE over a set of grids of differing resolution in each dimension, and then combining the results to approximate the solution of the PDE on a grid of high resolution in all dimensions. It can also be extended to support algorithm-based fault tolerance, which is also important for computations at this scale. In this paper, we present two new parallel algorithms for the SGCT that support full distributed-memory parallelization over the dimensions of the component grids, as well as over the grids themselves. The direct algorithm is so called because it directly implements an SGCT combination formula. The second algorithm converts each component grid into its hierarchical surpluses, and then uses the direct algorithm on each of the hierarchical surpluses. The conversion to/from the hierarchical surpluses is also an important algorithm in its own right. An analysis of both indicates that the direct algorithm minimizes the number of messages, whereas the hierarchical surplus algorithm minimizes memory consumption and offers a reduction in bandwidth by a factor of $1 - 2^{-d}$, where $d$ is the dimensionality of the SGCT. However, this is offset by its incomplete parallelism and a factor-of-two load imbalance in practical scenarios. Our analysis also indicates that both are suitable in a bandwidth-limited regime. Experimental results, including the strong and weak scalability of the algorithms, indicate that, for scenarios of practical interest, both are sufficiently scalable to support large-scale SGCT, but the direct algorithm generally performs better, to within a factor of 2. Hierarchical surplus formation is much less communication-intensive, but shows less scalability with increasing core counts.
"An Architecture for Configuring an Efficient Scan Path for a Subset of Elements" by Arash Ashrafi, R. Vaidyanathan. DOI: https://doi.org/10.1109/IPDPSW.2015.124

Many FPGAs support partial reconfiguration, where the states of a subset of configurable elements are (potentially) altered. However, the configuration bits often enter the chip through a small number of pins. Thus, the time needed to partially reconfigure an FPGA depends, to a large extent, on the number of configuration bits to be input into the chip. This is a key consideration, particularly where partial reconfiguration is performed during the computation. Therefore, it is important that the size of a frame (an atomic configuration unit) be small and that the configuration be focused on the bits that truly need to be altered. Suppose C denotes the set of elements that need to be configured during a partial reconfiguration phase, where C is a (small) subset of k frames from a much larger set S of n frames. In this paper we present a method to configure the k elements of C by setting up a configuration path that strings its way through only those frames that require reconfiguration; the configuration bit stream can then be shifted in through this path. Our method also automatically selects (in hardware) a suitable clock speed that can be used to input these configuration bits. If the elements of C show spatial locality, then the configuration time can be made largely independent of n.
{"title":"Message from the HCW Program Committee Chair","authors":"D. Trystram","doi":"10.1109/IPDPSW.2015.155","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.155","url":null,"abstract":"Message from the HCW Program Chair","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114708039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"A Distributed Greedy Heuristic for Computing Voronoi Tessellations with Applications Towards Peer-to-Peer Networks" by Brendan Benshoof, Andrew Rosen, A. Bourgeois, R. Harrison. DOI: https://doi.org/10.1109/IPDPSW.2015.120

Computing Voronoi tessellations in an arbitrary number of dimensions is a computationally difficult task. This problem is exacerbated in distributed environments, such as peer-to-peer networks and wireless networks, where Voronoi tessellations have useful applications. We present our Distributed Greedy Voronoi Heuristic, which approximates Voronoi tessellations in distributed environments. Our heuristic is fast, scalable, works in any geometric space with a distance and midpoint function, and has interesting applications in embedding metrics such as latency in the links of a distributed network.
"Graphulo: Linear Algebra Graph Kernels for NoSQL Databases" by V. Gadepally, Jake Bolewski, D. Hook, D. Hutchison, B. A. Miller, J. Kepner. DOI: https://doi.org/10.1109/IPDPSW.2015.19

Big data and the Internet of Things era continue to challenge computational systems. Several technology solutions such as NoSQL databases have been developed to deal with this challenge. In order to generate meaningful results from large datasets, analysts often use a graph representation which provides an intuitive way to work with the data. Graph vertices can represent users and events, and edges can represent the relationship between vertices. Graph algorithms are used to extract meaningful information from these very large graphs. At MIT, the Graphulo initiative is an effort to perform graph algorithms directly in NoSQL databases such as Apache Accumulo or SciDB, which have an inherently sparse data storage scheme. Sparse matrix operations have a history of efficient implementations, and the Graph Basic Linear Algebra Subprogram (GraphBLAS) community has developed a set of key kernels that can be used to develop efficient linear algebra operations. However, in order to use the GraphBLAS kernels, it is important that common graph algorithms be recast using the linear algebra building blocks. In this article, we look at common classes of graph algorithms and recast them into linear algebra operations using the GraphBLAS building blocks.
"GPU-based Parallel R-tree Construction and Querying" by S. Prasad, Michael McDermott, Xi He, S. Puri. DOI: https://doi.org/10.1109/IPDPSW.2015.127

An R-tree is a data structure for organizing and querying multi-dimensional, non-uniform, and overlapping data. Efficient parallelization of the R-tree is an important problem due to societal applications such as geographic information systems (GIS), spatial database management systems, and VLSI layout, which employ R-trees for spatial analysis tasks such as map overlay. As graphics processing units (GPUs) have emerged as powerful computing platforms, these R-tree-related applications demand efficient R-tree construction and search algorithms on GPUs. This problem is hard due both to (i) the non-linear tree topology of the data structure itself and (ii) the unconventional single-instruction multiple-thread (SIMT) architecture of modern GPUs, which requires careful engineering of a host of issues. Consequently, the current best parallelizations of R-trees on GPUs achieve a limited speedup of only about 20-fold. We present a space-efficient data structure design and a non-trivial bottom-up construction algorithm for the R-tree on GPUs. This has yielded the first demonstrated 226-fold speedup in parallel construction of an R-tree on a GPU compared to one-core execution on a CPU. We also present innovative R-tree search algorithms that are designed to overcome the GPU's architectural and resource limitations. The best of these algorithms gives a speedup of 91-fold to 180-fold on an R-tree with 16384 base objects for query sizes ranging from 2k to 16k.
"Cost-Driven Scheduling for Deadline-Constrained Workflow on Multi-clouds" by Bing Lin, Wenzhong Guo, Guolong Chen, N. Xiong, Rongrong Li. DOI: https://doi.org/10.1109/IPDPSW.2015.56

The tremendous parallel computing ability of Cloud computing as a new service-provisioning paradigm encourages investigators to research its drawbacks and advantages for processing large-scale scientific applications such as workflows. The current Cloud market is composed of numerous diverse Cloud providers, and workflow scheduling is one of the biggest challenges on Multi-Clouds. However, existing works fail either to satisfy the Quality of Service (QoS) requirements of end users or to account for some fundamental principles of Cloud computing, such as the pay-as-you-go pricing model and heterogeneous computing resources. In this paper, we adapt the Partial Critical Paths algorithm (PCPA) to the multi-cloud environment and propose a scheduling strategy for scientific workflows, called Multi-Cloud Partial Critical Paths (MCPCP), which aims to minimize the execution cost of a workflow while satisfying the defined deadline constraint. Our approach takes into account the essential characteristics of Multi-Clouds, such as charging per time interval, various instance types from different Cloud providers, and homogeneous intra-cloud vs. heterogeneous inter-cloud bandwidth. Several well-known workflows are used to evaluate our strategy, and the experimental results show that the proposed approach performs well on Multi-Clouds.
"Perfect Hashing Structures for Parallel Similarity Searches" by T. T. Tran, Mathieu Giraud, Jean-Stéphane Varré. DOI: https://doi.org/10.1109/IPDPSW.2015.105

Seed-based heuristics have proved to be efficient for studying similarity between genetic databases with billions of base pairs. This paper focuses on algorithms and data structures for the filtering phase in seed-based heuristics, with an emphasis on efficient parallel GPU/many-core implementation. We propose a two-stage index structure based on neighborhood indexing and perfect hashing techniques. This structure performs a filtering phase over the neighborhood regions around the seeds in constant time and avoids, as much as possible, random memory accesses and branch divergences. Moreover, it fits particularly well on parallel SIMD processors, because it requires intensive but homogeneous computational operations. Using this data structure, we developed a fast and sensitive OpenCL prototype read mapper.
"Scheduling Computational Workflows on Failure-Prone Platforms" by G. Aupy, A. Benoit, H. Casanova, Y. Robert. DOI: https://doi.org/10.1109/IPDPSW.2015.33

We study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery are used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and whether or not to checkpoint a task after it completes. We give a polynomial-time algorithm for fork graphs and show that the problem is NP-complete for join graphs. Our main result is a polynomial-time algorithm to compute the expected execution time of a workflow given a specified set of tasks to checkpoint. Using this algorithm as a basis, we propose efficient heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations.
"Semi-two-dimensional Partitioning for Parallel Sparse Matrix-Vector Multiplication" by Enver Kayaaslan, B. Uçar, C. Aykanat. DOI: https://doi.org/10.1109/IPDPSW.2015.20

We propose a novel sparse matrix partitioning scheme, called semi-two-dimensional (s2D), for efficient parallelization of sparse matrix-vector multiply (SpMV) operations on distributed-memory systems. In s2D, matrix nonzeros are distributed among processors more flexibly than in one-dimensional (row-wise or column-wise) partitioning schemes. Yet, there is a constraint which renders s2D less flexible than two-dimensional (nonzero-based) partitioning schemes. The constraint is enforced to confine all communication operations to a single phase of a parallel SpMV operation, as in a 1D partition. Viewed positively, s2D can thus be seen as being close to 2D partitions in terms of flexibility, and close to 1D partitions in terms of computation/communication organization. We describe two methods that take partitions of the input and output vectors of SpMV and produce s2D partitions while reducing the total communication volume. The first method obtains the optimal total communication volume, while the second heuristically reduces this quantity and takes the computational load balance into account. We demonstrate that the proposed partitioning method improves the performance of parallel SpMV operations both in theory and in practice with respect to 1D and 2D partitionings.