Heavy hitters are data items that occur at high frequency in a data set. They are among the most important items for an organization to summarize and understand during analytical processing. In data sets with sufficient skew, the number of heavy hitters can be relatively small. We take advantage of this small footprint to compute aggregate functions for the heavy hitters in fast cache memory in a single pass.

We design cache-resident, shared-nothing structures that hold only the most frequent elements. Our algorithm works in three phases. It first samples the input and picks heavy hitter candidates. It then builds a hash table and computes the exact aggregates of these elements. Finally, a validation step identifies the true heavy hitters among the candidates.

We identify trade-offs between hash table configuration and performance. A configuration consists of the probing algorithm and the table capacity, which determines how many candidates can be aggregated. The probing algorithm can be perfect hashing, cuckoo hashing, or bucketized hashing, exploring trade-offs between size and speed.

We optimize performance using SIMD instructions in novel ways that go beyond single vectorized operations, minimizing cache accesses and the instruction footprint.
{"title":"High throughput heavy hitter aggregation for modern SIMD processors","authors":"Orestis Polychroniou, K. A. Ross","doi":"10.1145/2485278.2485284","DOIUrl":"https://doi.org/10.1145/2485278.2485284","url":null,"abstract":"Heavy hitters are data items that occur at high frequency in a data set. They are among the most important items for an organization to summarize and understand during analytical processing. In data sets with sufficient skew, the number of heavy hitters can be relatively small. We take advantage of this small footprint to compute aggregate functions for the heavy hitters in fast cache memory in a single pass.\u0000 We design cache-resident, shared-nothing structures that hold only the most frequent elements. Our algorithm works in three phases. It first samples and picks heavy hitter candidates. It then builds a hash table and computes the exact aggregates of these elements. Finally, a validation step identifies the true heavy hitters from among the candidates.\u0000 We identify trade-offs between the hash table configuration and performance. Configurations consist of the probing algorithm and the table capacity that determines how many candidates can be aggregated. The probing algorithm can be perfect hashing, cuckoo hashing and bucketized hashing to explore trade-offs between size and speed.\u0000 We optimize performance by the use of SIMD instructions, utilized in novel ways beyond single vectorized operations, to minimize cache accesses and the instruction footprint.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125976737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frequent-itemset mining is an essential part of the association rule mining process, which has many application areas. It is a computation- and memory-intensive task with many opportunities for optimization. Many efficient sequential and parallel algorithms have been proposed in recent years. Most of the parallel algorithms, however, cannot cope with the huge number of threads provided by large multiprocessor or many-core systems. In this paper, we provide mcEclat, a highly parallel version of the well-known Eclat algorithm. It runs on both multiprocessor systems and many-core coprocessors, and scales well up to a very large number of threads (244 in our experiments). To evaluate mcEclat's performance, we conducted many experiments on realistic datasets. mcEclat achieves speedups of up to 11.5x and 100x on a 12-core multiprocessor system and a 61-core Xeon Phi many-core coprocessor, respectively. Furthermore, mcEclat is competitive with highly optimized existing frequent-itemset mining implementations taken from the FIMI repository.
{"title":"Scalable frequent itemset mining on many-core processors","authors":"B. Schlegel, Tomas Karnagel, Tim Kiefer, Wolfgang Lehner","doi":"10.1145/2485278.2485281","DOIUrl":"https://doi.org/10.1145/2485278.2485281","url":null,"abstract":"Frequent-itemset mining is an essential part of the association rule mining process, which has many application areas. It is a computation and memory intensive task with many opportunities for optimization. Many efficient sequential and parallel algorithms were proposed in the recent years. Most of the parallel algorithms, however, cannot cope with the huge number of threads that are provided by large multiprocessor or many-core systems. In this paper, we provide a highly parallel version of the well-known Eclat algorithm. It runs on both, multiprocessor systems and many-core coprocessors, and scales well up to a very large number of threads---244 in our experiments. To evaluate mcEclat's performance, we conducted many experiments on realistic datasets. mcEclat achieves high speedups of up to 11.5x and 100x on a 12-core multiprocessor system and a 61-core Xeon Phi many-core coprocessor, respectively. Furthermore, mcEclat is competitive with highly optimized existing frequent-itemset mining implementations taken from the FIMI repository.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127920720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Because the energy use of single-server systems is far from energy proportional, we explore whether better energy efficiency can be achieved by a cluster of nodes whose size is dynamically adjusted to the current workload demand. As data-intensive workloads, we submit specific TPC-H queries against a distributed shared-nothing DBMS, capturing time and energy use with dedicated monitoring and measurement devices. We configure static clusters of varying sizes and show their influence on energy efficiency and performance. Further, using an EnergyController and a load-aware scheduler, we verify the hypothesis that energy proportionality can be well approximated by dynamic clusters.
{"title":"Energy-proportional query execution using a cluster of wimpy nodes","authors":"D. Schall, T. Härder","doi":"10.1145/2485278.2485279","DOIUrl":"https://doi.org/10.1145/2485278.2485279","url":null,"abstract":"Because energy use of single-server systems is far from being energy proportional, we explore whether or not better energy efficiency may be achieved by a cluster of nodes whose size is dynamically adjusted to the current workload demand. As data-intensive workloads, we submit specific TPC-H queries against a distributed shared-nothing DBMS, where time and energy use are captured by specific monitoring and measurement devices. We configure various static clusters of varying sizes and show their influence on energy efficiency and performance. Further, using an EnergyController and a load-aware scheduler, we verify the hypothesis that energy proportionality can be well approximated by dynamic clusters.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"2011 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116951958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Upcoming processors combine different computing units in a tightly coupled fashion with a unified shared memory hierarchy, which leads to novel properties with regard to cooperation and interaction. This paper demonstrates the advantages of such processors for a stream-join operator, an important data-intensive example. Specifically, we propose the HELLS-Join approach, which employs all heterogeneous devices by offloading parts of the algorithm to the most appropriate device. HELLS-Join outperforms CPU-only stream joins, allowing wider time windows, higher stream frequencies, and more streams to be joined than before.
{"title":"The HELLS-join: a heterogeneous stream join for extremely large windows","authors":"Tomas Karnagel, Dirk Habich, B. Schlegel, Wolfgang Lehner","doi":"10.1145/2485278.2485280","DOIUrl":"https://doi.org/10.1145/2485278.2485280","url":null,"abstract":"Upcoming processors are combining different computing units in a tightly-coupled approach using a unified shared memory hierarchy. This tightly-coupled combination leads to novel properties with regard to cooperation and interaction. This paper demonstrates the advantages of those processors for a stream-join operator as an important data-intensive example. In detail, we propose our HELLS-Join approach employing all heterogeneous devices by outsourcing parts of the algorithm on the appropriate device. Our HELLS-Join performs better than CPU stream joins, allowing wider time windows, higher stream frequencies, and more streams to be joined as before.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122556282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Even though main memory is becoming large enough to fit most OLTP databases, it may not always be the best option. OLTP workloads typically exhibit skewed access patterns where some records are hot (frequently accessed) but many records are cold (infrequently or never accessed). It is therefore more economical to store the coldest records on a fast secondary storage device such as a solid-state disk. However, main-memory DBMSs have no knowledge of secondary storage, while traditional disk-based databases, designed for workloads where data resides on HDDs, introduce too much overhead for the common case where the working set is memory resident.

In this paper, we propose a simple and low-overhead technique that enables main-memory databases to efficiently migrate cold data to secondary storage by relying on the OS's virtual memory paging mechanism. We propose to log accesses at the tuple level, process the access traces offline to identify relevant access patterns, and then transparently re-organize the in-memory data structures to reduce paging I/O and improve hit rates. The hot/cold data separation is performed on demand and incrementally through careful memory management, without any change to the underlying data structures. We validate the data re-organization proposal experimentally and show that OS paging can be efficient: a TPC-C database can grow two orders of magnitude larger than the available memory size without a noticeable impact on performance.
{"title":"Enabling efficient OS paging for main-memory OLTP databases","authors":"R. Stoica, A. Ailamaki","doi":"10.1145/2485278.2485285","DOIUrl":"https://doi.org/10.1145/2485278.2485285","url":null,"abstract":"Even though main memory is becoming large enough to fit most OLTP databases, it may not always be the best option. OLTP workloads typically exhibit skewed access patterns where some records are hot (frequently accessed) but many records are cold (infrequently or never accessed). Therefore, it is more economical to store the coldest records on a fast secondary storage device such as a solid-state disk. However, main-memory DBMS have no knowledge of secondary storage, while traditional disk-based databases, designed for workloads where data resides on HDD, introduce too much overhead for the common case where the working set is memory resident.\u0000 In this paper, we propose a simple and low-overhead technique that enables main-memory databases to efficiently migrate cold data to secondary storage by relying on the OS's virtual memory paging mechanism. We propose to log accesses at the tuple level, process the access traces offline to identify relevant access patterns, and then transparently re-organize the in-memory data structures to reduce paging I/O and improve hit rates. The hot/cold data separation is performed on demand and incrementally through careful memory management, without any change to the underlying data structures. We validate experimentally the data re-organization proposal and show that OS paging can be efficient: a TPC-C database can grow two orders of magnitude larger than the available memory size without a noticeable impact on performance.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124950364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
For several decades, online transaction processing has been one of the main applications that drive innovation in the data management ecosystem, and in turn in the database and computer architecture communities. Despite novel approaches from industry and various research proposals from academia, recent studies emphasize that OLTP workloads still cannot exploit the full capability of modern processors.

To better integrate OLTP and hardware in future systems, we perform a detailed analysis of instruction and data misses, the main causes of memory stalls. We demonstrate which operations and components of a typical storage manager cause the majority of the different types of misses at each level of the memory hierarchy, on a configuration that closely represents modern commodity hardware. We also observe the impact of the data working set size on these misses.

According to our experimental results, L1 instruction misses are a major cause of the overall stall time for OLTP, even for data working set sizes as large as 100GB, as long as the data fits in memory. Capacity misses from the index probe operation are the dominant cause of instruction and data stalls when running typical OLTP workloads. During an index probe (one of the most common operations in OLTP), the B-tree, lock, and buffer management components of the storage manager are responsible for more than half of the total misses.
{"title":"OLTP in wonderland: where do cache misses come from in major OLTP components?","authors":"Pınar Tözün, Brian T. Gold, A. Ailamaki","doi":"10.1145/2485278.2485286","DOIUrl":"https://doi.org/10.1145/2485278.2485286","url":null,"abstract":"For several decades, online transaction processing has been one of the main applications that drives innovations in the data management ecosystem, and in turn the database and computer architecture communities. Despite the novel approaches from industry and various research proposals from academia, recent studies emphasize that OLTP workloads still cannot exploit the full capability of modern processors.\u0000 To better integrate OLTP and hardware in future systems, we perform a detailed analysis of instruction and data misses, the main causes of memory stalls. We demonstrate which operations and components of a typical storage manager cause the majority of different types of misses in each level of the memory hierarchy on a configuration that closely represents modern commodity hardware. We also observe the impact of data working set size on these misses.\u0000 According to our experimental results, L1 instruction misses are an extensive cause of the overall stall time for OLTP even for data working set sizes as large as 100GB as long as the data fits in memory. Capacity misses coming from the index probe operation are the dominant cause of the instruction and data stalls when running typical OLTP workloads. During index probe (one of the most common operations in OLTP), the B-tree, lock, and buffer management components of a storage manager are responsible for more than half of the total misses.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"11 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129191451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Implementations of data processing operators on GPU processors have achieved significant performance improvements over their multicore CPU counterparts. To achieve maximum performance, database operator implementations must take into consideration special features of GPU architectures. A crucial difference is that the unit of execution is a group ("warp") of threads, 32 threads in our target architecture, as opposed to a single thread for CPUs. In the presence of branches, threads in a warp have to follow the same execution path; if some threads diverge, the different paths are serialized. Additionally, similarly to CPUs, branches degrade the efficiency of instruction scheduling. Here, we study conjunctive selection queries, where branching hurts performance. We compute the optimal execution plan for a conjunctive query, taking branch penalties into account, and consider both single-kernel and multi-kernel plans. Our evaluation suggests that divergence affects performance significantly and that our techniques reduce resource underutilization and improve operator performance.
{"title":"Optimizing select conditions on GPUs","authors":"Evangelia A. Sitaridi, K. A. Ross","doi":"10.1145/2485278.2485282","DOIUrl":"https://doi.org/10.1145/2485278.2485282","url":null,"abstract":"Implementations of data processing operators on GPU processors have achieved significant performance improvements over their multicore CPU counterparts. To achieve maximum performance, database operator implementations must take into consideration special features of GPU architectures. A crucial difference is that the unit of execution is a group (\"warp\") of threads, 32 threads in our target architecture, as opposed to a single thread for CPUs. In the presence of branches, threads in a warp have to follow the same execution path; if some threads diverge then different paths are serialized. Additionally, similarly to CPUs, branches degrade the efficiency of instruction scheduling. Here, we study conjunctive selection queries where branching hurts performance. We compute the optimal execution plan for a conjunctive query, taking branch penalties into account and consider both single-kernel and multi-kernel plans. Our evaluation suggests that divergence affects performance significantly and that our techniques reduce resource underutilization and improve operator performance.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124420067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The architecture and algorithms of database systems have been built around the properties of existing hardware technologies. Many such elementary design assumptions are 20 to 30 years old. Over the last five years we have witnessed multiple new I/O technologies (e.g. Flash SSDs, NV-Memories) that have the potential to change these assumptions. Some of the key technological differences to traditional spinning-disk storage are: (i) asymmetric read/write performance; (ii) low latencies; (iii) fast random reads; (iv) endurance issues.

Cost functions used by traditional database query optimizers are directly influenced by these properties. Most cost functions estimate the cost of algorithms based on metrics such as sequential and random I/O costs, besides CPU and memory consumption. These do not account for read/write asymmetry or for fast random reads paired with inferior random-write performance, which represents a significant mismatch.

In this paper we present a new asymmetry-aware cost model for Flash SSDs with adapted cost functions for algorithms such as external sort, hash join, sequential scan, and index scan. It has been implemented in PostgreSQL and tested with TPC-H. Additionally, we describe a tool that automatically finds good settings for the base coefficients of cost models. After tuning the configuration of both the original and the asymmetry-aware cost model with that tool, the optimizer with the asymmetry-aware cost model selects faster execution plans for 14 out of the 22 TPC-H queries (the rest being the same or negligibly worse). We achieve an overall performance improvement of 48% on SSD.
{"title":"Making cost-based query optimization asymmetry-aware","authors":"Daniel Bausch, Ilia Petrov, A. Buchmann","doi":"10.1145/2236584.2236588","DOIUrl":"https://doi.org/10.1145/2236584.2236588","url":null,"abstract":"The architecture and algorithms of database systems have been built around the properties of existing hardware technologies. Many such elementary design assumptions are 20--30 years old. Over the last five years we witness multiple new I/O technologies (e.g. Flash SSDs, NV-Memories) that have the potential of changing these assumptions. Some of the key technological differences to traditional spinning disk storage are: (i) asymmetric read/write performance; (ii) low latencies; (iii) fast random reads; (iv) endurance issues.\u0000 Cost functions used by traditional database query optimizers are directly influenced by these properties. Most cost functions estimate the cost of algorithms based on metrics such as sequential and random I/O costs besides CPU and memory consumption. These do not account for asymmetry or high random read and inferior random write performance, which represents a significant mismatch.\u0000 In the present paper we show a new asymmetry-aware cost model for Flash SSDs with adapted cost functions for algorithms such as external sort, hash-join, sequential scan, index scan, etc. It has been implemented in PostgreSQL and tested with TPC-H. Additionally we describe a tool that automatically finds good settings for the base coefficients of cost models. After tuning the configuration of both the original and the asymmetry-aware cost model with that tool, the optimizer with the asymmetry-aware cost model selects faster execution plans for 14 out of the 22 TPC-H queries (the rest being the same or negligibly worse). We achieve an overall performance improvement of 48% on SSD.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128378884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Implementations of database operators on GPU processors have shown dramatic performance improvements compared to multicore CPU implementations. GPU threads can cooperate using shared memory, which is organized in interleaved banks and is fast only when the threads of a warp read and modify addresses belonging to distinct memory banks. Therefore, data processing operators implemented on a GPU, in addition to contention caused by popular values, have to deal with a new performance-limiting factor: thread serialization when accessing values belonging to the same bank.

Here, we define the problem of bank- and value-conflict optimization for data processing operators on the CUDA platform. To analyze the impact of these two factors on operator performance, we use two database operations: foreign-key join and grouped aggregation. We suggest and evaluate techniques for optimizing the data arrangement offline by creating clones of values to reduce overall memory contention. Results indicate that columns used for writes, such as grouping columns, need to be optimized to fully exploit the maximum bandwidth of shared memory.
{"title":"Ameliorating memory contention of OLAP operators on GPU processors","authors":"Evangelia A. Sitaridi, K. A. Ross","doi":"10.1145/2236584.2236590","DOIUrl":"https://doi.org/10.1145/2236584.2236590","url":null,"abstract":"Implementations of database operators on GPU processors have shown dramatic performance improvement compared to multicore-CPU implementations. GPU threads can cooperate using shared memory, which is organized in interleaved banks and is fast only when threads read and modify addresses belonging to distinct memory banks. Therefore, data processing operators implemented on a GPU, in addition to contention caused by popular values, have to deal with a new performance limiting factor: thread serialization when accessing values belonging to the same bank.\u0000 Here, we define the problem of bank and value conflict optimization for data processing operators using the CUDA platform. To analyze the impact of these two factors on operator performance we use two database operations: foreignkey join and grouped aggregation. We suggest and evaluate techniques for optimizing the data arrangement offline by creating clones of values to reduce overall memory contention. Results indicate that columns used for writes, as grouping columns, need be optimized to fully exploit the maximum bandwidth of shared memory.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131401225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Until recently, the use of graphics processing units (GPUs) for query processing was limited by the amount of memory on the graphics card, a few gigabytes at best. Moreover, input tables had to be copied to GPU memory before they could be processed, and after computation was completed, query results had to be copied back to CPU memory. The newest generation of Nvidia GPUs and development tools introduces a common memory address space, which now allows the GPU to access CPU memory directly, lifting size limitations and obviating data copy operations. We confirm that this new technology can sustain 98% of its nominal rate of 6.3 GB/sec in practice, and exploit it to process database hash joins at the same rate, i.e., the join is processed "on the fly" as the GPU reads the input tables from CPU memory at PCI-E speeds. Compared to the fastest published results for in-memory joins on the CPU, this represents more than half an order of magnitude speed-up. All of our results include the cost of result materialization (often omitted in earlier work), and we investigate the implications of changing join predicate selectivity and table size.
{"title":"GPU join processing revisited","authors":"T. Kaldewey, G. Lohman, René Müller, P. Volk","doi":"10.1145/2236584.2236592","DOIUrl":"https://doi.org/10.1145/2236584.2236592","url":null,"abstract":"Until recently, the use of graphics processing units (GPUs) for query processing was limited by the amount of memory on the graphics card, a few gigabytes at best. Moreover, input tables had to be copied to GPU memory before they could be processed, and after computation was completed, query results had to be copied back to CPU memory. The newest generation of Nvidia GPUs and development tools introduces a common memory address space, which now allows the GPU to access CPU memory directly, lifting size limitations and obviating data copy operations. We confirm that this new technology can sustain 98% of its nominal rate of 6.3 GB/sec in practice, and exploit it to process database hash joins at the same rate, i.e., the join is processed \"on the fly\" as the GPU reads the input tables from CPU memory at PCI-E speeds. Compared to the fastest published results for in-memory joins on the CPU, this represents more than half an order of magnitude speed-up. All of our results include the cost of result materialization (often omitted in earlier work), and we investigate the implications of changing join predicate selectivity and table size.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115926562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}