Vectorization vs. compilation in query execution
Juliusz Sompolski, M. Zukowski, P. Boncz
International Workshop on Data Management on New Hardware (DaMoN), 2011. doi:10.1145/1995441.1995446

Compiling database queries into executable (sub-)programs provides substantial benefits compared with traditional interpreted execution. Many of these benefits, such as reduced interpretation overhead, better instruction-code locality, and opportunities to use SIMD instructions, have previously been obtained by redesigning query processors to use a vectorized execution model. In this paper, we shed light on how state-of-the-art compilation strategies relate to vectorized execution for analytical database workloads on modern CPUs. For this purpose, we carefully investigate the behavior of vectorized and compiled strategies inside the Ingres VectorWise database system in three use cases: Project, Select, and Hash Join. One finding is that compilation should always be combined with block-wise query execution. Another contribution is identifying three cases where "loop-compilation" strategies are inferior to vectorized execution. We therefore propose a careful merging of the two strategies for optimal performance: either incorporating vectorized execution principles into compiled query plans or using query compilation to create building blocks for vectorized processing.
{"title":"Vectorization vs. compilation in query execution","authors":"Juliusz Sompolski, M. Zukowski, P. Boncz","doi":"10.1145/1995441.1995446","DOIUrl":"https://doi.org/10.1145/1995441.1995446","url":null,"abstract":"Compiling database queries into executable (sub-) programs provides substantial benefits comparing to traditional interpreted execution. Many of these benefits, such as reduced interpretation overhead, better instruction code locality, and providing opportunities to use SIMD instructions, have previously been provided by redesigning query processors to use a vectorized execution model. In this paper, we try to shed light on the question of how state-of-the-art compilation strategies relate to vectorized execution for analytical database workloads on modern CPUs. For this purpose, we carefully investigate the behavior of vectorized and compiled strategies inside the Ingres VectorWise database system in three use cases: Project, Select and Hash Join. One of the findings is that compilation should always be combined with block-wise query execution. Another contribution is identifying three cases where \"loop-compilation\" strategies are inferior to vectorized execution. As such, a careful merging of these two strategies is proposed for optimal performance: either by incorporating vectorized execution principles into compiled query plans or using query compilation to create building blocks for vectorized processing.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134372689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

QMD: exploiting flash for energy efficient disk arrays
Sean M. Snyder, Shimin Chen, Panos K. Chrysanthis, Alexandros Labrinidis
International Workshop on Data Management on New Hardware (DaMoN), 2011. doi:10.1145/1995441.1995447
Energy consumption of computing devices in general, and of data centers in particular, is receiving increasing attention, both because of the growing ubiquity of computing and because of rising energy prices. In this work, we propose QMD (Quasi Mirrored Disks), which exploits flash as a write buffer to complement RAID systems consisting of hard disks. QMD, along with partial on-line mirrors, is a first step towards energy proportionality, which is seen as the "holy grail" of energy-efficient system design. QMD exhibits significant energy savings of up to 31% in our evaluation using real workloads.
{"title":"QMD: exploiting flash for energy efficient disk arrays","authors":"Sean M. Snyder, Shimin Chen, Panos K. Chrysanthis, Alexandros Labrinidis","doi":"10.1145/1995441.1995447","DOIUrl":"https://doi.org/10.1145/1995441.1995447","url":null,"abstract":"Energy consumption for computing devices in general and for data centers in particular is receiving increasingly high attention, both because of the increasing ubiquity of computing and also because of increasing energy prices. In this work, we propose QMD (Quasi Mirrored Disks) that exploit flash as a write buffer to complement RAID systems consisting of hard disks. QMD along with partial on-line mirrors, are a first step towards energy proportionality which is seen as the \"holy grail\" of energy-efficient system design. QMD exhibits significant energy savings of up 31%, as per our evaluation study using real workloads.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115404110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Towards highly parallel event processing through reconfigurable hardware
Mohammad Sadoghi, Harsh V. P. Singh, H. Jacobsen
International Workshop on Data Management on New Hardware (DaMoN), 2011. doi:10.1145/1995441.1995445

We present fpga-ToPSS (Toronto Publish/Subscribe System), an efficient event processing platform supporting high-frequency, low-latency event matching. fpga-ToPSS is built on reconfigurable hardware---FPGAs---to achieve line-rate processing by exploiting various degrees of parallelism. Each of our proposed FPGA-based designs is geared towards a particular application requirement, such as flexibility, adaptability, scalability, or pure performance, and is specifically optimized to attain a high level of parallelism; each solution thus embodies a design trade-off between the degree of parallelism and the desired application requirement. Moreover, our event processing engine supports Boolean expression matching with an expressive predicate language applicable to a wide range of applications, including real-time data analysis, algorithmic trading, targeted advertisement, and (complex) event processing.
{"title":"Towards highly parallel event processing through reconfigurable hardware","authors":"Mohammad Sadoghi, Harsh V. P. Singh, H. Jacobsen","doi":"10.1145/1995441.1995445","DOIUrl":"https://doi.org/10.1145/1995441.1995445","url":null,"abstract":"We present fpga-ToPSS (Toronto Publish/Subscribe System), an efficient event processing platform to support high-frequency and low-latency event matching. fpga-ToPSS is built over reconfigurable hardware---FPGAs---to achieve line-rate processing by exploring various degrees of parallelism. Furthermore, each of our proposed FPGA-based designs is geared towards a unique application requirement, such as flexibility, adaptability, scalability, or pure performance, such that each solution is specifically optimized to attain a high level of parallelism. Therefore, each solution is formulated as a design trade-off between the degree of parallelism versus the desired application requirement. Moreover, our event processing engine supports Boolean expression matching with an expressive predicate language applicable to a wide range of applications including real-time data analysis, algorithmic trading, targeted advertisement, and (complex) event processing.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"384 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133510658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

The effects of virtualization on main memory systems
M. Grund, J. Schaffner, Jens Krüger, Jan Brunnert, A. Zeier
International Workshop on Data Management on New Hardware (DaMoN), 2010. doi:10.1145/1869389.1869395
Virtualization is mainly employed to increase the utilization of lightly loaded systems through consolidation, but also to ease administration through the ability to rapidly provision or migrate virtual machines. These facilities are crucial for efficiently managing large data centers. At the same time, modern hardware --- such as Intel's Nehalem microarchitecture --- changes critical assumptions about performance bottlenecks, and software systems that explicitly exploit the underlying hardware --- such as main memory databases --- are gaining momentum. In this paper, we address the question of how these specialized software systems perform in a virtualized environment. To do so, we present a set of experiments covering three variants of in-memory data processing: the MonetDB Calibrator, a fine-grained hybrid row/column in-memory database running an OLTP workload, and an in-memory column-store database running a multi-user OLAP workload. We examine how memory management in virtual machine monitors affects these three classes of applications. For the multi-user OLAP experiment, we also experimentally compare a virtualized Nehalem server to one of its predecessors. We show that saturation of the memory bus is a major limiting factor, but one that is far less pronounced on the newer architecture.
{"title":"The effects of virtualization on main memory systems","authors":"M. Grund, J. Schaffner, Jens Krüger, Jan Brunnert, A. Zeier","doi":"10.1145/1869389.1869395","DOIUrl":"https://doi.org/10.1145/1869389.1869395","url":null,"abstract":"Virtualization is mainly employed for increasing the utilization of a lightly-loaded system by consolidation, but also to ease the administration based on the possibility to rapidly provision or migrate virtual machines. These facilities are crucial for efficiently managing large data centers. At the same time, modern hardware --- such as Intel's Nehalem microarchitecure --- change critical assumptions about performance bottlenecks and software systems explicitly exploiting the underlying hardware --- such as main memory databases --- gain increasing momentum.\u0000 In this paper, we address the question of how these specialized software systems perform in a virtualized environment. To do so, we present a set of experiments looking at several different variants of in-memory databases: The MonetDB Calibrator, a fine-grained hybrid row/column in-memory database running an OLTP workload, and an in-memory column store database running a multi-user OLAP workload.\u0000 We examine how memory management in virtual machine monitors affects these three classes of applications. For the multi-user OLAP experiment we also experimentally compare a virtualized Nehalem server to one of its predecessors. We show that saturation of the memory bus is a major limiting factor but is much less impactful on the new architecture.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117149576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Wimpy node clusters: what about non-wimpy workloads?
Willis Lang, J. Patel, S. Shankar
International Workshop on Data Management on New Hardware (DaMoN), 2010. doi:10.1145/1869389.1869396

The high cost associated with powering servers has introduced new challenges in improving the energy efficiency of clusters running data processing jobs. Traditional high-performance servers are largely energy-inefficient due to factors such as the over-provisioning of resources. The trend of replacing traditional high-performance server nodes with low-power, low-end nodes in clusters has recently been touted as a solution to the cluster energy problem. However, the key tacit assumption behind such a solution is that the proportional scale-out of such low-power cluster nodes results in constant scaleup in performance. This paper studies the validity of that assumption using measured price and performance results from a low-power Atom-based node and a traditional Xeon-based server, together with a number of published parallel scaleup results. Our results show that in most cases, computationally complex queries exhibit disproportionate scaleup characteristics, which potentially makes scale-out with low-end nodes an expensive and lower-performance solution.
{"title":"Wimpy node clusters: what about non-wimpy workloads?","authors":"Willis Lang, J. Patel, S. Shankar","doi":"10.1145/1869389.1869396","DOIUrl":"https://doi.org/10.1145/1869389.1869396","url":null,"abstract":"The high cost associated with powering servers has introduced new challenges in improving the energy efficiency of clusters running data processing jobs. Traditional high-performance servers are largely energy inefficient due to various factors such as the over-provisioning of resources. The increasing trend to replace traditional high-performance server nodes with low-power low-end nodes in clusters has recently been touted as a solution to the cluster energy problem. However, the key tacit assumption that drives such a solution is that the proportional scale-out of such low-power cluster nodes results in constant scaleup in performance. This paper studies the validity of such an assumption using measured price and performance results from a low-power Atom-based node and a traditional Xeon-based server and a number of published parallel scaleup results. Our results show that in most cases, computationally complex queries exhibit disproportionate scaleup characteristics which potentially makes scale-out with low-end nodes an expensive and lower performance solution.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126972770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Flashing databases: expectations and limitations
S. Baumann, Giel de Nijs, M. Strobel, K. Sattler
International Workshop on Data Management on New Hardware (DaMoN), 2010. doi:10.1145/1869389.1869391

Flash devices (solid state disks) promise a significant performance improvement for disk-based database processing. However, database storage structures and processing strategies originally designed for magnetic disks prevent the optimal utilization of SSDs. Building on previous work on benchmarking SSDs and on a detailed discussion of I/O methods, in this paper we analyze appropriate execution methods for database processing, identify important parameters and boundaries, and present a tool that helps to derive these parameters.
{"title":"Flashing databases: expectations and limitations","authors":"S. Baumann, Giel de Nijs, M. Strobel, K. Sattler","doi":"10.1145/1869389.1869391","DOIUrl":"https://doi.org/10.1145/1869389.1869391","url":null,"abstract":"Flash devices (solid state disks) promise a significant performance improvement for disk-based database processing. However, database storage structures and processing strategies originally designed for magnetic disks prevent the optimal utilization of SSDs. Based on previous work on bench-marking SSDs and a detailed discussion of I/O methods, in this paper, we analyze appropriate execution methods for database processing as well as important parameters and boundaries and present a tool which helps to derive these parameters.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129308866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Supporting extended precision on graphics processors
Mian Lu, Bingsheng He, Qiong Luo
International Workshop on Data Management on New Hardware (DaMoN), 2010. doi:10.1145/1869389.1869392

Scientific computing applications often require support for non-traditional data types, for example numbers with a precision higher than 64-bit floats. As graphics processors (GPUs) have emerged as a powerful accelerator for scientific computing, we design and implement a GPU-based extended-precision library to enable applications with high-precision requirements to run on the GPU. Our library contains arithmetic operators, mathematical functions, and data-parallel primitives, each of which can operate at either multi-term or multi-digit precision. The multi-term precision maintains an accuracy of up to 212 bits of significand, whereas the multi-digit precision allows an accuracy of an arbitrary number of bits. Additionally, we have integrated the extended-precision algorithms into a GPU-based query processing engine to support efficient query processing with extended precision on GPUs. To demonstrate the usage of our library, we have implemented three applications: parallel summation in climate modeling, Newton's method as used in nonlinear physics, and high-precision numerical integration in experimental mathematics. The GPU-based implementations are up to an order of magnitude faster than their optimized quad-core CPU-based counterparts, while achieving the same accuracy.
{"title":"Supporting extended precision on graphics processors","authors":"Mian Lu, Bingsheng He, Qiong Luo","doi":"10.1145/1869389.1869392","DOIUrl":"https://doi.org/10.1145/1869389.1869392","url":null,"abstract":"Scientific computing applications often require support for non-traditional data types, for example, numbers with a precision higher than 64-bit floats. As graphics processors, or GPUs, have emerged as a powerful accelerator for scientific computing, we design and implement a GPU-based extended precision library to enable applications with high precision requirement to run on the GPU. Our library contains arithmetic operators, mathematical functions, and data-parallel primitives, each of which can operate at either multi-term or multi-digit precision. The multi-term precision maintains an accuracy of up to 212 bits of signifcand whereas the multi-digit precision allows an accuracy of an arbitrary number of bits. Additionally, we have integrated the extended precision algorithms to a GPU-based query processing engine to support efficient query processing with extended precision on GPUs. To demonstrate the usage of our library, we have implemented three applications: parallel summation in climate modeling, Newton's method used in nonlinear physics, and high precision numerical integration in experimental mathematics. The GPU-based implementation is up to an order of magnitude faster, and achieves the same accuracy as their optimized, quadcore CPU-based counterparts.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124773884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Fast integer compression using SIMD instructions
B. Schlegel, Rainer Gemulla, Wolfgang Lehner
International Workshop on Data Management on New Hardware (DaMoN), 2010. doi:10.1145/1869389.1869394

We study algorithms for efficient compression and decompression of sequences of integers on modern hardware. Our focus is on universal codes in which the codeword length is a monotonically non-decreasing function of the uncompressed integer value; such codes are widely used for compressing "small integers". In contrast to traditional integer compression, our algorithms make use of the SIMD capabilities of modern processors by encoding multiple integer values at once. More specifically, we provide SIMD versions of both null suppression and Elias gamma encoding. Our experiments show that these versions provide speedups of 1.5x to 6.7x for decompression, while maintaining similar compression performance.
{"title":"Fast integer compression using SIMD instructions","authors":"B. Schlegel, Rainer Gemulla, Wolfgang Lehner","doi":"10.1145/1869389.1869394","DOIUrl":"https://doi.org/10.1145/1869389.1869394","url":null,"abstract":"We study algorithms for efficient compression and decompression of a sequence of integers on modern hardware. Our focus is on universal codes in which the codeword length is a monotonically non-decreasing function of the uncompressed integer value; such codes are widely used for compressing \"small integers\". In contrast to traditional integer compression, our algorithms make use of the SIMD capabilities of modern processors by encoding multiple integer values at once. More specifically, we provide SIMD versions of both null suppression and Elias gamma encoding. Our experiments show that these versions provide a speedup from 1.5x up to 6.7x for decompression, while maintaining a similar compression performance.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120990339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

On the impact of flash SSDs on spatial indexing
Tobias Emrich, Franz Graf, H. Kriegel, Matthias Schubert, Marisa Thoma
International Workshop on Data Management on New Hardware (DaMoN), 2010. doi:10.1145/1869389.1869390
Similarity queries are an important query type in multimedia databases. To implement these types of queries, database systems often use spatial index structures such as the R*-Tree. However, the majority of performance evaluations for spatial index structures rely on a storage layer based on conventional hard drives. Since newer devices such as solid state disks (SSDs) have completely different performance characteristics, it is an open question to what extent existing index structures profit from these modern storage devices. In this paper, we therefore examine the performance behaviour of the R*-Tree on an SSD compared to a conventional hard drive. Testing various influencing factors, such as system load, dimensionality, and index page size, our evaluation leads to interesting insights into the performance of spatial index structures on modern storage layers.
{"title":"On the impact of flash SSDs on spatial indexing","authors":"Tobias Emrich, Franz Graf, H. Kriegel, Matthias Schubert, Marisa Thoma","doi":"10.1145/1869389.1869390","DOIUrl":"https://doi.org/10.1145/1869389.1869390","url":null,"abstract":"Similarity queries are an important query type in multimedia databases. To implement these types of queries, database systems often use spatial index structures like the R*-Tree. However, the majority of performance evaluations for spatial index structures rely on a conventional background storage layer based on conventional hard drives. Since newer devices like solid-state-disks (SSD) have a completely different performance characteristic, it is an interesting question how far existing index structures profit from these modern storage devices. In this paper, we therefore examine the performance behaviour of the R*-Tree on an SSD compared to a conventional hard drive. Testing various influencing factors like system load, dimensionality and page size of the index our evaluation leads to interesting insights into the performance of spatial index structures on modern background storage layers.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131512835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Optimizing read convoys in main-memory query processing
K. A. Ross
International Workshop on Data Management on New Hardware (DaMoN), 2010. doi:10.1145/1869389.1869393

Concurrent read-only scans of memory-resident fact tables can form convoys, which generally help performance because cache misses are amortized over several members of the convoy. Nevertheless, we identify two performance hazards for such convoys. One hazard is underutilization of the memory bandwidth because all members of the convoy hit the same cache lines at the same time, rather than reading several different lines concurrently. The other hazard is a form of interference that occurs on the Sun Niagara T1 and T2 machines under certain workloads. We propose solutions to these hazards, including a local shuffle method that reduces interference, preserves the beneficial aspects of convoy behavior, and increases the effective bandwidth by allowing different members of a convoy to concurrently access different cache lines. We provide experimental validation of the methods on several modern architectures.
{"title":"Optimizing read convoys in main-memory query processing","authors":"K. A. Ross","doi":"10.1145/1869389.1869393","DOIUrl":"https://doi.org/10.1145/1869389.1869393","url":null,"abstract":"Concurrent read-only scans of memory-resident fact tables can form convoys, which generally help performance because cache misses are amortized over several members of the convoy. Nevertheless, we identify two performance hazards for such convoys. One hazard is underutilization of the memory bandwidth because all members of the convoy hit the same cache lines at the same time, rather than reading several different lines concurrently. The other hazard is a form of interference that occurs on the Sun Niagara T1 and T2 machines under certain workloads. We propose solutions to these hazards, including a local shuffle method that reduces interference, preserves the beneficial aspects of convoy behavior, and increases the effective bandwidth by allowing different members of a convoy to concurrently access different cache lines. We provide experimental validation of the methods on several modern architectures.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133930587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}