{"title":"Architectural characterization of XQuery workloads on modern processors","authors":"Rubao Lee, Bihui Duan, Taoying Liu","doi":"10.1145/1363189.1363199","DOIUrl":"https://doi.org/10.1145/1363189.1363199","url":null,"abstract":"As XQuery rapidly emerges as the standard for querying XML documents, it is important to understand the architectural characteristics and behaviors of such workloads. Many efforts have focused on the implementation, optimization, and evaluation of XQuery tools. However, little prior work has studied the architectural and memory-system behaviors of XQuery workloads on modern hardware platforms. This makes it unclear whether modern CPU techniques, such as multi-level caches and hardware branch predictors, can support such workloads well enough. This paper presents a detailed characterization of the architectural behavior of XQuery workloads. We examine four XQuery tools on three hardware platforms (AMD, Intel, and Sun) using well-designed XQuery queries. We report measured architectural data, including L1/L2 cache misses, TLB misses, and branch mispredictions. We believe this information will be useful in understanding XQuery workloads and analyzing potential architectural optimization opportunities for improving XQuery performance.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132693149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pipelined hash-join on multithreaded architectures","authors":"Philip C. Garcia, H. F. Korth","doi":"10.1145/1363189.1363191","DOIUrl":"https://doi.org/10.1145/1363189.1363191","url":null,"abstract":"Multi-core and multithreaded processors present both opportunities and challenges in the design of database query processing algorithms. Previous work has shown the potential for performance gains, but also that, in adverse circumstances, multithreading can actually reduce performance. This paper examines the performance of a pipeline of hash-join operations when executing on multithreaded and multicore processors. We examine the optimal number of threads to execute and the partitioning of the workload across those threads. We then describe a buffer-management scheme that minimizes cache conflicts among the threads. Additionally we compare the performance of full materialization of the output at each stage in the pipeline versus passing pointers between stages.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"53 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116534342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
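As a reading aid, and not part of the record above, here is a minimal Python sketch of a pipeline of hash joins in which probe tuples flow through successive pre-built hash tables without materializing each intermediate result; the relations, column indexes, and tuple layout are illustrative assumptions, not taken from the paper.

```python
# A pipeline of hash joins: build small hash tables once, then stream probe
# tuples through every stage, passing tuples along instead of materializing
# each intermediate join result.

def build(table, key):
    """Build phase: hash each row of `table` on column index `key`."""
    ht = {}
    for row in table:
        ht.setdefault(row[key], []).append(row)
    return ht

def probe_pipeline(stream, stages):
    """Probe phase: push each tuple through every (hash_table, key) stage."""
    for row in stream:
        candidates = [row]
        for ht, key in stages:
            candidates = [r + m for r in candidates for m in ht.get(r[key], [])]
        yield from candidates

# Illustrative relations (not from the paper).
customers = [(1, "alice"), (2, "bob")]        # (cust_id, name)
nations = [("alice", "NL"), ("bob", "US")]    # (name, nation)
orders = [(100, 1), (101, 2), (102, 1)]       # (order_id, cust_id)

stages = [
    (build(customers, 0), 1),  # stage 1: orders.cust_id = customers.cust_id
    (build(nations, 0), 3),    # stage 2: joined tuple's name = nations.name
]
result = list(probe_pipeline(orders, stages))
```

In a multithreaded setting, each stage (or a partition of each hash table) can be assigned to its own thread, which is where the paper's thread-count and buffer-management questions arise.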
{"title":"Large scale Itanium® 2 processor OLTP workload characterization and optimization","authors":"Gerrit Saylor, Badriddine M. Khessib","doi":"10.1145/1140402.1140406","DOIUrl":"https://doi.org/10.1145/1140402.1140406","url":null,"abstract":"Large scale OLTP workloads on modern database servers are well understood across the industry. Their runtime performance characterizations serve to drive both server-side software features and processor-specific design decisions, but are not understood outside of the primary industry stakeholders. We provide a rare glimpse into the performance characterizations of processor- and platform-targeted software optimizations running on a large-scale 32-processor, Intel® Itanium® 2 based ccNUMA platform.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126759018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Processing-in-memory technology for knowledge discovery algorithms","authors":"Jafar Adibi, T. Barrett, Spundun Bhatt, Hans Chalupsky, Jacqueline Chame, Mary W. Hall","doi":"10.1145/1140402.1140405","DOIUrl":"https://doi.org/10.1145/1140402.1140405","url":null,"abstract":"The goal of this work is to gain insight into whether processing-in-memory (PIM) technology can be used to accelerate the performance of link discovery (LD) algorithms, which represent an important class of emerging knowledge discovery techniques. PIM chips that integrate processor logic into memory devices offer a new opportunity for bridging the growing gap between processor and memory speeds, especially for applications with high memory-bandwidth requirements. As LD algorithms are data-intensive and highly parallel, involving read-only queries over large data sets, parallel computing power extremely close (physically) to the data has the potential of providing dramatic computing speedups. For this reason, we evaluated the mapping of LD algorithms to a PIM workstation-class architecture, the DIVA/Godiva hardware testbeds developed by USC/ISI. Accounting for differences in clock speed and data scaling, our analysis shows a performance gain on a single PIM, with the potential for greater improvement when multiple PIMs are used. Measured speedups of 8x are shown on two additional bandwidth benchmarks, even though the Itanium-2 has a clock rate 6X faster.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114501705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Realizing parallelism in database operations: insights from a massively multithreaded architecture","authors":"J. Cieslewicz, Jonathan W. Berry, B. Hendrickson, K. A. Ross","doi":"10.1145/1140402.1140408","DOIUrl":"https://doi.org/10.1145/1140402.1140408","url":null,"abstract":"A new trend in processor design is increased on-chip support for multithreading in the form of both chip multiprocessors and simultaneous multithreading. Recent research in database systems has begun to explore increased thread-level parallelism made possible by these new multicore and multithreaded processors. The question of how best to use this new technology remains open, particularly as the number of cores per chip and threads per core increase. In this paper we use an existing massively multithreaded architecture, the Cray MTA-2, to explore the implications that a larger degree of on-chip multithreading may have for parallelism in database operations. We find that parallelism in database operations is easy to achieve on the MTA-2 and that, with little effort, parallelism can be made to scale linearly with the number of available processor cores.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"191 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122357215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"B-tree indexes, interpolation search, and skew","authors":"G. Graefe","doi":"10.1145/1140402.1140409","DOIUrl":"https://doi.org/10.1145/1140402.1140409","url":null,"abstract":"Recent performance improvements in storage hardware have benefited bandwidth much more than latency. Among other implications, this trend favors large B-tree pages. Recent performance improvements in processor hardware also have benefited processing bandwidth much more than memory latency. Among other implications, this trend favors adding calculations if they save cache faults. With small calculations guiding the search directly to the desired key, interpolation search complements these trends much better than binary search. It performs well if the distribution of key values is perfectly uniform, but it can be useless and even wasteful otherwise. This paper collects and describes more than a dozen techniques for interpolation search in B-tree indexes. Most of them attempt to avoid skew or to detect skew very early and then to avoid its bad effects. Some of these methods are part of the folklore of B-tree search, whereas other techniques are new. The purpose of this survey is to encourage research into such techniques and their performance on modern hardware.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"959 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133987327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
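Not from the record above, but as a concrete illustration of the surveyed technique: a minimal interpolation search over a sorted array (a B-tree would apply the same position estimate within a single page). The data set is illustrative, and the skew caveat in the docstring mirrors the abstract's warning.

```python
# Minimal interpolation search on a sorted list of integer keys.
def interpolation_search(keys, target):
    """Return the index of target in sorted keys, or -1 if absent.

    Each step estimates the target's position from the key values instead
    of halving the range as binary search does: roughly O(log log n) probes
    on uniformly distributed keys, but skewed keys can degrade it badly,
    which is exactly the hazard the survey above addresses."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:          # all keys in range equal: avoid /0
            pos = lo
        else:                             # linear estimate of the position
            pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1

uniform = list(range(0, 1000, 7))         # perfectly uniform keys: best case
hit = interpolation_search(uniform, 700)  # lands on the key almost directly
miss = interpolation_search(uniform, 701)
```

On the uniform data the first estimate already hits the key; a heavily skewed key set (e.g. mostly small values with a few huge outliers) would make the estimate land far from the target repeatedly, which is what the paper's skew-avoidance techniques address.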
{"title":"Using secure coprocessors for privacy preserving collaborative data mining and analysis","authors":"Bishwaranjan Bhattacharjee, N. Abe, Kenneth A. Goldman, B. Zadrozny, Vamsavardhana R. Chillakuru, Marysabel del Carpio, C. Apté","doi":"10.1145/1140402.1140404","DOIUrl":"https://doi.org/10.1145/1140402.1140404","url":null,"abstract":"Secure coprocessors have traditionally been used as a keystone of a security subsystem, eliminating the need to protect the rest of the subsystem with physical security measures. With technological advances and hardware miniaturization, they have become increasingly powerful. This opens up the possibility of using them for non-traditional uses. This paper describes a solution for privacy-preserving data sharing and mining using cryptographically secure but resource-limited coprocessors. It uses memory-light data mining methodologies along with a lightweight database engine with federation capability, running on a coprocessor. The data to be shared resides with the enterprises that want to collaborate. This system will allow multiple enterprises, which are generally not allowed to share data, to do so solely for the purpose of detecting particular types of anomalies and for generating alerts. We also present results from experiments which demonstrate the value of such collaborations.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128836417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Architecture-conscious hashing","authors":"M. Zukowski, S. Héman, P. Boncz","doi":"10.1145/1140402.1140410","DOIUrl":"https://doi.org/10.1145/1140402.1140410","url":null,"abstract":"Hashing is one of the fundamental techniques used to implement query processing operators such as grouping, aggregation and join. This paper studies the interaction between modern computer architecture and hash-based query processing techniques. First, we focus on extracting maximum hashing performance from super-scalar CPUs. In particular, we discuss fast hash functions, ways to efficiently handle multi-column keys, and propose the use of a recently introduced hashing scheme called Cuckoo Hashing over the commonly used bucket-chained hashing. In the second part of the paper, we focus on CPU cache usage by dynamically partitioning data streams such that the partial hash tables fit in the CPU cache. Conventional partitioning works as a separate preparatory phase, forcing materialization, which may require I/O if the stream does not fit in RAM. We introduce best-effort partitioning, a technique that interleaves partitioning with execution of hash-based query processing operators and avoids I/O. In the process, we show how to prevent cacheline-alignment issues in partitioning that can strongly decrease throughput. We also demonstrate overall query processing performance when both CPU-efficient hashing and best-effort partitioning are combined.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127646870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
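For readers unfamiliar with the scheme the abstract advocates over bucket-chained hashing, here is a minimal sketch of cuckoo hashing: each key has exactly two candidate slots (one per table), so a lookup touches at most two fixed locations, which is what makes it attractive for CPU caches. The class name, table sizing, and use of Python's built-in hash are illustrative assumptions, not the paper's implementation.

```python
# Minimal cuckoo hashing: two tables, two hash functions, at most two probes
# per lookup. `put` does not check for an existing key; a full implementation
# would update in place.

class CuckooHash:
    def __init__(self, size=8):
        self.size = size
        self.tables = [[None] * size, [None] * size]

    def _slots(self, key):
        # Two slot computations; real systems use two independent strong hashes.
        return (hash(key) % self.size, hash((key, 1)) % self.size)

    def get(self, key):
        # A key, if present, sits in one of exactly two slots.
        for t, s in zip(self.tables, self._slots(key)):
            if t[s] is not None and t[s][0] == key:
                return t[s][1]
        return None

    def put(self, key, value, max_kicks=32):
        item = (key, value)
        for _ in range(max_kicks):
            for i in range(2):
                s = self._slots(item[0])[i]
                if self.tables[i][s] is None:
                    self.tables[i][s] = item
                    return
                # Evict the occupant ("kick") and try to re-place it.
                self.tables[i][s], item = item, self.tables[i][s]
        self._grow()          # kick chain too long: rebuild with larger tables
        self.put(*item)

    def _grow(self):
        old = [e for t in self.tables for e in t if e is not None]
        self.size *= 2
        self.tables = [[None] * self.size, [None] * self.size]
        for k, v in old:
            self.put(k, v)

h = CuckooHash(size=4)
for k in range(1, 7):
    h.put(k, k * 10)
```

The bounded probe count is the architectural point: unlike a bucket chain, a failed lookup never walks a linked list of uncached nodes.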
{"title":"Proceedings of the Ninth International Workshop on Data Management on New Hardware, DaMoN 2013, New York, NY, USA, June 24, 2013","authors":"","doi":"10.1145/2485278","DOIUrl":"https://doi.org/10.1145/2485278","url":null,"abstract":"","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133053763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}