Word-Aligned Hybrid (WAH) compression is a prominent example of a lightweight compression scheme for bitmap indices that takes the word size of the underlying architecture into account. This is a concession to commodity CPUs, where operations below word granularity perform poorly. With the emergence of novel hardware classes, such compromises may no longer be appropriate. Field-programmable gate arrays (FPGAs) do not even have a meaningful "word size". In this work, we reconsider strategies for bitmap compression in the light of modern hardware architectures. Rather than tuning compression toward a fixed word size, we propose to tune the word size toward optimal compression. The resulting compression scheme, Variable Word Length Word-Aligned Hybrid (VWLWAH), improves compression rates by almost 75% while maintaining line-rate performance on FPGAs.
Variable word length word-aligned hybrid compression. Florian Grieskamp, Roland Kühn, J. Teubner. Proceedings of the 16th International Workshop on Data Management on New Hardware, 2020-06-14. DOI: 10.1145/3399666.3399935 (https://doi.org/10.1145/3399666.3399935)
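To make the word-size trade-off in the abstract above concrete, the following Python sketch implements a plain WAH encoder whose word length `w` is a parameter and then picks the `w` that gives the smallest output for a given bitmap. It is an illustration only: the function names (`wah_encode`, `wah_decode`, `best_word_length`) and the exact bit layout are assumptions made for this example, and this is not the VWLWAH format from the paper.

```python
def wah_encode(bits, w):
    """Encode a list of 0/1 ints into w-bit WAH words (returned as Python ints).

    Assumed layout: a literal word has its top bit 0 followed by w-1 payload
    bits; a fill word has its top bit 1, then the fill bit, then a (w-2)-bit
    count of consecutive groups that consist only of the fill bit.
    """
    if w < 3:
        raise ValueError("word length must be at least 3 bits")
    g = w - 1                          # payload bits per group
    max_run = (1 << (w - 2)) - 1       # largest count a fill word can hold
    padded = list(bits) + [0] * (-len(bits) % g)
    words = []
    for i in range(0, len(padded), g):
        group = padded[i:i + g]
        if all(b == group[0] for b in group):
            fill = group[0]
            same_fill = bool(words) and (words[-1] >> (w - 2)) == (0b10 | fill)
            if same_fill and (words[-1] & max_run) < max_run:
                words[-1] += 1         # extend the running fill word
            else:
                words.append((1 << (w - 1)) | (fill << (w - 2)) | 1)
        else:
            literal = 0
            for b in group:
                literal = (literal << 1) | b
            words.append(literal)
    return words


def wah_decode(words, w, n):
    """Invert wah_encode and return the first n bits."""
    g = w - 1
    out = []
    for word in words:
        if word >> (w - 1):            # fill word
            fill = (word >> (w - 2)) & 1
            count = word & ((1 << (w - 2)) - 1)
            out.extend([fill] * (count * g))
        else:                          # literal word
            out.extend((word >> (g - 1 - k)) & 1 for k in range(g))
    return out[:n]


def best_word_length(bits, candidates):
    """Return the candidate word length with the smallest encoded size in bits."""
    return min(candidates, key=lambda w: len(wah_encode(bits, w)) * w)
```

On a CPU one would normally fix `w` at 32 or 64. The sketch only shows that, once the hardware no longer dictates a word size, `w` becomes a tunable parameter of the encoding.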
Hawon Chu, Seounghyun Kim, Joo-Young Lee, Young-Kyoon Suh
In this paper, we conduct an empirical study across modern GPU-accelerated DBMSes with TPC-H workloads. Our rigorous experiments demonstrate that the studied DBMSes appear to utilize GPU resources effectively, but they neither scale well with growing databases nor fully support some complex analytical queries. Thus, we claim that GPU DBMSes still need further engineering to achieve better analytical performance.
Empirical evaluation across multiple GPU-accelerated DBMSes. Hawon Chu, Seounghyun Kim, Joo-Young Lee, Young-Kyoon Suh. Proceedings of the 16th International Workshop on Data Management on New Hardware, 2020-06-14. DOI: 10.1145/3399666.3399907 (https://doi.org/10.1145/3399666.3399907)
Mahmoud Mohsen, Norman May, Christian Färber, David Broneske
Efficient compression of integer vectors is critical in dictionary-encoded column stores like SAP HANA to keep more data in the limited and precious main memory. Past research focused on lightweight compression techniques that accept lower compression ratios in exchange for low-latency data access. Consequently, only a few columns in a wide table benefit from lightweight and effective compression schemes like run-length encoding, prefix compression, or sparse encoding. Apart from bit-packing, other columns remained uncompressed, which clearly misses opportunities for a better compression ratio for many columns. Furthermore, the main executor for compression was the CPU, as compression involves heavy data transfer. Especially when used with co-processors, the data transfer overhead wipes out performance gains from co-processor usage. In this paper, we investigate whether we can achieve good compression ratios even for previously uncompressed columns by using binary packing and prefix suppression offloaded to an FPGA. As a streaming processor, an FPGA is the perfect candidate to outsource the compression task. With our OpenCL-based implementation, we saturate the available PCIe bus during compression on the FPGA while using less than a third of the FPGA's resources. Furthermore, our real-world experiments against CPU-based SAP HANA show a performance improvement of around a factor of 2 in compression throughput while compressing the data down to 60% of the best SAP HANA compression technique.
FPGA-Accelerated compression of integer vectors. Mahmoud Mohsen, Norman May, Christian Färber, David Broneske. Proceedings of the 16th International Workshop on Data Management on New Hardware, 2020-06-14. DOI: 10.1145/3399666.3399932 (https://doi.org/10.1145/3399666.3399932)
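As a rough software illustration of the binary packing mentioned in the abstract above, the sketch below splits an integer vector into blocks and stores every value of a block with just enough bits for the block maximum. It reads "prefix suppression" as dropping the leading zero bits shared by all values in a block, which is an assumption; the abstract does not specify the format. The names `pack_block`, `binary_pack`, and `binary_unpack` and the default block size are hypothetical, and the code does not reflect the paper's OpenCL implementation.

```python
def pack_block(values):
    """Pack non-negative ints using the fewest bits that fit the block maximum.

    Returns (width, count, packed), where packed is one Python int holding
    the values back to back, lowest index in the lowest bits.
    """
    width = max(1, max(values).bit_length())   # shared leading zeros are dropped
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    return width, len(values), packed


def unpack_block(width, count, packed):
    """Recover the values of one block produced by pack_block."""
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


def binary_pack(values, block_size=128):
    """Split values into fixed-size blocks and pack each block independently."""
    return [pack_block(values[i:i + block_size])
            for i in range(0, len(values), block_size)]


def binary_unpack(blocks):
    """Concatenate the unpacked contents of all blocks."""
    out = []
    for width, count, packed in blocks:
        out.extend(unpack_block(width, count, packed))
    return out
```

Choosing the bit width per block rather than per column keeps a single large outlier from inflating the width of every other value in the vector.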
In recent years, we have seen a proliferation of FPGA-based key-value stores (KVSs) [1-3, 5-7, 10], driven by the need for more efficient large-scale data management and storage solutions. In this context, FPGAs are useful because they offer network-bound performance even with small key-value pairs, and near-data processing at a fraction of the energy budget of regular servers. Even though the first FPGA-based key-value stores appeared as early as 2013 and have evolved significantly in the meantime, almost no attention has been paid to offering transactions. Today, however, now that such systems are becoming increasingly practical, we need to ensure consistency guarantees for concurrent clients (transactions). This position paper makes the case that adding transaction support is not particularly expensive compared to other parts of these systems, and that in the future all FPGA-based KVSs should provide some form of transactional guarantees. In the remainder of this paper, we present a high-level view of the typical pipelined architecture that most existing FPGA-based KVS designs follow, and show three different ways of implementing transactions, with increasing sophistication: from operation batching, through two-phase locking (2PL), to a simplified snapshot isolation model.
Let's add transactions to FPGA-based key-value stores! Z. István. Proceedings of the 16th International Workshop on Data Management on New Hardware, 2020-06-14. DOI: 10.1145/3399666.3399909 (https://doi.org/10.1145/3399666.3399909)
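Of the three options named in the abstract above, two-phase locking is the easiest to show in a few lines. The sketch below is a single-threaded software model of strict 2PL on a key-value store: a transaction acquires an exclusive lock on every key it touches and releases all locks only at commit or abort. The class and method names, the exclusive-only locks, and the "fail immediately on conflict" policy are assumptions for illustration, not the hardware design discussed in the paper.

```python
class LockConflict(Exception):
    """Raised when a key is already locked by another transaction."""


class TwoPhaseLockingStore:
    def __init__(self):
        self.data = {}      # committed key-value pairs
        self.owner = {}     # key -> id of the transaction holding its lock
        self.writes = {}    # transaction id -> buffered, uncommitted writes

    def _lock(self, txn, key):
        # Growing phase: take the lock, or fail if someone else holds it.
        holder = self.owner.setdefault(key, txn)
        if holder != txn:
            raise LockConflict(f"key {key!r} is locked by transaction {holder}")

    def _release(self, txn):
        # Shrinking phase: drop every lock held by this transaction at once.
        for key in [k for k, t in self.owner.items() if t == txn]:
            del self.owner[key]

    def get(self, txn, key):
        self._lock(txn, key)
        # A transaction sees its own buffered writes before committed data.
        return self.writes.get(txn, {}).get(key, self.data.get(key))

    def put(self, txn, key, value):
        self._lock(txn, key)
        self.writes.setdefault(txn, {})[key] = value

    def commit(self, txn):
        self.data.update(self.writes.pop(txn, {}))
        self._release(txn)

    def abort(self, txn):
        self.writes.pop(txn, None)
        self._release(txn)
```

A client that receives `LockConflict` would typically abort and retry. Shared read locks and waiting queues are left out to keep the model small.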
Stefan Noll, J. Teubner, Norman May, Alexander Böhm
Debugging and tuning database systems is very challenging. Using common profiling tools is often not sufficient because they identify the machine instruction rather than the instance of a data structure that causes a performance problem. This leaves a problem's root cause such as memory hotspots or poor data layouts hidden. The state-of-the-art solution is to augment classical profiling with a memory trace. However, current approaches for collecting memory traces are not usable in practice due to their large runtime overhead. In this work, we leverage a mechanism available in modern processors to collect memory traces via hardware-based sampling. We evaluate our approach using a commercial and an open-source database system running the JCC-H benchmark. In particular, we demonstrate that our approach is practical due to its low runtime overhead and we illustrate how memory traces uncover new insights into the memory access characteristics of database systems.
Analyzing memory accesses with modern processors. Stefan Noll, J. Teubner, Norman May, Alexander Böhm. Proceedings of the 16th International Workshop on Data Management on New Hardware, 2020-06-14. DOI: 10.1145/3399666.3399896 (https://doi.org/10.1145/3399666.3399896)
Pub Date: 2020-06-14, DOI: 10.1007/s00778-022-00744-2
Johannes Pietrzyk, Dirk Habich, Wolfgang Lehner
To share or not to share vector registers? Johannes Pietrzyk, Dirk Habich, Wolfgang Lehner. Proceedings of the 16th International Workshop on Data Management on New Hardware, 2020-06-14. DOI: 10.1007/s00778-022-00744-2 (https://doi.org/10.1007/s00778-022-00744-2)
Anil Shanbhag, Nesime Tatbul, David Cohen, S. Madden
New data storage technologies such as the recently introduced Intel® Optane™ DC Persistent Memory Module (PMM) offer exciting opportunities for optimizing the query processing performance of database workloads. In particular, the unique combination of low latency, byte-addressability, persistence, and large capacity makes persistent memory (PMem) an attractive alternative alongside DRAM and SSDs. Exploring the performance characteristics of this new medium is the first critical step in understanding how it will impact the design and performance of database systems. In this paper, we present one of the first experimental studies on characterizing Intel® Optane™ DC PMM's performance behavior in the context of analytical database workloads. First, we analyze basic access patterns common in such workloads, such as sequential, selective, and random reads, as well as the complete Star Schema Benchmark, comparing standalone DRAM- and PMem-based implementations. Then we extend our analysis to join algorithms over larger datasets, which require using DRAM and PMem in a hybrid fashion while paying special attention to the read-write asymmetry of PMem. Our study reveals interesting performance tradeoffs that can help guide the design of next-generation OLAP systems in the presence of persistent memory in the storage hierarchy.
Large-scale in-memory analytics on Intel® Optane™ DC persistent memory. Anil Shanbhag, Nesime Tatbul, David Cohen, S. Madden. Proceedings of the 16th International Workshop on Data Management on New Hardware, 2020-06-14. DOI: 10.1145/3399666.3399933 (https://doi.org/10.1145/3399666.3399933)
Tiemo Bang, Norman May, Ilia Petrov, Carsten Binnig
In this paper, we set out to revisit the results of "Staring into the Abyss [...] of Concurrency Control with [1000] Cores" [27] and analyse in-memory DBMSs on today's large hardware. Contrary to the original authors' assumption, today we do not see single-socket CPUs with 1000 cores. Instead, multi-socket hardware has made its way into production data centres. Hence, we follow up on this prior work with an evaluation of the characteristics of concurrency control schemes on real production multi-socket hardware with 1568 cores. To our surprise, we made several interesting findings, which we report on in this paper.
The tale of 1000 Cores: an evaluation of concurrency control on real(ly) large multi-socket hardware. Tiemo Bang, Norman May, Ilia Petrov, Carsten Binnig. Proceedings of the 16th International Workshop on Data Management on New Hardware, 2020-06-14. DOI: 10.1145/3399666.3399910 (https://doi.org/10.1145/3399666.3399910)
Yinjun Wu, Kwanghyun Park, Rathijit Sen, Brian Kroth, Jaeyoung Do
Non-volatile memory (NVM) is an emerging technology that combines the persistence of large-capacity storage devices with the low access latency and byte-addressability of traditional DRAM. In this paper, we provide extensive performance evaluations of a recently released NVM device, Intel Optane DC Persistent Memory (PMem), under different configurations with several micro-benchmark tools. Further, we evaluate OLTP and OLAP database workloads with Microsoft SQL Server 2019 when using PMem as buffer pool or persistent storage. From the lessons learned, we share some recommendations for future DBMS design with PMem, e.g., simple hardware or software changes are not enough for the best use of PMem in DBMSs.
Lessons learned from the early performance evaluation of Intel optane DC persistent memory in DBMS. Yinjun Wu, Kwanghyun Park, Rathijit Sen, Brian Kroth, Jaeyoung Do. Proceedings of the 16th International Workshop on Data Management on New Hardware, 2020-05-15. DOI: 10.1145/3399666.3399898 (https://doi.org/10.1145/3399666.3399898)
Hardware acceleration of database query processing can be done with the help of FPGAs. In particular, they are partially reconfigurable at runtime, which allows for the adaptation to a variety of queries. Reconfiguration itself, however, takes some time. This paper presents optimizations based on query sequences, which reduce the impact of the reconfigurations. Knowledge of upcoming queries is used to avoid reconfiguration overhead. We evaluate our optimizations with a calibrated model. Improvements in execution time of up to 28% can be obtained even with sequences of only two queries.
The ReProVide query-sequence optimization in a hardware-accelerated DBMS. G. LekshmiB., Andreas Becher, K. Meyer-Wegener. Proceedings of the 16th International Workshop on Data Management on New Hardware, 2020-05-01. DOI: 10.1145/3399666.3399926 (https://doi.org/10.1145/3399666.3399926)
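The idea of using knowledge of upcoming queries to avoid reconfiguration can be illustrated with a toy cost model. The sketch below assumes that each query can run under several accelerator configurations with different execution times and that switching configurations costs a fixed amount of time; a small dynamic program then picks the cheapest configuration per query for the whole sequence. The function `plan_sequence`, the configuration names, and all timings are hypothetical and are not taken from the ReProVide system or its calibrated model.

```python
def plan_sequence(queries, reconfig_cost, initial=None):
    """Choose one configuration per query so that total time is minimal.

    queries: list of dicts mapping configuration name -> execution time.
    reconfig_cost: time charged whenever the loaded configuration changes.
    initial: configuration loaded before the first query (None if empty).
    Returns (total_time, list of chosen configurations).
    """
    # best[c] = (cost, plan) of the cheapest plan that ends with c loaded
    best = {initial: (0.0, [])}
    for options in queries:
        nxt = {}
        for cfg, exec_time in options.items():
            cost, plan = min(
                (prev_cost + (0.0 if prev == cfg else reconfig_cost) + exec_time,
                 prev_plan)
                for prev, (prev_cost, prev_plan) in best.items()
            )
            nxt[cfg] = (cost, plan + [cfg])
        best = nxt
    return min(best.values())


# Hypothetical two-query sequence: specialised configurations run faster,
# but each switch costs 2.5 time units.
sequence = [
    {"filter": 1.0, "generic": 3.0},
    {"join": 1.0, "generic": 3.0},
]
total, plan = plan_sequence(sequence, reconfig_cost=2.5, initial="generic")
```

With these made-up numbers, always loading the fastest configuration for the current query would take 7.0 time units, whereas planning over the two-query sequence keeps the already loaded `generic` configuration and finishes in 6.0.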