FPGA-accelerated group-by aggregation using synchronizing caches
Ildar Absalyamov, Prerna Budhkar, Skyler Windh, R. Halstead, W. Najjar, V. Tsotras
DOI: 10.1145/2933349.2933360
Recent trends in hardware have dramatically dropped the price of RAM and shifted the focus from systems operating on disk-resident data to in-memory solutions. In this environment, high memory access latency, also known as the memory wall, becomes the biggest data processing bottleneck. Traditional CPU-based architectures address this problem with large cache hierarchies, but algorithms with poor locality limit the benefits of caching. Hardware multithreading, in turn, provides a generic solution that does not rely on algorithm-specific locality properties. In this paper we present an FPGA-accelerated implementation of in-memory group-by hash aggregation. Our design relies on hardware multithreading to efficiently mask long memory access latency by implementing a custom operation datapath on the FPGA. We propose using CAMs (Content Addressable Memories) as a mechanism for synchronization and local pre-aggregation; to the best of our knowledge, this is the first work to use CAMs as a synchronizing cache. We evaluate aggregation throughput against state-of-the-art multithreaded software implementations and demonstrate that the FPGA-accelerated approach significantly outperforms them on large grouping-key cardinalities, yielding speedups of up to 10x.
{"title":"FPGA-accelerated group-by aggregation using synchronizing caches","authors":"Ildar Absalyamov, Prerna Budhkar, Skyler Windh, R. Halstead, W. Najjar, V. Tsotras","doi":"10.1145/2933349.2933360","DOIUrl":"https://doi.org/10.1145/2933349.2933360","url":null,"abstract":"Recent trends in hardware have dramatically dropped the price of RAM and shifted focus from systems operating on disk-resident data to in-memory solutions. In this environment high memory access latency, also known as memory wall, becomes the biggest data processing bottleneck. Traditional CPU-based architectures solved this problem by introducing large cache hierarchies. However algorithms which experience poor locality can limit the benefits of caching. In turn, hardware multithreading provides a generic solution that does not rely on algorithm-specific locality properties.\u0000 In this paper we present an FPGA-accelerated implementation of in-memory group-by hash aggregation. Our design relies on hardware multithreading to efficiently mask long memory access latency by implementing a custom operation datapath on FPGA. We propose using CAMs (Content Addressable Memories) as a mechanism of synchronization and local pre-aggregation. To the best of our knowledge this is the first work, which uses CAMs as a synchronizing cache. We evaluate aggregation throughput against the state-of-the-art multithreaded software implementations and demonstrate that the FPGA-accelerated approach significantly outperforms them on large grouping key cardinalities and yields speedup up to 10x.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126998782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
OLTP on a server-grade ARM: power, throughput and latency comparison
Utku Sirin, Raja Appuswamy, A. Ailamaki
DOI: 10.1145/2933349.2933359

Although scaling out over low-power cores is an alternative to power-hungry Intel Xeon processors for reducing power overheads, such cores have proven inadequate for complex, non-parallelizable workloads. With the introduction of the 64-bit ARMv8 architecture, however, traditionally low-power ARM processors have become powerful enough to run computationally intensive server-class applications. In this study, we compare a high-performance Intel x86 processor with a commercial implementation of the ARM Cortex-A57, measuring power consumption, throughput, and latency when running OLTP workloads. Our results show that the ARM processor consumes 3 to 15 times less power than the x86, while penalizing OLTP throughput by a much lower factor (1.7 to 3). As a result, the significant power savings deliver up to 9 times higher energy efficiency. The x86's heavily optimized, power-hungry micro-architectural structures contribute to throughput only marginally; consequently, the x86 wastes power when utilization is low, while the lightweight ARM processor draws power only in proportion to its utilization, achieving energy proportionality. On the other hand, ARM's latency can be up to 11x higher than the x86's toward the tail of the latency distribution, making the x86 more suitable for certain types of service-level agreements.
{"title":"OLTP on a server-grade ARM: power, throughput and latency comparison","authors":"Utku Sirin, Raja Appuswamy, A. Ailamaki","doi":"10.1145/2933349.2933359","DOIUrl":"https://doi.org/10.1145/2933349.2933359","url":null,"abstract":"Although scaling out of low-power cores is an alternative to power-hungry Intel Xeon processors for reducing the power overheads, they have proven inadequate for complex, non-parallelizable workloads. On the other hand, by the introduction of the 64-bit ARMv8 architecture, traditionally low power ARM processors have become powerful enough to run computationally intensive server-class applications.\u0000 In this study, we compare a high-performance Intel x86 processor with a commercial implementation of the ARM Cortex-A57. We measure the power used, throughput delivered and latency quantified when running OLTP workloads. Our results show that the ARM processor consumes 3 to 15 times less power than the x86, while penalizing OLTP throughput by a much lower factor (1.7 to 3). As a result, the significant power savings deliver up to 9 times higher energy efficiency. The x86's heavily optimized power-hungry micro-architectural structures contribute to throughput only marginally. As a result, the x86 wastes power when utilization is low, while lightweight ARM processor consumes only as much power as it is utilized, achieving energy proportionality. On the other hand, ARM's quantified latency can be up to 11x higher than x86 towards to the tail of latency distribution, making x86 more suitable for certain type of service-level agreements.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132500934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SSD in-storage computing for list intersection
Jianguo Wang, Dongchul Park, Yang-Suk Kee, Y. Papakonstantinou, S. Swanson
DOI: 10.1145/2933349.2933353
Recently, there has been renewed interest in in-storage computing in the context of solid state drives (SSDs), called "Smart SSDs." Smart SSDs allow application-specific code to execute inside the SSD, letting applications take advantage of the high internal bandwidth these devices provide. This work studies the offloading of list intersection to Smart SSDs, because intersection is prominent in both search engines and analytics queries. Intersection is also interesting because its algorithms are more complex than plain scans: they are affected by multiple parameters, as we show, and offer lessons that carry over to other operations. We investigate whether Smart SSDs can accelerate list intersection and reduce energy consumption. Intuitively, the answer is yes; however, the performance tradeoffs on real devices are complex. We implement list intersection on a real Samsung Smart SSD research prototype and provide an analytical model that captures the key factors in overall performance and identifies when list intersection can benefit from Smart SSDs. Finally, we conduct experiments on the Samsung Smart SSD. Based on the analytical and experimental results, we offer suggestions both for SSD vendors on how to build more capable Smart SSDs and for applications on how to make full use of the functionality that Smart SSDs provide.
{"title":"SSD in-storage computing for list intersection","authors":"Jianguo Wang, Dongchul Park, Yang-Suk Kee, Y. Papakonstantinou, S. Swanson","doi":"10.1145/2933349.2933353","DOIUrl":"https://doi.org/10.1145/2933349.2933353","url":null,"abstract":"Recently, there has been a renewed interest of in-storage computing in the context of solid state drives (SSDs), called \"Smart SSDs.\" Smart SSDs allow application-specific code to execute inside SSDs. This allows applications to take advantage of the high internal bandwidth that Smart SSDs provide. This work studies the offloading of list intersection into Smart SSDs, because intersection is prominent in both search engines and analytics queries. Furthermore, intersection is interesting because the algorithms are more complex than plain scans; they are affected by multiple parameters, as we show, and provide lessons that can be used in other operations also.\u0000 We are interested to know whether Smart SSDs can accelerate the processing of list intersection and reduce the consumed energy. Intuitively, the answer is yes. However, the performance tradeoffs on real devices are complex. We implement list intersection into a real Samsung Smart SSD research prototype. We also provide an analytical model to understand the key factors to the overall performance, and when list intersection can benefit from Smart SSDs. Finally, we conduct experiments on the Samsung Smart SSD. Based on the results (both analytical and experimental), we provide many suggestions for both SSD vendors on how to manufacture powerful Smart SSDs and for applications on how to make full use of the functionalities that Smart SSDs provide.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"453 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133847822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Larger-than-memory data management on modern storage hardware for in-memory OLTP database systems
Lin Ma, Joy Arulraj, Sam Zhao, Andrew Pavlo, Subramanya R. Dulloor, Michael J. Giardino, Jeff Parkhurst, J. L. Gardner, K. Doshi, S. Zdonik
DOI: 10.1145/2933349.2933358
In-memory database management systems (DBMSs) outperform disk-oriented systems for on-line transaction processing (OLTP) workloads, but this improved performance is only achievable when the database is smaller than the amount of physical memory available in the system. To overcome this limitation, some in-memory DBMSs can move cold data out of volatile DRAM to secondary storage; such data then appears as if it resides in memory with the rest of the database even though it does not. Although several implementations of this type of cold-data storage have been proposed, there has not been a thorough evaluation of the design decisions involved, such as policies for when to evict tuples and how to bring them back when they are needed. These choices are further complicated by the varying performance characteristics of different storage devices, including future non-volatile memory technologies. We explore these issues in this paper and discuss several approaches to solve them. We implemented all of these approaches in an in-memory DBMS and evaluated them using five different storage technologies. Our results show that choosing the best strategy based on the hardware improves throughput by 92-340% over a generic configuration.
{"title":"Larger-than-memory data management on modern storage hardware for in-memory OLTP database systems","authors":"Lin Ma, Joy Arulraj, Sam Zhao, Andrew Pavlo, Subramanya R. Dulloor, Michael J. Giardino, Jeff Parkhurst, J. L. Gardner, K. Doshi, S. Zdonik","doi":"10.1145/2933349.2933358","DOIUrl":"https://doi.org/10.1145/2933349.2933358","url":null,"abstract":"In-memory database management systems (DBMSs) outperform disk-oriented systems for on-line transaction processing (OLTP) workloads. But this improved performance is only achievable when the database is smaller than the amount of physical memory available in the system. To overcome this limitation, some in-memory DBMSs can move cold data out of volatile DRAM to secondary storage. Such data appears as if it resides in memory with the rest of the database even though it does not.\u0000 Although there have been several implementations proposed for this type of cold data storage, there has not been a thorough evaluation of the design decisions in implementing this technique, such as policies for when to evict tuples and how to bring them back when they are needed. These choices are further complicated by the varying performance characteristics of different storage devices, including future non-volatile memory technologies. We explore these issues in this paper and discuss several approaches to solve them. We implemented all of these approaches in an in-memory DBMS and evaluated them using five different storage technologies. Our results show that choosing the best strategy based on the hardware improves throughput by 92-340% over a generic configuration.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127180709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On testing persistent-memory-based software
Ismail Oukid, Daniel Booss, Adrien Lespinasse, Wolfgang Lehner
DOI: 10.1145/2933349.2933354
Leveraging Storage Class Memory (SCM) as universal memory, i.e., as memory and storage at the same time, has deep implications for database architectures: it becomes possible to store a single copy of the data in SCM and operate on it directly at a fine granularity. However, exposing the whole database to the application through direct access dramatically increases the risk of data corruption. In this paper we propose a lightweight on-line testing framework that helps find and debug SCM-related errors that can occur upon software or power failures. Our testing framework simulates failures in critical code paths and achieves fast code coverage by leveraging call-stack information to limit duplicate testing. It also partially covers the errors that might arise as a result of reordered memory operations. We show through an experimental evaluation that our testing framework is fast enough to be used with large software systems, and we discuss its use during the development of our in-house persistent SCM allocator.
{"title":"On testing persistent-memory-based software","authors":"Ismail Oukid, Daniel Booss, Adrien Lespinasse, Wolfgang Lehner","doi":"10.1145/2933349.2933354","DOIUrl":"https://doi.org/10.1145/2933349.2933354","url":null,"abstract":"Leveraging Storage Class Memory (SCM) as a universal memory--i.e. as memory and storage at the same time--has deep implications on database architectures. It becomes possible to store a single copy of the data in SCM and directly operate on it at a fine granularity. However, exposing the whole database with direct access to the application dramatically increases the risk of data corruption. In this paper we propose a lightweight on-line testing framework that helps find and debug SCM-related errors that can occur upon software or power failures. Our testing framework simulates failures in critical code paths and achieves fast code coverage by leveraging call stack information to limit duplicate testing. It also partially covers the errors that might arise as a result of reordered memory operations. We show through an experimental evaluation that our testing framework is fast enough to be used with large software systems and discuss its use during the development of our in-house persistent SCM allocator.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116076585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In memory processing of massive point clouds for multi-core systems
K. Kyzirakos, F. Alvanaki, M. Kersten
DOI: 10.1145/2933349.2933356

LIDAR is a popular remote sensing method used to examine the surface of the Earth. LIDAR instruments use light in the form of a pulsed laser to measure ranges (variable distances), generating vast amounts of precise three-dimensional point data that describe the shape of the Earth. Processing large collections of point cloud data and combining them with auxiliary GIS data remains an open research problem. Past research in geographic information systems focused on handling large collections of complex geometric objects stored on disk, and most algorithms have been designed and studied in a single-threaded setting even though multi-core systems are well established. In this paper, we describe parallel alternatives to known algorithms for evaluating spatial selections over point clouds and spatial joins between point clouds and rectangle collections.
{"title":"In memory processing of massive point clouds for multi-core systems","authors":"K. Kyzirakos, F. Alvanaki, M. Kersten","doi":"10.1145/2933349.2933356","DOIUrl":"https://doi.org/10.1145/2933349.2933356","url":null,"abstract":"LIDAR is a popular remote sensing method used to examine the surface of the Earth. LIDAR instruments use light in the form of a pulsed laser to measure ranges (variable distances) and generate vast amounts of precise three dimensional point data describing the shape of the Earth. Processing large collections of point cloud data and combining them with auxiliary GIS data remain an open research problem.\u0000 Past research in the area of geographic information systems focused on handling large collections of complex geometric objects stored on disk and most algorithms have been designed and studied in a single-thread setting even though multi-core systems are well established. In this paper, we describe parallel alternatives of known algorithms for evaluating spatial selections over point clouds and spatial joins between point clouds and rectangle collections.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128298234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The ART of practical synchronization
Viktor Leis, F. Scheibner, A. Kemper, Thomas Neumann
DOI: 10.1145/2933349.2933352
The performance of transactional database systems is critically dependent on the efficient synchronization of in-memory data structures. The traditional approach, fine-grained locking, does not scale on modern hardware. Lock-free data structures, in contrast, scale very well but are extremely difficult to implement and often require additional indirections. In this work, we argue for a middle ground, i.e., synchronization protocols that use locking, but only sparingly. We synchronize the Adaptive Radix Tree (ART) using two such protocols, Optimistic Lock Coupling and Read-Optimized Write EXclusion (ROWEX). Both perform and scale very well while being much easier to implement than lock-free techniques.
{"title":"The ART of practical synchronization","authors":"Viktor Leis, F. Scheibner, A. Kemper, Thomas Neumann","doi":"10.1145/2933349.2933352","DOIUrl":"https://doi.org/10.1145/2933349.2933352","url":null,"abstract":"The performance of transactional database systems is critically dependent on the efficient synchronization of in-memory data structures. The traditional approach, fine-grained locking, does not scale on modern hardware. Lock-free data structures, in contrast, scale very well but are extremely difficult to implement and often require additional indirections. In this work, we argue for a middle ground, i.e., synchronization protocols that use locking, but only sparingly. We synchronize the Adaptive Radix Tree (ART) using two such protocols, Optimistic Lock Coupling and Read-Optimized Write EXclusion (ROWEX). Both perform and scale very well while being much easier to implement than lock-free techniques.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130801361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
More than a network: distributed OLTP on clusters of hardware islands
Danica Porobic, Pınar Tözün, Raja Appuswamy, A. Ailamaki
DOI: 10.1145/2933349.2933355
Multisocket multicores feature hardware islands: groups of cores that communicate fast among themselves and more slowly with other groups. With high-speed networking becoming a commodity, clusters of hardware islands connected by fast networks are becoming a preferred platform for high-end OLTP workloads. While the behavior of OLTP on multisockets is well understood, multi-machine OLTP deployments have been studied only in the geo-distributed context, where the network is much slower. In this paper, we analyze the behavior of different OLTP designs when deployed on clusters of multisockets with fast networks. We demonstrate that choosing the optimal deployment configuration within a multisocket node can improve performance by 2 to 4 times. A slow network can decrease throughput by 40% when communication cannot be overlapped with other processing, while having negligible impact when other overheads dominate. Finally, we identify opportunities for combining the best characteristics of scale-up and scale-out designs.
{"title":"More than a network: distributed OLTP on clusters of hardware islands","authors":"Danica Porobic, Pınar Tözün, Raja Appuswamy, A. Ailamaki","doi":"10.1145/2933349.2933355","DOIUrl":"https://doi.org/10.1145/2933349.2933355","url":null,"abstract":"Multisocket multicores feature hardware islands - groups of cores that communicate fast among themselves and slower with other groups. With high speed networking becoming a commodity, clusters of hardware islands with fast networks are becoming a preferred platform for high end OLTP workloads. While behavior of OLTP on multisockets is well understood, multi-machine OLTP deployments have been studied only in the geo-distributed context where network is much slower. In this paper, we analyze the behavior of different OLTP designs when deployed on clusters of multisockets with fast networks.\u0000 We demonstrate that choosing the optimal deployment configuration within a multisocket node can improve performance by 2 to 4 times. A slow network can decrease the throughput by 40% when communication cannot be overlapped with other processing, while having negligible impact when other overheads dominate. Finally, we identify opportunities for combining the best characteristics of scale-up and scale-out designs.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129570474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SIMD-accelerated regular expression matching
Evangelia A. Sitaridi, Orestis Polychroniou, K. A. Ross
DOI: 10.1145/2933349.2933357
String processing tasks are common in the analytical queries powering business intelligence. Besides substring matching, provided in SQL by the LIKE operator, popular DBMSs also support regular expressions as selective filters. Substring matching can be optimized with specialized SIMD instructions on mainstream CPUs, reaching the performance of numeric column scans. Generic regular expressions are harder to evaluate, however, since their cost depends on both the DFA size and the irregularity of the input. Here, we optimize matching string columns against regular expressions using SIMD-vectorized code. Our approach avoids processing the strings in lockstep, without introducing branches, to exploit cases where some strings are accepted or rejected early after looking at only their first few characters. On common string lengths, our implementation is up to 2X faster than scalar code on a mainstream CPU and up to 5X faster on the Xeon Phi co-processor, improving regular expression support in DBMSs.
{"title":"SIMD-accelerated regular expression matching","authors":"Evangelia A. Sitaridi, Orestis Polychroniou, K. A. Ross","doi":"10.1145/2933349.2933357","DOIUrl":"https://doi.org/10.1145/2933349.2933357","url":null,"abstract":"String processing tasks are common in analytical queries powering business intelligence. Besides substring matching, provided in SQL by the like operator, popular DBMSs also support regular expressions as selective filters. Substring matching can be optimized by using specialized SIMD instructions on mainstream CPUs, reaching the performance of numeric column scans. However, generic regular expressions are harder to evaluate, being dependent on both the DFA size and the irregularity of the input. Here, we optimize matching string columns against regular expressions using SIMD-vectorized code. Our approach avoids accessing the strings in lockstep without branching, to exploit cases when some strings are accepted or rejected early by looking at the first few characters. On common string lengths, our implementation is up to 2X faster than scalar code on a mainstream CPU and up to 5X faster on the Xeon Phi co-processor, improving regular expression support in DBMSs.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117028950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Customized OS support for data-processing
Jana Giceva, Gerd Zellweger, G. Alonso, Timothy Roscoe
DOI: 10.1145/2933349.2933351
For decades, database engines have found the generic interfaces offered by operating systems at odds with the need for efficient utilization of hardware resources. As a result, most engines circumvent the OS and manage hardware directly. With the growing complexity and heterogeneity of modern hardware, database engines now face a steep increase in the complexity they must absorb to achieve good performance. Taking advantage of recent proposals in operating system design, such as multikernels, we explore in this paper the development of a lightweight OS kernel tailored for data processing and discuss its benefits for simplifying the design and improving the performance of data management systems.
{"title":"Customized OS support for data-processing","authors":"Jana Giceva, Gerd Zellweger, G. Alonso, Timothy Roscoe","doi":"10.1145/2933349.2933351","DOIUrl":"https://doi.org/10.1145/2933349.2933351","url":null,"abstract":"For decades, database engines have found the generic interfaces offered by the operating systems at odds with the need for efficient utilization of hardware resources. As a result, most engines circumvent the OS and manage hardware directly. With the growing complexity and heterogeneity of modern hardware, database engines are now facing a steep increase in the complexity they must absorb to achieve good performance. Taking advantage of recent proposals in operating system design, such as multi-kernels, in this paper we explore the development of a light weight OS kernel tailored for data processing and discuss its benefits for simplifying the design and improving the performance of data management systems.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127108059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}