FPGA-accelerated group-by aggregation using synchronizing caches
Ildar Absalyamov, Prerna Budhkar, Skyler Windh, R. Halstead, W. Najjar, V. Tsotras
DOI: 10.1145/2933349.2933360
Recent trends in hardware have dramatically dropped the price of RAM and shifted the focus from systems operating on disk-resident data to in-memory solutions. In this environment, high memory access latency, also known as the memory wall, becomes the biggest data processing bottleneck. Traditional CPU-based architectures address this problem with large cache hierarchies, but algorithms with poor locality limit the benefits of caching. Hardware multithreading, in turn, provides a generic solution that does not rely on algorithm-specific locality properties. In this paper we present an FPGA-accelerated implementation of in-memory group-by hash aggregation. Our design relies on hardware multithreading to efficiently mask long memory access latency by implementing a custom operation datapath on the FPGA. We propose using CAMs (Content Addressable Memories) as a mechanism for synchronization and local pre-aggregation; to the best of our knowledge, this is the first work to use CAMs as a synchronizing cache. We evaluate aggregation throughput against state-of-the-art multithreaded software implementations and demonstrate that the FPGA-accelerated approach significantly outperforms them on large grouping-key cardinalities, yielding speedups of up to 10x.
{"title":"FPGA-accelerated group-by aggregation using synchronizing caches","authors":"Ildar Absalyamov, Prerna Budhkar, Skyler Windh, R. Halstead, W. Najjar, V. Tsotras","doi":"10.1145/2933349.2933360","DOIUrl":"https://doi.org/10.1145/2933349.2933360","url":null,"abstract":"Recent trends in hardware have dramatically dropped the price of RAM and shifted focus from systems operating on disk-resident data to in-memory solutions. In this environment high memory access latency, also known as memory wall, becomes the biggest data processing bottleneck. Traditional CPU-based architectures solved this problem by introducing large cache hierarchies. However algorithms which experience poor locality can limit the benefits of caching. In turn, hardware multithreading provides a generic solution that does not rely on algorithm-specific locality properties.\u0000 In this paper we present an FPGA-accelerated implementation of in-memory group-by hash aggregation. Our design relies on hardware multithreading to efficiently mask long memory access latency by implementing a custom operation datapath on FPGA. We propose using CAMs (Content Addressable Memories) as a mechanism of synchronization and local pre-aggregation. To the best of our knowledge this is the first work, which uses CAMs as a synchronizing cache. We evaluate aggregation throughput against the state-of-the-art multithreaded software implementations and demonstrate that the FPGA-accelerated approach significantly outperforms them on large grouping key cardinalities and yields speedup up to 10x.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126998782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
OLTP on a server-grade ARM: power, throughput and latency comparison
Utku Sirin, Raja Appuswamy, A. Ailamaki
DOI: 10.1145/2933349.2933359

Although scaling out over low-power cores is an alternative to power-hungry Intel Xeon processors for reducing power overheads, such cores have proven inadequate for complex, non-parallelizable workloads. With the introduction of the 64-bit ARMv8 architecture, however, traditionally low-power ARM processors have become powerful enough to run computationally intensive server-class applications. In this study, we compare a high-performance Intel x86 processor with a commercial implementation of the ARM Cortex-A57, measuring power consumption, throughput, and latency when running OLTP workloads. Our results show that the ARM processor consumes 3 to 15 times less power than the x86, while penalizing OLTP throughput by a much lower factor (1.7 to 3). As a result, the significant power savings deliver up to 9 times higher energy efficiency. The x86's heavily optimized, power-hungry micro-architectural structures contribute to throughput only marginally; consequently, the x86 wastes power when utilization is low, while the lightweight ARM processor draws power only in proportion to its utilization, achieving energy proportionality. On the other hand, ARM's latency can be up to 11x higher than the x86's toward the tail of the latency distribution, making the x86 more suitable for certain types of service-level agreements.
{"title":"OLTP on a server-grade ARM: power, throughput and latency comparison","authors":"Utku Sirin, Raja Appuswamy, A. Ailamaki","doi":"10.1145/2933349.2933359","DOIUrl":"https://doi.org/10.1145/2933349.2933359","url":null,"abstract":"Although scaling out of low-power cores is an alternative to power-hungry Intel Xeon processors for reducing the power overheads, they have proven inadequate for complex, non-parallelizable workloads. On the other hand, by the introduction of the 64-bit ARMv8 architecture, traditionally low power ARM processors have become powerful enough to run computationally intensive server-class applications.\u0000 In this study, we compare a high-performance Intel x86 processor with a commercial implementation of the ARM Cortex-A57. We measure the power used, throughput delivered and latency quantified when running OLTP workloads. Our results show that the ARM processor consumes 3 to 15 times less power than the x86, while penalizing OLTP throughput by a much lower factor (1.7 to 3). As a result, the significant power savings deliver up to 9 times higher energy efficiency. The x86's heavily optimized power-hungry micro-architectural structures contribute to throughput only marginally. As a result, the x86 wastes power when utilization is low, while lightweight ARM processor consumes only as much power as it is utilized, achieving energy proportionality. On the other hand, ARM's quantified latency can be up to 11x higher than x86 towards to the tail of latency distribution, making x86 more suitable for certain type of service-level agreements.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132500934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SSD in-storage computing for list intersection
Jianguo Wang, Dongchul Park, Yang-Suk Kee, Y. Papakonstantinou, S. Swanson
DOI: 10.1145/2933349.2933353
Recently, there has been renewed interest in in-storage computing in the context of solid state drives (SSDs), called "Smart SSDs." Smart SSDs allow application-specific code to execute inside the SSD, letting applications take advantage of the high internal bandwidth these devices provide. This work studies the offloading of list intersection to Smart SSDs, because intersection is prominent in both search engines and analytics queries. Intersection is also interesting because its algorithms are more complex than plain scans: they are affected by multiple parameters, as we show, and offer lessons that carry over to other operations. We investigate whether Smart SSDs can accelerate list intersection and reduce energy consumption. Intuitively, the answer is yes; however, the performance tradeoffs on real devices are complex. We implement list intersection on a real Samsung Smart SSD research prototype and provide an analytical model that captures the key factors in overall performance and identifies when list intersection can benefit from Smart SSDs. Finally, we conduct experiments on the Samsung Smart SSD. Based on the analytical and experimental results, we offer suggestions both for SSD vendors on how to build more capable Smart SSDs and for applications on how to make full use of the functionality that Smart SSDs provide.
{"title":"SSD in-storage computing for list intersection","authors":"Jianguo Wang, Dongchul Park, Yang-Suk Kee, Y. Papakonstantinou, S. Swanson","doi":"10.1145/2933349.2933353","DOIUrl":"https://doi.org/10.1145/2933349.2933353","url":null,"abstract":"Recently, there has been a renewed interest of in-storage computing in the context of solid state drives (SSDs), called \"Smart SSDs.\" Smart SSDs allow application-specific code to execute inside SSDs. This allows applications to take advantage of the high internal bandwidth that Smart SSDs provide. This work studies the offloading of list intersection into Smart SSDs, because intersection is prominent in both search engines and analytics queries. Furthermore, intersection is interesting because the algorithms are more complex than plain scans; they are affected by multiple parameters, as we show, and provide lessons that can be used in other operations also.\u0000 We are interested to know whether Smart SSDs can accelerate the processing of list intersection and reduce the consumed energy. Intuitively, the answer is yes. However, the performance tradeoffs on real devices are complex. We implement list intersection into a real Samsung Smart SSD research prototype. We also provide an analytical model to understand the key factors to the overall performance, and when list intersection can benefit from Smart SSDs. Finally, we conduct experiments on the Samsung Smart SSD. Based on the results (both analytical and experimental), we provide many suggestions for both SSD vendors on how to manufacture powerful Smart SSDs and for applications on how to make full use of the functionalities that Smart SSDs provide.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"453 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133847822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Larger-than-memory data management on modern storage hardware for in-memory OLTP database systems
Lin Ma, Joy Arulraj, Sam Zhao, Andrew Pavlo, Subramanya R. Dulloor, Michael J. Giardino, Jeff Parkhurst, J. L. Gardner, K. Doshi, S. Zdonik
DOI: 10.1145/2933349.2933358
In-memory database management systems (DBMSs) outperform disk-oriented systems for on-line transaction processing (OLTP) workloads, but this improved performance is only achievable when the database is smaller than the amount of physical memory available in the system. To overcome this limitation, some in-memory DBMSs can move cold data out of volatile DRAM to secondary storage; such data then appears as if it resides in memory with the rest of the database even though it does not. Although several implementations of this type of cold-data storage have been proposed, there has not been a thorough evaluation of the design decisions involved, such as policies for when to evict tuples and how to bring them back when they are needed. These choices are further complicated by the varying performance characteristics of different storage devices, including future non-volatile memory technologies. We explore these issues in this paper and discuss several approaches to solve them. We implemented all of these approaches in an in-memory DBMS and evaluated them using five different storage technologies. Our results show that choosing the best strategy based on the hardware improves throughput by 92-340% over a generic configuration.
{"title":"Larger-than-memory data management on modern storage hardware for in-memory OLTP database systems","authors":"Lin Ma, Joy Arulraj, Sam Zhao, Andrew Pavlo, Subramanya R. Dulloor, Michael J. Giardino, Jeff Parkhurst, J. L. Gardner, K. Doshi, S. Zdonik","doi":"10.1145/2933349.2933358","DOIUrl":"https://doi.org/10.1145/2933349.2933358","url":null,"abstract":"In-memory database management systems (DBMSs) outperform disk-oriented systems for on-line transaction processing (OLTP) workloads. But this improved performance is only achievable when the database is smaller than the amount of physical memory available in the system. To overcome this limitation, some in-memory DBMSs can move cold data out of volatile DRAM to secondary storage. Such data appears as if it resides in memory with the rest of the database even though it does not.\u0000 Although there have been several implementations proposed for this type of cold data storage, there has not been a thorough evaluation of the design decisions in implementing this technique, such as policies for when to evict tuples and how to bring them back when they are needed. These choices are further complicated by the varying performance characteristics of different storage devices, including future non-volatile memory technologies. We explore these issues in this paper and discuss several approaches to solve them. We implemented all of these approaches in an in-memory DBMS and evaluated them using five different storage technologies. Our results show that choosing the best strategy based on the hardware improves throughput by 92-340% over a generic configuration.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127180709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On testing persistent-memory-based software
Ismail Oukid, Daniel Booss, Adrien Lespinasse, Wolfgang Lehner
DOI: 10.1145/2933349.2933354
Leveraging Storage Class Memory (SCM) as universal memory, i.e., as memory and storage at the same time, has deep implications for database architectures: it becomes possible to store a single copy of the data in SCM and operate on it directly at a fine granularity. However, exposing the whole database to the application through direct access dramatically increases the risk of data corruption. In this paper we propose a lightweight on-line testing framework that helps find and debug SCM-related errors that can occur upon software or power failures. Our testing framework simulates failures in critical code paths and achieves fast code coverage by leveraging call-stack information to limit duplicate testing. It also partially covers the errors that might arise as a result of reordered memory operations. We show through an experimental evaluation that our testing framework is fast enough to be used with large software systems, and we discuss its use during the development of our in-house persistent SCM allocator.
{"title":"On testing persistent-memory-based software","authors":"Ismail Oukid, Daniel Booss, Adrien Lespinasse, Wolfgang Lehner","doi":"10.1145/2933349.2933354","DOIUrl":"https://doi.org/10.1145/2933349.2933354","url":null,"abstract":"Leveraging Storage Class Memory (SCM) as a universal memory--i.e. as memory and storage at the same time--has deep implications on database architectures. It becomes possible to store a single copy of the data in SCM and directly operate on it at a fine granularity. However, exposing the whole database with direct access to the application dramatically increases the risk of data corruption. In this paper we propose a lightweight on-line testing framework that helps find and debug SCM-related errors that can occur upon software or power failures. Our testing framework simulates failures in critical code paths and achieves fast code coverage by leveraging call stack information to limit duplicate testing. It also partially covers the errors that might arise as a result of reordered memory operations. We show through an experimental evaluation that our testing framework is fast enough to be used with large software systems and discuss its use during the development of our in-house persistent SCM allocator.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116076585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In memory processing of massive point clouds for multi-core systems
K. Kyzirakos, F. Alvanaki, M. Kersten
DOI: 10.1145/2933349.2933356

LIDAR is a popular remote sensing method used to examine the surface of the Earth. LIDAR instruments use light in the form of a pulsed laser to measure ranges (variable distances), generating vast amounts of precise three-dimensional point data that describe the shape of the Earth. Processing large collections of point cloud data and combining them with auxiliary GIS data remains an open research problem. Past research in geographic information systems focused on handling large collections of complex geometric objects stored on disk, and most algorithms have been designed and studied in a single-threaded setting even though multi-core systems are well established. In this paper, we describe parallel alternatives to known algorithms for evaluating spatial selections over point clouds and spatial joins between point clouds and rectangle collections.
{"title":"In memory processing of massive point clouds for multi-core systems","authors":"K. Kyzirakos, F. Alvanaki, M. Kersten","doi":"10.1145/2933349.2933356","DOIUrl":"https://doi.org/10.1145/2933349.2933356","url":null,"abstract":"LIDAR is a popular remote sensing method used to examine the surface of the Earth. LIDAR instruments use light in the form of a pulsed laser to measure ranges (variable distances) and generate vast amounts of precise three dimensional point data describing the shape of the Earth. Processing large collections of point cloud data and combining them with auxiliary GIS data remain an open research problem.\u0000 Past research in the area of geographic information systems focused on handling large collections of complex geometric objects stored on disk and most algorithms have been designed and studied in a single-thread setting even though multi-core systems are well established. In this paper, we describe parallel alternatives of known algorithms for evaluating spatial selections over point clouds and spatial joins between point clouds and rectangle collections.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128298234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The ART of practical synchronization
Viktor Leis, F. Scheibner, A. Kemper, Thomas Neumann
DOI: 10.1145/2933349.2933352
The performance of transactional database systems is critically dependent on the efficient synchronization of in-memory data structures. The traditional approach, fine-grained locking, does not scale on modern hardware. Lock-free data structures, in contrast, scale very well but are extremely difficult to implement and often require additional indirections. In this work, we argue for a middle ground, i.e., synchronization protocols that use locking, but only sparingly. We synchronize the Adaptive Radix Tree (ART) using two such protocols, Optimistic Lock Coupling and Read-Optimized Write EXclusion (ROWEX). Both perform and scale very well while being much easier to implement than lock-free techniques.
{"title":"The ART of practical synchronization","authors":"Viktor Leis, F. Scheibner, A. Kemper, Thomas Neumann","doi":"10.1145/2933349.2933352","DOIUrl":"https://doi.org/10.1145/2933349.2933352","url":null,"abstract":"The performance of transactional database systems is critically dependent on the efficient synchronization of in-memory data structures. The traditional approach, fine-grained locking, does not scale on modern hardware. Lock-free data structures, in contrast, scale very well but are extremely difficult to implement and often require additional indirections. In this work, we argue for a middle ground, i.e., synchronization protocols that use locking, but only sparingly. We synchronize the Adaptive Radix Tree (ART) using two such protocols, Optimistic Lock Coupling and Read-Optimized Write EXclusion (ROWEX). Both perform and scale very well while being much easier to implement than lock-free techniques.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130801361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
More than a network: distributed OLTP on clusters of hardware islands
Danica Porobic, Pınar Tözün, Raja Appuswamy, A. Ailamaki
DOI: 10.1145/2933349.2933355
Multisocket multicores feature hardware islands: groups of cores that communicate fast among themselves and more slowly with other groups. With high-speed networking becoming a commodity, clusters of hardware islands connected by fast networks are becoming a preferred platform for high-end OLTP workloads. While the behavior of OLTP on multisockets is well understood, multi-machine OLTP deployments have been studied only in the geo-distributed context, where the network is much slower. In this paper, we analyze the behavior of different OLTP designs when deployed on clusters of multisockets with fast networks. We demonstrate that choosing the optimal deployment configuration within a multisocket node can improve performance by 2 to 4 times. A slow network can decrease throughput by 40% when communication cannot be overlapped with other processing, while having negligible impact when other overheads dominate. Finally, we identify opportunities for combining the best characteristics of scale-up and scale-out designs.
{"title":"More than a network: distributed OLTP on clusters of hardware islands","authors":"Danica Porobic, Pınar Tözün, Raja Appuswamy, A. Ailamaki","doi":"10.1145/2933349.2933355","DOIUrl":"https://doi.org/10.1145/2933349.2933355","url":null,"abstract":"Multisocket multicores feature hardware islands - groups of cores that communicate fast among themselves and slower with other groups. With high speed networking becoming a commodity, clusters of hardware islands with fast networks are becoming a preferred platform for high end OLTP workloads. While behavior of OLTP on multisockets is well understood, multi-machine OLTP deployments have been studied only in the geo-distributed context where network is much slower. In this paper, we analyze the behavior of different OLTP designs when deployed on clusters of multisockets with fast networks.\u0000 We demonstrate that choosing the optimal deployment configuration within a multisocket node can improve performance by 2 to 4 times. A slow network can decrease the throughput by 40% when communication cannot be overlapped with other processing, while having negligible impact when other overheads dominate. Finally, we identify opportunities for combining the best characteristics of scale-up and scale-out designs.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129570474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SIMD-accelerated regular expression matching
Evangelia A. Sitaridi, Orestis Polychroniou, K. A. Ross
DOI: 10.1145/2933349.2933357
String processing tasks are common in the analytical queries powering business intelligence. Besides substring matching, provided in SQL by the LIKE operator, popular DBMSs also support regular expressions as selective filters. Substring matching can be optimized with specialized SIMD instructions on mainstream CPUs, reaching the performance of numeric column scans. Generic regular expressions are harder to evaluate, however, since their cost depends on both the DFA size and the irregularity of the input. Here, we optimize matching string columns against regular expressions using SIMD-vectorized code. Our approach avoids processing the strings in lockstep, without introducing branches, to exploit cases where some strings are accepted or rejected early after looking at only their first few characters. On common string lengths, our implementation is up to 2X faster than scalar code on a mainstream CPU and up to 5X faster on the Xeon Phi co-processor, improving regular expression support in DBMSs.
{"title":"SIMD-accelerated regular expression matching","authors":"Evangelia A. Sitaridi, Orestis Polychroniou, K. A. Ross","doi":"10.1145/2933349.2933357","DOIUrl":"https://doi.org/10.1145/2933349.2933357","url":null,"abstract":"String processing tasks are common in analytical queries powering business intelligence. Besides substring matching, provided in SQL by the like operator, popular DBMSs also support regular expressions as selective filters. Substring matching can be optimized by using specialized SIMD instructions on mainstream CPUs, reaching the performance of numeric column scans. However, generic regular expressions are harder to evaluate, being dependent on both the DFA size and the irregularity of the input. Here, we optimize matching string columns against regular expressions using SIMD-vectorized code. Our approach avoids accessing the strings in lockstep without branching, to exploit cases when some strings are accepted or rejected early by looking at the first few characters. On common string lengths, our implementation is up to 2X faster than scalar code on a mainstream CPU and up to 5X faster on the Xeon Phi co-processor, improving regular expression support in DBMSs.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117028950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Customized OS support for data-processing
Jana Giceva, Gerd Zellweger, G. Alonso, Timothy Roscoe
DOI: 10.1145/2933349.2933351
For decades, database engines have found the generic interfaces offered by operating systems at odds with the need for efficient utilization of hardware resources. As a result, most engines circumvent the OS and manage hardware directly. With the growing complexity and heterogeneity of modern hardware, database engines now face a steep increase in the complexity they must absorb to achieve good performance. Taking advantage of recent proposals in operating system design, such as multikernels, we explore in this paper the development of a lightweight OS kernel tailored for data processing and discuss its benefits for simplifying the design and improving the performance of data management systems.
{"title":"Customized OS support for data-processing","authors":"Jana Giceva, Gerd Zellweger, G. Alonso, Timothy Roscoe","doi":"10.1145/2933349.2933351","DOIUrl":"https://doi.org/10.1145/2933349.2933351","url":null,"abstract":"For decades, database engines have found the generic interfaces offered by the operating systems at odds with the need for efficient utilization of hardware resources. As a result, most engines circumvent the OS and manage hardware directly. With the growing complexity and heterogeneity of modern hardware, database engines are now facing a steep increase in the complexity they must absorb to achieve good performance. Taking advantage of recent proposals in operating system design, such as multi-kernels, in this paper we explore the development of a light weight OS kernel tailored for data processing and discuss its benefits for simplifying the design and improving the performance of data management systems.","PeriodicalId":298901,"journal":{"name":"International Workshop on Data Management on New Hardware","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127108059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}