
Proceedings of the 11th International Workshop on Data Management on New Hardware: Latest Publications

Toward GPUs being mainstream in analytic processing: An initial argument using simple scan-aggregate queries
Jason Power, Yinan Li, M. Hill, J. Patel, D. Wood
There have been a number of research proposals to use discrete graphics processing units (GPUs) to accelerate database operations. Although many of these works show up to an order of magnitude performance improvement, discrete GPUs are not commonly used in modern database systems. However, there is now a proliferation of integrated GPUs which are on the same silicon die as the conventional CPU. With the advent of new programming models like heterogeneous system architecture, these integrated GPUs are considered first-class compute units, with transparent access to CPU virtual addresses and very low overhead for computation offloading. We show that integrated GPUs significantly reduce the overheads of using GPUs in a database environment. Specifically, an integrated GPU is 3x faster than a discrete GPU even though the discrete GPU has 4x the computational capability. Therefore, we develop high-performance scan and aggregate algorithms for the integrated GPU. We show that the integrated GPU can outperform a four-core CPU with SIMD extensions by an average of 30% (up to 3.2x) and provides an average of 45% reduction in energy on 16 TPC-H queries.
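The scan-aggregate pattern this paper accelerates can be illustrated with a toy columnar example (a minimal sketch in plain Python; the column names and data are invented, and the paper's actual GPU kernels are not reproduced here):

```python
# Toy columnar scan-aggregate: SELECT SUM(price) WHERE qty > threshold.
# This shows only the access pattern; the paper offloads it to an
# integrated GPU that shares the CPU's virtual address space.

def scan_aggregate(price, qty, threshold):
    """Scan the qty column and aggregate matching entries of price."""
    total = 0
    for p, q in zip(price, qty):
        if q > threshold:   # scan: evaluate the predicate per row
            total += p      # aggregate: running sum over qualifiers
    return total

price = [10, 20, 30, 40]
qty = [1, 5, 2, 7]
result = scan_aggregate(price, qty, 2)  # rows with qty 5 and 7 qualify
```

On an integrated GPU, the two columns need not be copied to a separate device memory before the scan, which is the overhead reduction the abstract highlights.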
DOI: 10.1145/2771937.2771941 · Published 2015-05-31
Citations: 25
Applying HTM to an OLTP System: No Free Lunch
David Cervini, Danica Porobic, Pınar Tözün, A. Ailamaki
Transactional memory is a promising way for implementing efficient synchronization mechanisms for multicore processors. Intel's introduction of hardware transactional memory (HTM) into their Haswell line of processors marks an important step toward mainstream availability of transactional memory. Transaction processing systems require execution of dozens of critical sections to insure isolation among threads, which makes them one of the target applications for exploiting HTM. In this study, we quantify the opportunities and limitations of directly applying HTM to an existing OLTP system that uses fine-grained synchronization. Our target is Shore-MT, a modern multithreaded transactional storage manager that uses a variety of fine-grained synchronization mechanisms to provide scalability on multicore processors. We find that HTM can improve performance of the TATP workload by 13--17% when applied judiciously. However, attempting to replace all synchronization reduces performance compared to the baseline case due to high percentage of aborts caused by the limitations of the current HTM implementation.
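The "try the hardware transaction, retry, then fall back to a lock" control flow that such HTM deployments rely on can be sketched as follows (a simulation only: real HTM aborts happen in hardware, e.g. via Intel TSX, so the commit/abort outcome is modeled here by a callable):

```python
import threading

# Sketch of the HTM-with-lock-fallback pattern. `tx_attempt` stands in for
# a hardware transaction: it returns True if the transaction committed and
# False if it aborted. All names here are illustrative.

fallback_lock = threading.Lock()

def run_critical_section(body, tx_attempt, max_retries=3):
    """Try the (simulated) transaction a few times, then take the lock."""
    for _ in range(max_retries):
        if tx_attempt():            # simulated HTM commit
            return body(), "htm"
    with fallback_lock:             # pessimistic fallback path
        return body(), "lock"

# An always-aborting transaction forces the fallback path.
value, path = run_critical_section(lambda: 42, lambda: False)
```

The paper's finding is essentially about how often the second branch is taken: when aborts dominate, the fallback lock plus wasted retries cost more than fine-grained locking alone.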
DOI: 10.1145/2771937.2771946 · Published 2015-05-31
Citations: 5
Energy-Efficient In-Memory Data Stores on Hybrid Memory Hierarchies
Ahmad Hassan, H. Vandierendonck, Dimitrios S. Nikolopoulos
Increasingly large amounts of data are stored in the main memory of data center servers. However, DRAM-based memory is an important consumer of energy and is unlikely to scale in the future. Various byte-addressable non-volatile memory (NVM) technologies promise high density and near-zero static energy; however, they suffer from increased latency and increased dynamic energy consumption. This paper proposes to leverage a hybrid memory architecture, consisting of both DRAM and NVM, through novel application-level data management policies that decide whether to place data on DRAM or NVM. We analyze modern column-oriented and key-value data stores and demonstrate the feasibility of application-level data management. Cycle-accurate simulation confirms that our methodology reduces energy with the least performance degradation compared to current state-of-the-art hardware or OS approaches. Moreover, we use our techniques to apportion DRAM and NVM memory sizes for these workloads.
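An application-level placement policy of the kind the paper argues for can be sketched as a simple hot/cold split (assumed object names, access counts, and capacity; the paper's actual policies are workload-specific and more sophisticated):

```python
# Sketch of DRAM-vs-NVM placement: greedily keep the hottest objects in
# DRAM (fast, energy-hungry) and spill the rest to NVM (dense, near-zero
# static energy). `access_counts` maps object name -> access frequency.

def place_objects(access_counts, dram_capacity):
    """Fill DRAM with the most frequently accessed objects; rest go to NVM."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    dram = set(ranked[:dram_capacity])
    nvm = set(ranked[dram_capacity:])
    return dram, nvm

counts = {"index": 900, "log": 10, "heap": 500, "cache": 700}
dram, nvm = place_objects(counts, dram_capacity=2)
```

The interesting design question, which the abstract's sizing study addresses, is how large `dram_capacity` needs to be before the energy savings of NVM outweigh its latency and dynamic-energy penalties.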
DOI: 10.1145/2771937.2771940 · Published 2015-05-31
Citations: 12
NUMA obliviousness through memory mapping
M. Gawade, M. Kersten
With the rise of multi-socket multi-core CPUs, a lot of effort is being put into how best to exploit their abundant CPU power. In a shared-memory setting, each socket is equipped with its own memory module, and accesses to memory modules across sockets follow a non-uniform memory access (NUMA) pattern. Memory access across sockets is relatively expensive compared to memory access within a socket. One common solution to minimize cross-socket memory access is to partition the data such that data affinity is maintained per socket. In this paper we explore the role of memory-mapped storage in providing transparent data access in a NUMA environment, without the need for explicit data partitioning. We compare the performance of a database engine in a distributed setting in a multi-socket environment with a database engine in a NUMA-oblivious setting. We show that although the operating system tries to keep data affine to local sockets, significant remote memory access still occurs as the number of threads increases. Hence, setting explicit process and memory affinity results in robust execution of NUMA-oblivious plans. We use micro-experiments and SQL queries from the TPC-H benchmark to provide an in-depth experimental exploration of the landscape on a four-socket Intel machine.
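The explicit-partitioning baseline the paper compares against can be sketched like this (hypothetical key set and socket count; real engines use more careful partitioning functions and pin worker threads to the matching socket):

```python
# Sketch of per-socket data partitioning: hash each key to a socket so a
# worker thread pinned to that socket touches mostly socket-local memory.

def hash_to_socket(key, num_sockets):
    """Trivial modulo hash; real systems use better hash functions."""
    return key % num_sockets

def partition_by_socket(keys, num_sockets):
    """Assign each key to one socket-local partition."""
    partitions = [[] for _ in range(num_sockets)]
    for k in keys:
        partitions[hash_to_socket(k, num_sockets)].append(k)
    return partitions

parts = partition_by_socket(range(8), num_sockets=4)
```

The paper's point is that memory mapping can make this explicit step unnecessary for transparency, while explicit process and memory affinity still pays off in execution robustness.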
DOI: 10.1145/2771937.2771948 · Published 2015-05-31
Citations: 8
The Serial Safety Net: Efficient Concurrency Control on Modern Hardware
Tianzheng Wang, Ryan Johnson, A. Fekete, I. Pandis
Concurrency control (CC) algorithms must trade off strictness for performance, with serializable schemes generally paying high cost---both in runtime overhead such as contention on lock tables, and in wasted efforts by aborting transactions---to prevent anomalies. We propose the serial safety net (SSN), a serializability-enforcing certifier for modern hardware with substantial core count and large main memory. SSN can be applied with minimal overhead on top of various CC schemes that offer higher performance but admit anomalies, e.g., snapshot isolation and read committed. We demonstrate the efficiency, accuracy and robustness of SSN using a memory-optimized OLTP engine with different CC schemes. We find that SSN is a promising approach to serializability with low abort rates and robust performance for various workloads.
DOI: 10.1145/2771937.2771949 · Published 2015-05-31
Citations: 15
TLB misses: The Missing Issue of Adaptive Radix Tree?
Petrie Wong, Ziqiang Feng, Wenjian Xu, Eric Lo, B. Kao
Efficient main-memory index structures are crucial to main-memory database systems. The Adaptive Radix Tree (ART) is the most recent in-memory index structure. ART is designed to avoid cache misses, leverage SIMD data parallelism, minimize branch mispredictions, and keep a small memory footprint. When an in-memory index structure like ART incurs significantly fewer cache misses and branch mispredictions, it is natural to ask whether misses in the Translation Lookaside Buffer (TLB) matter. In this paper, we investigate whether this is the case and, if the answer is positive, what measures we can take to alleviate TLB misses and how effective those measures are.
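The byte-wise traversal that makes a radix tree sensitive to TLB misses can be sketched with a plain dict-based trie (illustrative only: ART's adaptive node types, path compression, and SIMD intra-node search are deliberately omitted):

```python
# Minimal byte-wise radix-tree insert/lookup. Each key byte selects one
# child node, so a lookup performs one pointer chase per byte -- and each
# chase may miss both in the cache and in the TLB if nodes are scattered
# across many memory pages.

def radix_insert(root, key_bytes, value):
    node = root
    for b in key_bytes:
        node = node.setdefault(b, {})  # descend one level per key byte
    node["value"] = value

def radix_lookup(root, key_bytes):
    node = root
    for b in key_bytes:
        node = node.get(b)
        if node is None:
            return None                # key not present
    return node.get("value")

tree = {}
radix_insert(tree, b"abc", 1)
```

Because every level is a separate heap allocation here, the sketch also illustrates why node placement (e.g. on huge pages) is the natural lever for reducing TLB misses.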
DOI: 10.1145/2771937.2771942 · Published 2015-05-31
Citations: 5
Energy-Efficient Query Processing on Embedded CPU-GPU Architectures
Xuntao Cheng, Bingsheng He, C. Lau
Energy efficiency is a major design and optimization factor for query co-processing of databases in embedded devices. Recently, the GPUs of new-generation embedded devices have gained the programmability and computational capability required for general-purpose applications. Such CPU-GPU architectures offer us opportunities to revisit GPU query co-processing in embedded environments for energy efficiency. In this paper, we experimentally evaluate and analyze the performance and energy consumption of a GPU query co-processor on such hybrid embedded architectures. Specifically, we study four major database operators as micro-benchmarks and evaluate TPC-H queries on CARMA, which has a quad-core ARM Cortex-A9 CPU and a NVIDIA Quadro 1000M GPU. We observe that the CPU delivers both better performance and lower energy consumption than the GPU for simple operators such as selection and aggregation. However, the GPU outperforms the CPU for sort and hash join in terms of both performance and energy consumption. We further show that CPU-GPU query co-processing can be an effective means of energy-efficient query co-processing in embedded systems with proper tuning and optimizations.
DOI: 10.1145/2771937.2771939 · Published 2015-05-31
Citations: 13
Ultra-Fast Similarity Search Using Ternary Content Addressable Memory
A. Bremler-Barr, Yotam Harchol, David Hay, Y. Hel-Or
Similarity search, and specifically the nearest-neighbor search (NN) problem, is widely used in many fields of computer science such as machine learning, computer vision and databases. However, in many settings such searches are known to suffer from the notorious curse of dimensionality, where running time grows exponentially with the dimensionality d. This causes severe performance degradation when working in high-dimensional spaces. Approximate techniques such as locality-sensitive hashing [2] improve the performance of the search, but are still computationally intensive. In this paper we propose a new way to solve this problem using a special hardware device called ternary content addressable memory (TCAM). TCAM is an associative memory, a special type of computer memory that is widely used in switches and routers for very high-speed search applications. We show that the TCAM computational model can be leveraged and adjusted to solve NN search problems in a single TCAM lookup cycle, and with linear space. This concept does not suffer from the curse of dimensionality and is shown to improve on the best known approaches for NN by more than four orders of magnitude. Simulation results demonstrate dramatic improvement over the best known approaches for NN, and suggest that TCAM devices may play a critical role in future large-scale databases and cloud applications.
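The TCAM primitive the paper builds on is ternary matching: every stored word is a pattern of 0, 1, and "don't care" bits, and all patterns are compared against a query in a single hardware cycle. A software sketch of that matching semantics (the parallelism is replaced by a loop; patterns and query are invented):

```python
# Software model of a TCAM lookup: each stored pattern is a string over
# {'0', '1', '*'}, where '*' matches either bit. Hardware compares all
# patterns simultaneously; here we loop to show the semantics.

def tcam_match(patterns, query):
    """Return the indices of all stored patterns matching the query."""
    hits = []
    for i, pat in enumerate(patterns):
        if all(p in ("*", q) for p, q in zip(pat, query)):
            hits.append(i)
    return hits

patterns = ["10*1", "1**1", "0000"]
hits = tcam_match(patterns, "1011")
```

The paper's contribution is an encoding of high-dimensional points into such ternary words so that a single lookup returns an approximate nearest neighbor, sidestepping the curse of dimensionality.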
DOI: 10.1145/2771937.2771938 · Published 2015-05-31
Citations: 14
Scaling the Memory Power Wall With DRAM-Aware Data Management
Raja Appuswamy, Matthaios Olma, A. Ailamaki
Improving the energy efficiency of database systems has emerged as an important topic of research over the past few years. While significant attention has been paid to optimizing the power consumption of traditional disk-based databases, little attention has been paid to the growing cost of DRAM power consumption in main-memory databases (MMDB). In this paper, we bridge this divide by examining power--performance tradeoffs involved in designing MMDBs. In doing so, we first show how DRAM will soon emerge as the dominating source of power consumption in emerging MMDB servers, unlike traditional database servers, where CPU power consumption overshadows that of DRAM. Second, we show that using DRAM frequency scaling and power-down modes can provide substantial improvement in performance/Watt under both transactional and analytical workloads. This again contradicts rules of thumb established for traditional servers, where the most energy-efficient configuration is often the one with highest performance. Based on our observations, we argue that the long-overlooked task of optimizing DRAM power consumption should henceforth be considered a first-class citizen in designing MMDBs. In doing so, we highlight several promising research directions and identify key design challenges that must be overcome towards achieving this goal.
DOI: 10.1145/2771937.2771947 · Published 2015-05-31
引用次数: 25
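The tradeoff this abstract describes (scaling DRAM frequency down to trade peak bandwidth for lower power) can be sketched with a toy performance/Watt model. All constants below are illustrative assumptions for exposition, not measurements from the paper:

```python
# Toy model of DRAM frequency scaling and performance/Watt.
# All numbers are illustrative assumptions, not results from the paper.

def perf_per_watt(freq_ghz, cpu_watts=40.0):
    """Estimated throughput per Watt at a given DRAM frequency."""
    # Assume DRAM power grows roughly linearly with frequency, while
    # throughput saturates once the workload becomes compute-bound.
    dram_watts = 10.0 * freq_ghz               # linear power model
    throughput = min(freq_ghz * 80.0, 100.0)   # queries/s, caps at 100
    return throughput / (cpu_watts + dram_watts)

for f in (0.8, 1.2, 1.6, 2.0):
    print(f"{f:.1f} GHz: {perf_per_watt(f):.2f} queries/s per Watt")
```

Under these assumptions the best performance/Watt is reached at an intermediate frequency, not the highest one, which is the point the abstract makes about traditional rules of thumb no longer holding.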
Beyond the Wall: Near-Data Processing for Databases 墙外:数据库的近数据处理
S. Xi, Oreoluwatomiwa O. Babarinsa, Manos Athanassoulis, Stratos Idreos
The continuous growth of main memory size allows modern data systems to process entire large-scale datasets in memory. The increase in memory capacity, however, is not matched by a proportional decrease in memory latency, causing a mismatch for in-memory processing. As a result, data movement through the memory hierarchy is now one of the main performance bottlenecks for main memory data systems. Database systems researchers have proposed several innovative solutions to minimize data movement and to make data access patterns hardware-aware. Nevertheless, all relevant rows and columns for a given query have to be moved through the memory hierarchy; hence, movement of large data sets is on the critical path. In this paper, we present JAFAR, a Near-Data Processing (NDP) accelerator for pushing selects down to memory in modern column-stores. JAFAR implements the select operator and allows only qualifying data to travel up the memory hierarchy. Through a detailed simulation of JAFAR hardware we show that it has the potential to provide a 9x improvement for selects in column-stores. In addition, we discuss both hardware and software challenges for using NDP in database systems as well as opportunities for further NDP accelerators to boost additional relational operators.
主存储器大小的持续增长使现代数据系统能够在内存中处理整个大规模数据集。然而,内存容量的增加并没有与内存延迟的相应减少相匹配,从而导致内存中处理的不匹配。因此,通过内存层次结构的数据移动现在是主内存数据系统的主要性能瓶颈之一。数据库系统研究人员提出了几个创新的解决方案,以尽量减少数据移动并使数据访问模式对硬件敏感。然而,给定查询的所有相关行和列都必须在内存层次结构中移动;因此,大型数据集的移动处于关键路径上。在本文中,我们提出了JAFAR,一个近数据处理(NDP)加速器,用于在现代列存储中将选择推入内存。JAFAR实现了select操作符,只允许符合条件的数据在内存层次结构中向上传递。通过对JAFAR硬件的详细模拟,我们表明它有潜力为列存储中的选择提供9倍的改进。此外,我们还讨论了在数据库系统中使用NDP所面临的硬件和软件挑战,以及进一步使用NDP加速器来提升其他关系操作符的机会。
{"title":"Beyond the Wall: Near-Data Processing for Databases","authors":"S. Xi, Oreoluwatomiwa O. Babarinsa, Manos Athanassoulis, Stratos Idreos","doi":"10.1145/2771937.2771945","DOIUrl":"https://doi.org/10.1145/2771937.2771945","url":null,"abstract":"The continuous growth of main memory size allows modern data systems to process entire large-scale datasets in memory. The increase in memory capacity, however, is not matched by a proportional decrease in memory latency, causing a mismatch for in-memory processing. As a result, data movement through the memory hierarchy is now one of the main performance bottlenecks for main memory data systems. Database systems researchers have proposed several innovative solutions to minimize data movement and to make data access patterns hardware-aware. Nevertheless, all relevant rows and columns for a given query have to be moved through the memory hierarchy; hence, movement of large data sets is on the critical path. In this paper, we present JAFAR, a Near-Data Processing (NDP) accelerator for pushing selects down to memory in modern column-stores. JAFAR implements the select operator and allows only qualifying data to travel up the memory hierarchy. Through a detailed simulation of JAFAR hardware we show that it has the potential to provide a 9x improvement for selects in column-stores. In addition, we discuss both hardware and software challenges for using NDP in database systems as well as opportunities for further NDP accelerators to boost additional relational operators.","PeriodicalId":267524,"journal":{"name":"Proceedings of the 11th International Workshop on Data Management on New Hardware","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123752267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 85
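The select-pushdown idea in this abstract, evaluating the filter near memory so that only qualifying positions travel up the hierarchy, can be sketched in a few lines. The function names below are illustrative, not the paper's interface:

```python
# Minimal sketch of near-data select pushdown in a column store.
# jafar_select/gather are hypothetical names, not the paper's API.

def jafar_select(column, predicate):
    """Simulated near-memory select: returns only qualifying row ids,
    so non-matching values never travel up the memory hierarchy."""
    return [i for i, v in enumerate(column) if predicate(v)]

def gather(column, row_ids):
    """Downstream operators fetch only the qualifying positions."""
    return [column[i] for i in row_ids]

price = [5, 42, 17, 99, 3]
qty   = [1, 2, 3, 4, 5]

ids = jafar_select(price, lambda v: v > 10)  # select pushed "down"
print(ids)                # [1, 2, 3]
print(gather(qty, ids))   # [2, 3, 4]
```

In the simulated hardware, the savings come from moving only the qualifying row ids (and later the matching values) across the memory bus, rather than the full columns.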