Conference on Innovative Data Systems Research: Latest Papers


Lessons Learned from Managing a Petabyte

Conference on Innovative Data Systems Research

Pub Date : 2005-01-20 DOI: 10.2172/839755

J. Becla, Daniel L. Wang

The amount of data collected and stored by the average business doubles each year. Many commercial databases are already approaching hundreds of terabytes, and at this rate, will soon be managing petabytes. More data enables new functionality and capability, but the larger scale reveals new problems and issues hidden in "smaller" terascale environments. This paper presents some of these new problems along with implemented solutions in the framework of a petabyte dataset for a large High Energy Physics experiment. Through experience with two persistence technologies, a commercial database and a file-based approach, we expose format-independent concepts and issues prevalent at this new scale of computing.


Citations: 50

A Case for Staged Database Systems

Conference on Innovative Data Systems Research

Pub Date : 1900-01-01 DOI: 10.1007/978-0-387-39940-9_3675

S. Harizopoulos, A. Ailamaki

Citations: 75

Cache-Oblivious Query Processing

Conference on Innovative Data Systems Research

Pub Date : 1900-01-01 DOI: 10.14711/thesis-b1029228

Bingsheng He, Qiong Luo

As CPU caches have become a performance bottleneck for main memory databases, optimizing the cache performance is essential for high-performance query processing on relational databases. Cache-oblivious techniques, proposed by the theory community, have optimal asymptotic bounds on the amount of data transferred between any two adjacent levels of an arbitrary memory hierarchy. Moreover, this optimal performance is achieved without any hardware platform specific tuning. These properties are highly attractive to autonomous databases, especially because the hardware architectures are becoming increasingly complex and diverse. In this thesis, we present the design, implementation, and evaluation of the first cache-oblivious, in-memory query processor, EaseDB. All query processing algorithms in EaseDB are designed to be cache-oblivious and match the performance of their cache-conscious counterparts. Moreover, we discuss the inherent limitations of the cache-oblivious approach as well as the opportunities given by the upcoming hardware architectures. Specifically, a cache-oblivious technique usually requires sophisticated algorithm design to achieve a comparable performance to its cache-conscious counterpart. Nevertheless, this development-time effort is compensated by the automaticity of performance achievement and the reduced ownership cost. We evaluate EaseDB in comparison with its cache-conscious counterparts on different architectures including Intel, AMD and Ultra-Sparc processors. Our results, with homegrown workloads and micro benchmarks, show that our cache-oblivious algorithms achieve a performance comparable to their fine-tuned cache-conscious counterparts. Moreover, cache-oblivious techniques can outperform their cache-conscious counterparts in multi-threading processors.



Citations: 17

(Almost) Hands-Off Information Integration for the Life Sciences

Conference on Innovative Data Systems Research

Pub Date : 1900-01-01 DOI: 10.18452/9201

U. Leser, Felix Naumann

Data integration in complex domains, such as the life sciences, involves either manual data curation, offering highest information quality at highest price, or follows a schema integration and mapping approach, leading to moderate information quality at a moderate price. We suggest a radically different integration approach, called ALADIN, for the life sciences application domain. The predominant feature of the ALADIN system is an architecture that allows almost automatic integration of new data sources into the system, i.e., it offers data integration at almost no cost. We suggest a novel combination of data and text mining, schema matching, and duplicate detection to combat the reduction in information quality that seems inevitable when demanding a high degree of automatism. These heuristics can also lead to the detection of previously unknown or unseen relationships between objects, thus directly supporting the discovery-based work of life science researchers. We argue that such a system is a valuable contribution in two areas. First, it offers challenging and new problems for database research. Second, the ALADIN system would be a valuable knowledge resource for life science research.



Citations: 33

DPI: The Data Processing Interface for Modern Networks

Conference on Innovative Data Systems Research

Pub Date : 1900-01-01 DOI: 10.18420/btw2019-ws-02

G. Alonso, Carsten Binnig, I. Pandis, K. Salem, Jan Skrzypczak, Ryan Stutsman, Lasse Thostrup, Tianzheng Wang, Zeke Wang, Tobias Ziegler

As data processing evolves towards large scale, distributed platforms, the network will necessarily play a substantial role in achieving efficiency and performance. Increasingly, switches, network cards, and protocols are becoming more flexible while programmability at all levels (aka, software defined networks) opens up many possibilities to tailor the network to data processing applications and to push processing down to the network elements. In this paper, we propose DPI, an interface providing a set of simple yet powerful abstractions flexible enough to exploit features of modern networks (e.g., RDMA or in-network processing) suitable for data processing. Mirroring the concept behind the Message Passing Interface (MPI) used extensively in high-performance computing, DPI is an interface definition rather than an implementation so as to be able to bridge different networking technologies and to evolve with them. In the paper we motivate and discuss key primitives of the interface and present a number of use cases that show the potential of DPI for data-intensive applications, such as analytic engines and distributed database systems.



Citations: 31
