An approach for automatic data virtualization

Proceedings. 13th IEEE International Symposium on High performance Distributed Computing, 2004. Pub Date : 2004-06-04 DOI:10.1109/HPDC.2004.2

L. Weng, G. Agrawal, Ümit V. Çatalyürek, T. Kurç, S. Narayanan, J. Saltz


Citations: 44

Abstract

Analysis of large and/or geographically distributed scientific datasets is emerging as a key component of grid computing. One challenge in this area is that scientific datasets are typically stored as binary or character flat files, which makes specifying the processing much harder. In view of this, there has been recent interest in data virtualization and in data services to support such virtualization. This paper presents an approach for automatically creating data services to support data virtualization. Specifically, we show how a relational-table-like data abstraction can be supported for complex multidimensional scientific datasets that are resident on a cluster. We have designed and implemented a tool that processes SQL queries (with SELECT and WHERE clauses) on multidimensional datasets. We have designed a meta-data description language that is used for specifying the data layout. From such a description, our tool automatically generates efficient data subsetting and access functions. We have extensively evaluated our system. The key observations from our experiments are as follows. First, our tool can correctly and efficiently handle a variety of different data layouts. Second, our system scales well as the number of nodes or the amount of data increases. Third, the performance of the automatically generated code for indexing and contracting functions is quite comparable to the performance of hand-written code.
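To make the idea of a relational view over a flat file concrete, the following is a minimal sketch, not the authors' tool or their meta-data description language. It assumes a hypothetical `Layout` descriptor that lists attribute names and a fixed-size binary record format, and a hypothetical `select` function that plays the role of a SELECT ... WHERE query by scanning records and filtering them with a predicate. A real system along the lines described in the abstract would generate indexing and subsetting code from the layout description rather than scan every record.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical layout descriptor for fixed-size binary records."""
    fields: tuple  # attribute names, in on-disk order
    fmt: str       # struct format string for one record, e.g. "<iif"


def select(data: bytes, layout: Layout, columns, where):
    """Yield tuples of `columns` for every record whose row dict satisfies `where`."""
    record = struct.Struct(layout.fmt)
    for values in record.iter_unpack(data):
        row = dict(zip(layout.fields, values))
        if where(row):
            yield tuple(row[c] for c in columns)


# Illustrative dataset: a 3x3 grid of (x, y, temp) records packed as a flat byte string.
layout = Layout(fields=("x", "y", "temp"), fmt="<iif")
raw = b"".join(
    struct.pack(layout.fmt, x, y, float(x * 10 + y))
    for x in range(3)
    for y in range(3)
)

# Roughly: SELECT x, temp FROM grid WHERE y = 1 AND x >= 1
result = list(select(raw, layout, ("x", "temp"), lambda r: r["y"] == 1 and r["x"] >= 1))
print(result)
```

The descriptor keeps the query code independent of the on-disk format: changing the record layout only requires a different `Layout` value, which is the kind of separation a layout description language is meant to provide.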

