Predicting an Optimal Virtual Data Model for Uniform Access to Large Heterogeneous Data

Data Intelligence (IF 1.3; Zone 3, Computer Science; JCR Q3, Computer Science, Information Systems) Pub Date: 2023-06-13 DOI: 10.1162/dint_a_00216

Chahrazed B. Bachir Belmehdi, A. Khiat, Nabil Keskes


Citations: 0

Abstract

The growth of data generated in industry calls for new, efficient big data integration approaches that give end-users uniform access to data and so support better business operations. Data virtualization systems, including Ontology-Based Data Access (OBDA), query data on-the-fly against the original data sources without any prior data materialization. Existing approaches use, by design, a fixed model (e.g., TABULAR) as the only Virtual Data Model, a uniform schema built on-the-fly to load, transform, and join relevant data. Other data models, such as GRAPH or DOCUMENT, are more flexible and can therefore be more suitable for some common types of queries, such as join or nested queries. The best model for such queries is hard to predict because it depends on many criteria, such as query plan, data model, data size, and operations. To address the problem of selecting the optimal virtual data model for queries on large datasets, we present a new approach that (1) builds on the principle of OBDA to query and join large heterogeneous data in a distributed manner and (2) calls a deep learning method to predict the optimal virtual data model using features extracted from SPARQL queries. OPTIMA, the implementation of our approach, currently leverages state-of-the-art Big Data technologies (Apache Spark and GraphX), implements two virtual data models, GRAPH and TABULAR, and supports five data source models out of the box: property graph, document-based, wide-columnar, relational, and tabular, stored in Neo4j, MongoDB, Cassandra, MySQL, and CSV, respectively. Extensive experiments show that our approach returns the optimal virtual model with an accuracy of 0.831, which reduces query execution time by over 40% when the tabular model is selected and by over 30% when the graph model is selected.
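The abstract does not list which features are extracted from SPARQL queries, so the following is only a minimal sketch of the general idea: turning a query string into a numeric feature dictionary that a classifier could consume. The feature set, the `sparql_features` function, and the example query with its `example.org` URIs are all hypothetical and are not taken from OPTIMA.

```python
import re

# Hypothetical feature set: the abstract does not say which features OPTIMA uses.
KEYWORDS = ("FILTER", "OPTIONAL", "UNION", "ORDER BY", "GROUP BY", "LIMIT")


def sparql_features(query: str) -> dict:
    """Count a few syntactic features of a SPARQL query string."""
    upper = query.upper()
    features = {}
    for kw in KEYWORDS:
        # Allow any whitespace between the words of two-word keywords.
        pattern = r"\b" + kw.replace(" ", r"\s+") + r"\b"
        name = "n_" + kw.lower().replace(" ", "_")
        features[name] = len(re.findall(pattern, upper))
    # Distinct variables (?x or $x) used anywhere in the query.
    features["n_variables"] = len(set(re.findall(r"[?$](\w+)", query)))
    return features


# Illustrative query; the URIs are placeholders.
query = """
SELECT ?name ?price WHERE {
  ?p <http://example.org/name> ?name .
  ?p <http://example.org/price> ?price .
  FILTER(?price > 10)
} ORDER BY ?price LIMIT 5
"""

features = sparql_features(query)
print(features)
```

A vector like this could then be passed to any trained classifier that outputs GRAPH or TABULAR. The deep learning model described in the abstract is not reproduced here.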




Source journal