
Latest publications: Advances in database technology : proceedings. International Conference on Extending Database Technology

Spatial Structure-Aware Road Network Embedding via Graph Contrastive Learning
Yanchuan Chang, E. Tanin, Xin Cao, Jianzhong Qi
Road networks are widely used as a fundamental structure in urban transportation studies. In recent years, with more research leveraging deep learning to solve conventional transportation problems, how to obtain robust road network representations (i.e., embeddings) applicable to a wide range of applications has become a fundamental need. Existing studies mainly adopt graph embedding methods. Such methods, however, foremost learn the topological correlations of road networks but ignore the spatial structure (i.e., spatial correlations), which is also important in applications such as querying similar trajectories. Besides, most studies learn task-specific embeddings in a supervised manner, such that the embeddings are sub-optimal when used for new tasks. It is inefficient to store or learn dedicated embeddings for every different task in a large transportation system. To tackle these issues, we propose a model named SARN to learn generic and task-agnostic road network embeddings based on self-supervised contrastive learning. We present (i) a spatial similarity matrix to help learn the spatial correlations of the roads, (ii) a sampling strategy based on the spatial structure of a road network to form self-supervised training samples, and (iii) a two-level loss function to guide SARN to learn embeddings based on both local and global contrasts of similar and dissimilar road segments. Experimental results on three downstream tasks over real-world road networks show that SARN outperforms state-of-the-art self-supervised models consistently and achieves comparable (or even better) performance to supervised models.
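The abstract does not spell out how the spatial similarity matrix is computed. As a purely illustrative sketch (the exponential decay form, the `d_scale` parameter, and the `(x, y, heading)` segment encoding are assumptions, not taken from the paper), one plausible formulation lets similarity decay with midpoint distance and heading difference:

```python
import math

def spatial_similarity(segments, d_scale=100.0):
    """Pairwise spatial similarity for road segments given as
    (x, y, heading_degrees) midpoints. Hypothetical formulation:
    similarity decays exponentially with Euclidean distance and
    linearly with the (wrapped) heading difference."""
    n = len(segments)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            (xi, yi, hi), (xj, yj, hj) = segments[i], segments[j]
            dist = math.hypot(xi - xj, yi - yj)
            # wrap the angular difference into [0, 180], scale to [0, 1]
            ang = min(abs(hi - hj), 360 - abs(hi - hj)) / 180.0
            sim[i][j] = math.exp(-dist / d_scale) * (1.0 - 0.5 * ang)
    return sim
```

A segment is maximally similar (1.0) to itself; a parallel segment 100 m away, or a perpendicular segment at the same spot, scores lower, which is the kind of spatial contrast signal a contrastive sampler could exploit.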
DOI: 10.48786/edbt.2023.12, pp. 144-156
Citations: 6
Workload-Aware Query Recommendation Using Deep Learning
E. Y. Lai, Zainab Zolaktaf, Mostafa Milani, Omar AlOmeir, Jianhao Cao, R. Pottinger
Users interact with databases by writing sequences of SQL queries that are often stored in query workloads. Current SQL query recommendation approaches make little use of query workloads. Our work presents a novel workload-aware approach to query recommendation. We use deep learning prediction models trained on query pairs extracted from large-scale query workloads to build our approach. Our algorithms suggest contextual (query fragments) and structural (query templates) information to aid users in formulating their next query. We evaluate our algorithms on two real-world datasets: the Sloan Digital Sky Survey (SDSS) and SQLShare. We perform a thorough analysis of the workloads and then empirically show that our workload-aware, deep-learning approach vastly outperforms known collaborative filtering approaches.
DOI: 10.48786/edbt.2023.05, pp. 53-65
Citations: 0
Density-Based Geometry Compression for LiDAR Point Clouds
Xibo Sun, Qiong Luo
LiDAR (Light Detection and Ranging) sensors produce 3D point clouds that capture the surroundings, and these data are used in applications such as autonomous driving, traffic monitoring, and remote surveys. LiDAR point clouds are usually compressed for efficient transmission and storage. However, to achieve a high compression ratio, existing work often sacrifices the geometric accuracy of the data, which hurts the effectiveness of downstream applications. Therefore, we propose a system that achieves a high compression ratio while preserving geometric accuracy. In our method, we first perform density-based clustering to distinguish the dense points from the sparse ones, because they are suitable for different compression methods. The clustering algorithm is optimized for our purpose and its parameter values are set to preserve accuracy. We then compress the dense points with an octree, and organize the sparse ones into polylines to reduce the redundancy. We further propose to compress the sparse points on the polylines by their spherical coordinates, considering the properties of both the LiDAR sensors and the real-world scenes. Finally, we design suitable schemes to compress the remaining sparse points not on any polyline. Experimental results on DBGC, our prototype system, show that our scheme compressed large-scale real-world datasets by up to 19 times with an error bound under 0.02 meters for scenes of thousands of cubic meters. This result, together with the fast compression speed of DBGC, demonstrates the online compression of LiDAR data with high accuracy. Our source code is publicly available at https://github.com/RapidsAtHKUST/DBGC.
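DBGC's exact encoding is in the paper, not this abstract; the sketch below only illustrates the underlying idea of representing sparse points in the sensor's natural spherical frame (range, azimuth, elevation) and bounding the error with a fixed quantization step. The `quantize` helper and its `step` parameter are hypothetical, not from the paper:

```python
import math

def to_spherical(x, y, z):
    """Cartesian -> (range, azimuth, elevation), the natural frame of a
    spinning LiDAR sensor, where sparse far-field points compress well."""
    r = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(y, x)
    elevation = math.asin(z / r) if r > 0 else 0.0
    return r, azimuth, elevation

def quantize(value, step):
    """Hypothetical fixed-step quantizer; the reconstruction error is
    bounded by step / 2, which is how a scheme like this can guarantee
    an explicit error bound (e.g., 0.02 m on the range coordinate)."""
    return round(value / step) * step
```

Quantizing the range with `step=0.02` keeps every reconstructed range within 0.01 m of the original, consistent with the kind of sub-0.02 m error bound the abstract reports.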
DOI: 10.48786/edbt.2023.30, pp. 378-390
Citations: 0
MM-quecat: A Tool for Unified Querying of Multi-Model Data
P. Koupil, Daniel Crha, I. Holubová
The dawn of multi-model data has brought many challenges to most aspects of data management. In addition, no standards exist focusing on how the models should be combined and managed. This paper focuses on the problems related to multi-model querying. We introduce MM-quecat, a tool that enables one to query multi-model data regardless of the underlying multi-model database or polystore. Using category theory, we provide a unified abstract representation of multi-model data, which can be viewed as a graph and, thus, queried using a SPARQL-based query language. Moreover, the support for cross-model redundancy enables the choice of the optimal multi-model query strategy.
DOI: 10.48786/edbt.2023.76, pp. 831-834
Citations: 0
Adaptive Real-time Virtualization of Legacy ETL Pipelines in Cloud Data Warehouses
Ehab Abdelhamid, Nikos Tsikoudis, M. Duller, Marc B. Sugiyama, Nicholas E. Marino, F. Waas
Extract, Transform, and Load (ETL) pipelines are widely used to ingest data into Enterprise Data Warehouse (EDW) systems. These pipelines can be very complex and often tightly coupled to a given EDW, making it challenging to upgrade from a legacy EDW to a Cloud Data Warehouse (CDW). This paper presents a novel solution for a transparent and fully-automated porting of legacy ETL pipelines to CDW environments.
DOI: 10.48786/edbt.2023.64, pp. 765-772
Citations: 0
A new PET for Data Collection via Forms with Data Minimization, Full Accuracy and Informed Consent
N. Anciaux, S. Frittella, Baptiste Joffroy, Benjamin Nguyen, Guillaume Scerri
The advent of privacy laws and principles such as data minimization and informed consent is supposed to protect citizens from the over-collection of personal data. Nevertheless, current processes, mainly based on filling in forms, still rely on practices that lead to over-collection. Indeed, any citizen wishing to apply for a benefit (or service) will transmit all their personal data involved in the evaluation of the eligibility criteria. The resulting problem of over-collection affects millions of individuals, with considerable volumes of information collected. This compliance problem concerns both public and private organizations (e.g., social services, banks, insurance companies) because it raises non-trivial issues that hinder the implementation of data minimization by developers. In this paper, we propose a new modeling approach that enables data minimization and informed choices for the users, for any decision problem modeled using classical logic, which covers a wide range of practical cases. Our data minimization solution uses game-theoretic notions to explain and quantify the privacy payoff for the user. We show how our algorithms can be applied to practical case studies as a new PET for minimal, fully accurate (all due services must be preserved) and informed data collection.
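The paper's game-theoretic machinery is not reproduced in the abstract. As a minimal sketch of the data-minimization idea only, assuming eligibility is a monotone predicate over disclosed attributes (all names and the brute-force search are illustrative, not the paper's algorithm), one can look for a smallest attribute subset that already entails eligibility:

```python
from itertools import combinations

def minimal_disclosure(attributes, eligible):
    """Smallest subset of the user's attributes whose disclosure alone
    proves eligibility. `eligible` takes a dict of disclosed attributes
    and returns True only if the criteria are provable from them alone
    (a monotone predicate). Brute force over subsets, smallest first;
    fine for form-sized attribute counts, exponential in general."""
    names = list(attributes)
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            disclosed = {a: attributes[a] for a in subset}
            if eligible(disclosed):
                return disclosed
    return None  # not eligible even with full disclosure
```

For a hypothetical criterion "age >= 65, or income < 15000 and veteran", an eligible 70-year-old only needs to disclose their age; income and veteran status stay private, which is exactly the over-collection the abstract argues current forms cause.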
DOI: 10.48786/edbt.2024.08, pp. 81-93
Citations: 1
Exploration of Approaches for In-Database ML
Steffen Kläbe, Stefan Hagedorn, K. Sattler
Database systems are no longer used only for the storage of plain structured data and basic analyses. An increasing role is also played by the integration of ML models, e.g., neural networks with specialized frameworks, and their use for classification or prediction. However, using such models on data stored in a database system might require downloading the data and performing the computations outside. In this paper, we evaluate approaches for integrating the ML inference step as a special query operator - the ModelJoin. We explore several options for this integration at different abstraction levels: a relational representation of the models together with SQL queries for inference, the use of UDFs, the use of APIs to existing ML runtimes, and a native implementation of the ModelJoin as a query operator supporting both CPU and GPU execution. Our evaluation results show that integrating ML runtimes over APIs performs similarly to a native operator while being generic enough to support arbitrary model types. The solution of relational representation and SQL queries is the most portable and works well for smaller inputs without any changes needed in the database engine.
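The paper's actual schemas are not given in the abstract. A minimal sketch of the "relational representation of the models plus SQL queries for inference" option: a linear model stored as a weight table and scored with a join plus aggregation, here in SQLite (table and column names are illustrative assumptions):

```python
import sqlite3

# Sketch: score(row) = sum_i weight_i * feature_i, expressed entirely
# in SQL as a join between a sparse feature table and a weight table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE features (row_id INT, feat_id INT, value REAL);
    CREATE TABLE weights  (feat_id INT, w REAL);
""")
con.executemany("INSERT INTO features VALUES (?,?,?)",
                [(1, 0, 2.0), (1, 1, 3.0), (2, 0, 1.0), (2, 1, -1.0)])
con.executemany("INSERT INTO weights VALUES (?,?)",
                [(0, 0.5), (1, 2.0)])

# Inference is a plain SQL query; no data leaves the database.
rows = con.execute("""
    SELECT f.row_id, SUM(f.value * w.w) AS score
    FROM features f JOIN weights w ON f.feat_id = w.feat_id
    GROUP BY f.row_id ORDER BY f.row_id
""").fetchall()
# row 1: 2.0*0.5 + 3.0*2.0 = 7.0; row 2: 1.0*0.5 + (-1.0)*2.0 = -1.5
```

This illustrates why the abstract calls this option the most portable: it needs nothing but standard SQL, at the cost of verbosity for deeper models (each layer becomes another join and non-linearities need CASE expressions or UDFs).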
DOI: 10.48786/edbt.2023.25, pp. 311-323
Citations: 2
COVIDKG.ORG - a Web-scale COVID-19 Interactive, Trustworthy Knowledge Graph, Constructed and Interrogated for Bias using Deep-Learning
Bhimesh Kandibedala, A. Pyayt, Nick Piraino, Chris Caballero, M. Gubanov
We describe a Web-scale interactive Knowledge Graph (KG), populated with trustworthy information from the latest published medical findings on COVID-19. Currently existing, socially maintained KGs, such as YAGO or DBPedia, and more specialized medical ontologies, such as NCBI and other virus- and COVID-19-related ones, become stale very quickly and lack the latest COVID-19 medical findings; most importantly, they lack any scalable mechanism to keep them up to date. Here we describe COVIDKG.ORG, an online, interactive, trustworthy COVID-19 Web-scale Knowledge Graph and several advanced search engines. Its content is extracted and updated from the latest medical research. Because of that, it does not suffer from the bias or misinformation that often dominate public information sources.
DOI: 10.48786/edbt.2023.63, pp. 757-764
Citations: 0
WedgeBlock: An Off-Chain Secure Logging Platform for Blockchain Applications
Abhishek A. Singh, Yinan Zhou, Mohammad Sadoghi, S. Mehrotra, Sharad Sharma, Faisal Nawab
Over recent years, there has been a growing interest in building blockchain-based decentralized applications (DApps). Developing DApps faces many challenges due to the cost and high latency of writing to a blockchain smart contract. We propose WedgeBlock, a secure data logging infrastructure for DApps. WedgeBlock's design reduces the performance and monetary cost of DApps with its main technical innovation, called lazy-minimum trust (LMT). LMT combines the following features: (1) an off-chain storage component; (2) lazy on-chain writes of digests of the data, rather than all data, to minimize costs; and (3) an integrated trust mechanism that ensures the detection and punishment of malicious acts by the Offchain Node. Our experiments show that WedgeBlock is up to 1470× faster and 310× cheaper than a baseline solution of writing directly on chain.
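The abstract does not define LMT's digest format. A minimal sketch of the general idea only, assuming a hash chain over batches of off-chain log entries where just the 32-byte batch digest would be written on chain (the chaining and batching scheme here is illustrative, not WedgeBlock's protocol):

```python
import hashlib

def batch_digest(entries, prev_digest=b"\x00" * 32):
    """Hash-chain digest over a batch of off-chain log entries.
    Only this 32-byte value would go on chain; the entries themselves
    stay in off-chain storage. Chaining in the previous batch's digest
    ties batches together, so tampering with any stored entry (or
    reordering batches) breaks verification against the on-chain value."""
    h = hashlib.sha256(prev_digest)
    for entry in entries:
        h.update(hashlib.sha256(entry).digest())
    return h.digest()

def verify(entries, prev_digest, onchain_digest):
    """Recompute the digest from the off-chain data and compare."""
    return batch_digest(entries, prev_digest) == onchain_digest
```

This shows where the cost saving comes from: one small on-chain write amortized over an arbitrarily large batch, with tampering by the off-chain party still detectable after the fact.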
DOI: 10.48786/edbt.2023.45, pp. 526-539
Citations: 0
Streaming Weighted Sampling over Join Queries
Michael Shekelyan, Graham Cormode, Qingzhi Ma, A. Shanghooshabad, P. Triantafillou
Join queries are a fundamental database tool, capturing a range of tasks that involve linking heterogeneous data sources. However, with massive table sizes, it is often impractical to keep these in memory, and we can only take one or a few streaming passes over them. Moreover, building out the full join result (e.g., linking heterogeneous data sources along quasi-identifiers) can lead to a combinatorial explosion of results due to many-to-many links. Random sampling is a natural tool to boil this oversized result down to a representative subset with well-understood statistical properties, but turns out to be a challenging task due to the combinatorial nature of the sampling domain. Existing techniques in the literature focus solely on the setting with tabular data residing in main memory, and do not address aspects such as stream operation, weighted sampling and more general join operators that are urgently needed in a modern data processing context. The main contribution of this work is to meet these needs with more lightweight practical approaches. First, a bijection between the sampling problem and a graph problem is introduced to support weighted sampling and common join operators. Second, the sampling techniques are refined to minimise the number of streaming passes. Third, techniques are presented to deal with very large tables under limited memory. Finally, the proposed techniques are compared to existing approaches that rely on database indices and the results indicate substantial memory savings, reduced runtimes for ad-hoc queries and competitive amortised runtimes. All pertinent code and data can be found at:
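The paper's join-aware sampler is not reproduced here. As background only, the standard single-pass building block for weighted sampling without replacement over a stream, which approaches in this space commonly build on, is the Efraimidis-Spirakis key method:

```python
import heapq
import random

def weighted_reservoir(stream, k, rng=random):
    """Single-pass weighted sample without replacement
    (Efraimidis-Spirakis): draw key u**(1/w) per item, with
    u ~ Uniform(0,1) and w > 0 its weight, and keep the k items
    with the largest keys in a min-heap."""
    heap = []  # min-heap of (key, item); heap[0] is the smallest key
    for item, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

Each item is inspected once in O(log k), so memory stays at k items regardless of stream length, which is the property that matters when the join result is too large to materialize.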
{"title":"Streaming Weighted Sampling over Join Queries","authors":"Michael Shekelyan, Graham Cormode, Qingzhi Ma, A. Shanghooshabad, P. Triantafillou","doi":"10.48786/edbt.2023.24","DOIUrl":"https://doi.org/10.48786/edbt.2023.24","url":null,"abstract":"Join queries are a fundamental database tool, capturing a range of tasks that involve linking heterogeneous data sources. However, with massive table sizes, it is often impractical to keep these in memory, and we can only take one or few streaming passes over them. Moreover, building out the full join result (e.g., linking heterogeneous data sources along quasi-identifiers) can lead to a combinatorial explosion of results due to many-to-many links. Random sampling is a natural tool to boil this oversized result down to a representative subset with well-understood statistical properties, but turns out to be a challenging task due to the combinatorial nature of the sampling domain. Existing techniques in the literature focus solely on the setting with tabular data resid-ing in main memory, and do not address aspects such as stream operation, weighted sampling and more general join operators that are urgently needed in a modern data processing context. The main contribution of this work is to meet these needs with more lightweight practical approaches. First, a bijection between the sampling problem and a graph problem is introduced to support weighted sampling and common join operators. Second, the sampling techniques are refined to minimise the number of streaming passes. Third, techniques are presented to deal with very large tables under limited memory. Finally, the proposed techniques are compared to existing approaches that rely on database indices and the results indicate substantial memory savings, reduced runtimes for ad-hoc queries and competitive amortised runtimes. All pertinent code and data can be found at:","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. 
International Conference on Extending Database Technology","volume":"17 1","pages":"298-310"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84911541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
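As background for the weighted-sampling setting the abstract above targets, a standard one-pass weighted reservoir sampler (the Efraimidis–Spirakis A-Res scheme) can be sketched as below. This is a generic illustration of weighted sampling over a stream under limited memory, not the paper's join-sampling algorithm; the function name and interface are hypothetical.

```python
import heapq
import random

def weighted_stream_sample(stream, k, rng=random.random):
    """One-pass weighted sampling without replacement (A-Res scheme).

    Each (item, weight) pair draws a key u ** (1 / w) with u uniform in
    (0, 1); the k items with the largest keys form the sample. Memory is
    O(k) regardless of stream length, so a single streaming pass suffices.
    """
    heap = []  # min-heap of (key, item); the root is the weakest survivor
    for item, w in stream:
        if w <= 0:
            continue  # non-positive weights can never be sampled
        key = rng() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# Items with larger weights are proportionally more likely to survive.
sample = weighted_stream_sample([("a", 1.0), ("b", 2.0), ("c", 5.0)], k=2)
```

Heavier items get keys closer to 1, so they dominate the reservoir in expectation, which is the behaviour one wants when sampling join results whose tuples carry unequal weights.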
Journal
Advances in database technology : proceedings. International Conference on Extending Database Technology
Copyright © 2023 Book学术 All rights reserved.