The VLDB Journal: Latest Articles

A multi-facet analysis of BERT-based entity matching models
Pub Date : 2023-11-29 DOI: 10.1007/s00778-023-00824-x
Matteo Paganelli, Donato Tiano, Francesco Guerra

State-of-the-art Entity Matching (EM) approaches rely on transformer architectures, such as BERT, to generate highly contextualized embeddings of terms. The embeddings are then used to predict whether pairs of entity descriptions refer to the same real-world entity. BERT-based EM models have proven effective, but they act as black boxes for users, who have limited insight into the reasons behind their decisions. In this paper, we perform a multi-facet analysis of the components of pre-trained and fine-tuned BERT architectures applied to an EM task. The main findings resulting from our extensive experimental evaluation are: (1) the fine-tuning process applied to the EM task mainly modifies the last layers of the BERT components, but in a different way on tokens belonging to descriptions of matching/non-matching entities; (2) the special structure of the EM datasets, where records are pairs of entity descriptions, is recognized by BERT; (3) the pair-wise semantic similarity of tokens is not key knowledge exploited by BERT-based EM models; (4) fine-tuning SBERT, a version of BERT pre-trained on the sentence similarity task, i.e., a task close to EM, neither substantially improves effectiveness nor lets the model learn different forms of knowledge. Approaches customized for EM, such as Ditto and SupCon, seem to rely on the same knowledge as the other transformer-based models. Only contrastive learning training allows SupCon to learn different knowledge from matching and non-matching entity descriptions; (5) the fine-tuning process based on a binary classifier does not allow the model to learn key distinctive features of the entity descriptions.
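
As a rough illustration of the "pairs of entity descriptions" structure mentioned in finding (2), the sketch below shows one common way an EM record can be flattened into a single BERT-style input string. The function names, the sample records, and the `[COL]`/`[VAL]` attribute markers are assumptions made for this example (the abstract does not specify a serialization), and real pipelines would normally let the tokenizer insert `[CLS]` and `[SEP]`.

```python
def serialize_entity(record: dict) -> str:
    """Flatten one entity description into 'attribute marker + value' text."""
    # [COL]/[VAL] markers are an assumed convention, not taken from the abstract.
    return " ".join(f"[COL] {name} [VAL] {value}" for name, value in record.items())


def serialize_pair(left: dict, right: dict) -> str:
    """Join two entity descriptions into one sentence-pair style input."""
    return f"[CLS] {serialize_entity(left)} [SEP] {serialize_entity(right)} [SEP]"


if __name__ == "__main__":
    # Hypothetical product records describing the same real-world entity.
    a = {"title": "iPhone 13", "brand": "Apple"}
    b = {"title": "Apple iPhone 13"}
    print(serialize_pair(a, b))
```

A binary classifier on top of the `[CLS]` embedding would then predict match or non-match for the serialized pair.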

Citations: 0
Givens rotations for QR decomposition, SVD and PCA over database joins
Pub Date : 2023-11-23 DOI: 10.1007/s00778-023-00818-9
Dan Olteanu, Nils Vortmeier, Ɖorđe Živanović

This article introduces FiGaRo, an algorithm for computing the upper-triangular matrix in the QR decomposition of the matrix defined by the natural join over relational data. FiGaRo's main novelty is that it pushes the QR decomposition past the join. This leads to several desirable properties. For acyclic joins, it takes time linear in the database size and independent of the join size. Its execution is equivalent to the application of a sequence of Givens rotations proportional to the join size. Its number of rounding errors relative to the classical QR decomposition algorithms is on par with the database size relative to the join output size. The QR decomposition lies at the core of many linear algebra computations, including the singular value decomposition (SVD) and the principal component analysis (PCA). We show how FiGaRo can be used to compute the orthogonal matrix in the QR decomposition, the SVD and the PCA of the join output without the need to materialize the join output. A suite of experiments validates that FiGaRo can outperform the LAPACK library Intel MKL in both runtime performance and numerical accuracy by a factor proportional to the gap between the sizes of the join output and input.
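
For readers unfamiliar with Givens rotations, the following is a minimal sketch of the classical dense procedure that computes the upper-triangular factor R of a fully materialized matrix. It is not FiGaRo: it works on an explicit matrix rather than on a join, and the function name `givens_qr_r` is an assumption made for this example.

```python
import math


def givens_qr_r(rows):
    """Return the upper-triangular factor R of a dense matrix via Givens rotations."""
    a = [[float(v) for v in row] for row in rows]
    m, n = len(a), len(a[0])
    for j in range(min(m, n)):
        # Work upward from the last row, zeroing a[i][j] against the row above it.
        for i in range(m - 1, j, -1):
            x, y = a[i - 1][j], a[i][j]
            if y == 0.0:
                continue
            r = math.hypot(x, y)
            c, s = x / r, y / r
            # Apply the 2x2 rotation [[c, s], [-s, c]] to rows i-1 and i.
            for k in range(j, n):
                u, v = a[i - 1][k], a[i][k]
                a[i - 1][k] = c * u + s * v
                a[i][k] = -s * u + c * v
            a[i][j] = 0.0  # exact zero, removing rounding residue
    return a


if __name__ == "__main__":
    for row in givens_qr_r([[1, 2], [3, 4], [5, 6]]):
        print(row)
```

Each rotation touches only two rows, which is the property that makes it plausible to reason about rotations over the structure of a join instead of over its materialized output.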

Citations: 2
A survey on the evolution of stream processing systems
Pub Date : 2023-11-22 DOI: 10.1007/s00778-023-00819-8
Marios Fragkoulis, Paris Carbone, Vasiliki Kalavri, Asterios Katsifodimos

Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between the first (’00–’10) and second (’11–’23) generation of stream processing systems, and discuss future trends and open problems.

Citations: 37
Alfa: active learning for graph neural network-based semantic schema alignment
Pub Date : 2023-11-21 DOI: 10.1007/s00778-023-00822-z
Venkata Vamsikrishna Meduri, Abdul Quamar, Chuan Lei, Xiao Qin, Berthold Reinwald

Semantic schema alignment aims to match elements across a pair of schemas based on their semantic representation. It is a key primitive for data integration that facilitates the creation of a common data fabric across heterogeneous data sources. Deep learning approaches such as graph representation learning have shown promise for effective alignment of semantically rich schemas, often captured as ontologies. Most of these approaches are supervised and require large amounts of labeled training data, which is expensive in terms of cost and manual labor. Active learning (AL) techniques can alleviate this issue by intelligently choosing the data to be labeled utilizing a human-in-the-loop approach, while minimizing the amount of labeled training data required. However, existing active learning techniques are limited in their ability to utilize the rich semantic information from underlying schemas. Therefore, they cannot drive effective and efficient sample selection for human labeling that is necessary to scale to larger datasets. In this paper, we propose Alfa, an active learning framework to overcome these limitations. Alfa exploits the schema element properties as well as the relationships between schema elements (structure) to drive a novel ontology-aware sample selection and label propagation algorithm for training highly accurate alignment models. We propose semantic blocking to scale to larger datasets without compromising model quality. Our experimental results across three real-world datasets show that (1) Alfa leads to a substantial reduction (27–82%) in the cost of human labeling, (2) semantic blocking reduces label skew by up to 40× without adversely affecting model quality and scales AL to large datasets, and (3) sample selection achieves comparable schema matching quality (90% F1-score) to models trained on the entire set of available training data. We also show that Alfa outperforms the state-of-the-art ontology alignment system, BERTMap, in terms of (1) 10× shorter time per AL iteration and (2) requiring half of the AL iterations to achieve the highest convergent F1-score.

Citations: 0
AutoML in heavily constrained applications
Pub Date : 2023-11-17 DOI: 10.1007/s00778-023-00820-1
Felix Neutatz, Marius Lindauer, Ziawasch Abedjan

Optimizing a machine learning pipeline for a task at hand requires careful configuration of various hyperparameters, typically supported by an AutoML system that optimizes the hyperparameters for the given training dataset. Yet, depending on the AutoML system’s own second-order meta-configuration, the performance of the AutoML process can vary significantly. Current AutoML systems cannot automatically adapt their own configuration to a specific use case. Further, they cannot compile user-defined application constraints on the effectiveness and efficiency of the pipeline and its generation. In this paper, we propose Caml, which uses meta-learning to automatically adapt its own AutoML parameters, such as the search strategy, the validation strategy, and the search space, for a task at hand. The dynamic AutoML strategy of Caml takes user-defined constraints into account and obtains constraint-satisfying pipelines with high predictive performance.

Citations: 0
Efficient and robust active learning methods for interactive database exploration
Pub Date : 2023-11-16 DOI: 10.1007/s00778-023-00816-x
Enhui Huang, Yanlei Diao, Anna Liu, Liping Peng, Luciano Di Palma

There is an increasing gap between the fast growth of data and the limited human ability to comprehend data. Consequently, there has been a growing demand for data management tools that can bridge this gap and help the user retrieve high-value content from data more effectively. In this work, we propose an interactive data exploration system as a new database service, using an approach called “explore-by-example.” Our new system is designed to assist the user in performing highly effective data exploration while reducing the human effort in the process. We cast the explore-by-example problem in a principled “active learning” framework. However, traditional active learning suffers from two fundamental limitations: slow convergence and lack of robustness under label noise. To overcome the slow convergence and label noise problems, we bring the properties of important classes of database queries to bear on the design of new algorithms and optimizations for active learning-based database exploration. Evaluation results using real-world datasets and user interest patterns show that our new system, both in the noise-free case and in the label noise case, significantly outperforms state-of-the-art active learning techniques and data exploration systems in accuracy while achieving the desired efficiency for interactive data exploration.
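
To make the explore-by-example idea concrete, here is a toy sketch of an active learning loop in which the system repeatedly asks the user to label the single most informative tuple. It is not the paper's algorithm: it assumes a one-dimensional attribute, a noise-free user, and a user interest of the form `value >= threshold`, and the names `learn_threshold` and `oracle` are hypothetical. It also assumes the smallest value is known to be uninteresting and the largest is known to be interesting.

```python
def learn_threshold(pool, oracle):
    """Find the smallest interesting value while asking for as few labels as possible."""
    values = sorted(pool)
    # Seed assumption: values[lo] is a known negative, values[hi] a known positive.
    lo, hi = 0, len(values) - 1
    queries = 0
    while hi - lo > 1:
        # The midpoint of the unresolved range is the most uncertain example.
        mid = (lo + hi) // 2
        queries += 1
        if oracle(values[mid]):
            hi = mid  # user marked it interesting: boundary is at or below mid
        else:
            lo = mid  # user marked it uninteresting: boundary is above mid
    return values[hi], queries


if __name__ == "__main__":
    boundary, asked = learn_threshold(range(100), lambda v: v >= 37)
    print(boundary, asked)  # prints "37 6"
```

Exploiting the known shape of the query (here, a simple threshold) is what lets the loop converge in a handful of labels instead of labeling the whole pool; with noisy labels this naive version would fail, which is the robustness problem the abstract highlights.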

Citations: 0
Supporting secure dynamic alert zones using searchable encryption and graph embedding
Pub Date : 2023-07-18 DOI: 10.1007/s00778-023-00803-2
Sina Shaham, Gabriel Ghinita, Cyrus Shahabi

Location-based alerts have gained increasing popularity in recent years, whether in the context of healthcare (e.g., COVID-19 contact tracing), marketing (e.g., location-based advertising), or public safety. However, serious privacy concerns arise when location data are used in the clear in the process. Several solutions employ searchable encryption (SE) to achieve secure alerts directly on encrypted locations. While doing so preserves privacy, the performance overhead incurred is high. We focus on a prominent SE technique in the public-key setting, hidden vector encryption, and propose a graph embedding technique to encode location data in a way that significantly boosts the performance of processing on ciphertexts. We show that the optimal encoding is NP-hard, and we provide three heuristics that obtain significant performance gains: gray optimizer, multi-seed gray optimizer and scaled gray optimizer. Furthermore, we investigate the more challenging case of dynamic alert zones, where the area of interest changes over time. Our extensive experimental evaluation shows that our solutions can significantly reduce computational overhead compared to existing baselines.
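
The heuristic names suggest a connection to Gray codes; assuming that reading, the sketch below shows why the choice of cell encoding matters when alert zones are expressed as bit patterns with wildcards, which is the kind of predicate hidden vector encryption can evaluate. This is an illustration only, not the paper's optimizers, and the helper names `gray` and `merge_pattern` are hypothetical.

```python
def gray(i: int) -> int:
    """Binary-reflected Gray code: consecutive integers differ in exactly one bit."""
    return i ^ (i >> 1)


def merge_pattern(a: int, b: int, width: int) -> str:
    """Smallest wildcard pattern ('*' = any bit) that matches both codes."""
    bits_a = format(a, f"0{width}b")
    bits_b = format(b, f"0{width}b")
    return "".join(x if x == y else "*" for x, y in zip(bits_a, bits_b))


if __name__ == "__main__":
    # Alert zone made of neighbouring cells 3 and 4 on a line of 8 cells.
    print(merge_pattern(gray(3), gray(4), 3))  # prints "*10"
    print(merge_pattern(3, 4, 3))              # prints "***"
```

With the Gray encoding the two neighbouring cells collapse into one pattern that matches only those two cells, whereas with plain binary the merged pattern matches every cell, so the zone would need separate patterns per cell. Fewer patterns generally means fewer expensive operations on ciphertexts.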

Citations: 0
Time series indexing by dynamic covering with cross-range constraints
Pub Date : 2020-05-28 DOI: 10.1007/s00778-020-00614-9
Tao Sun, Hongbo Liu, S. McLoone, Shaoxiong Ji, Xindong Wu
Pages: 1365–1384
Citations: 3
A game-based framework for crowdsourced data labeling
Pub Date : 2020-05-19 DOI: 10.1007/s00778-020-00613-w
Jingru Yang, Ju Fan, Zhewei Wei, Guoliang Li, Tongyu Liu, Xiaoyong Du
Pages: 1311–1336
Citations: 7