
Advances in database technology : proceedings. International Conference on Extending Database Technology: latest articles

Efficiently Archiving Photos under Storage Constraints
S. Davidson, Shay Gershtein, T. Milo, Slava Novgorodov, May Shoshan
Our ability to collect data is rapidly outstripping our ability to effectively store and use it. Organizations are therefore facing tough decisions about what data to archive (or dispose of) to effectively meet their business goals. We address this general problem in the context of image data (photos) by proposing which photos to archive to meet an online storage budget. The decision is based on factors such as usage patterns and their relative importance, the quality and size of a photo, the relevance of a photo for a usage pattern, the similarity between different photos, as well as policy requirements on which photos must be retained. We formalize the photo archival problem, analyze its complexity, and give two approximation algorithms. One algorithm comes with an optimal approximation guarantee; the other, more scalable algorithm comes with both worst-case and data-dependent guarantees. Based on these algorithms we implemented an end-to-end system, PHOcus, and discuss how to automatically derive the inputs for this system in many settings. An extensive experimental study based on public as well as private datasets demonstrates the effectiveness and efficiency of PHOcus. Furthermore, a user study using business analysts in a real e-commerce application shows that it can save a tremendous amount of human effort and yield unexpected insights.
DOI: https://doi.org/10.48786/edbt.2023.50. Pages 591-603. Published 2023-01-01.
Citations: 0
Subset Approach to Efficient Skyline Computation
Dominique H. Li
Skyline query processing is essential to the database community. Many algorithms have been designed to perform efficient skyline computation, which can generally be categorized as sorting-based or partitioning-based according to the mechanism used to reduce the number of dominance tests. Sorting-based skyline algorithms first sort all points with respect to a monotone score function, for instance the sum of all values of a point, so that the dominance tests can be bounded by the score function; partitioning-based algorithms create partitions from the dataset so that the dominance tests can be limited to within partitions. On the other hand, the incomparability between points has been considered an important property, that is, if two points are incomparable, then any dominance test between them is unnecessary. In fact, the state-of-the-art skyline algorithms effectively reduce the dominance tests by taking the incomparability into account. In this paper, we present a subset-based approach that makes it possible to integrate subspace-based incomparability into existing sorting-based skyline algorithms and can therefore significantly reduce the total number of dominance tests in large multidimensional datasets. Our theoretical and experimental studies show that the proposed subset approach boosts existing sorting-based skyline algorithms and makes them comparable to the state-of-the-art algorithms and even faster with uniform independent data.
DOI: https://doi.org/10.48786/edbt.2023.31. Pages 391-403. Published 2023-01-01.
Citations: 0
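To make the sorting-based idea in the skyline abstract above concrete, here is a minimal sketch, assuming smaller values are better in every dimension and using the sum of coordinates as the monotone score. It illustrates the general sort-then-filter scheme only; it is not the subset approach proposed in the paper, and the function names are hypothetical.

```python
from typing import List, Sequence

Point = Sequence[float]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q: no worse in every dimension, strictly better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline_sorted(points: List[Point]) -> List[Point]:
    """Sort by a monotone score (the coordinate sum), then filter.

    If p dominates q, then sum(p) < sum(q), so after sorting a point can
    only be dominated by a point that precedes it. By transitivity it is
    enough to test each point against the skyline found so far.
    """
    skyline: List[Point] = []
    for candidate in sorted(points, key=sum):
        if not any(dominates(s, candidate) for s in skyline):
            skyline.append(candidate)
    return skyline


if __name__ == "__main__":
    data = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
    print(skyline_sorted(data))
```

Every call to `dominates` in the inner loop is one dominance test; reducing how many of those tests are executed is the quantity the abstract is concerned with.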
ExtremeEarth: Managing Water Availability for Crops Using Earth Observation and Machine Learning
F. Appel, H. Bach, S. Migdall, Manolis Koubarakis, G. Stamoulis, D. Bilidas, D. Pantazi, L. Bruzzone, C. Paris, Giulio Weikmann
Food security, especially in a changing Earth environment, is one of the most challenging issues of this century. Population growth, increased food consumption and the challenges of climate change will extend over the next decades. To deal with these, both regional and global measures are necessary. Biomass production and thus yield will need to be increased in a sustainable way. It is important to minimize the risks of yield loss even under more extreme environmental conditions, while making sure not to deplete or damage the available resources. Two measures are most important for this: irrigation and fertilization. While fertilization relies mainly on industrial goods, irrigation requires reliable water resources in the area that is being farmed, either from groundwater or surface water. Regarding surface water, a large portion of the world’s fresh-water is linked to snowfall, snow storage and seasonal release of the water. All these components are subject to increased variability due to climate change and the
DOI: https://doi.org/10.48786/edbt.2023.62. Pages 749-756. Published 2023-01-01.
Citations: 1
Balancing Utility and Fairness in Submodular Maximization
Yanhao Wang, Yuchen Li, F. Bonchi, Ying Wang
Submodular function maximization is a fundamental combinatorial optimization problem with plenty of applications, including data summarization, influence maximization, and recommendation. In many of these problems, the goal is to find a solution that maximizes the average utility over all users, for each of whom the utility is defined by a monotone submodular function. However, when the population of users is composed of several demographic groups, another critical problem is whether the utility is fairly distributed across different groups. Although the utility and fairness objectives are both desirable, they might contradict each other, and, to the best of our knowledge, little attention has been paid to optimizing them jointly. To fill this gap, we propose a new problem called Bicriteria Submodular Maximization (BSM) to balance utility and fairness. Specifically, it requires finding a fixed-size solution to maximize the utility function, subject to the value of the fairness function not being below a threshold. Since BSM is inapproximable within any constant factor, we focus on designing efficient instance-dependent approximation schemes. Our algorithmic proposal comprises two methods, with different approximation factors, obtained by converting a BSM instance into other submodular optimization problem instances. Using real-world and synthetic datasets, we showcase applications of our proposed methods in three submodular maximization problems: maximum coverage, influence maximization, and facility location.
DOI: https://doi.org/10.48786/edbt.2024.01. Pages 1-14. Published 2022-11-02.
Citations: 1
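The tension between utility and fairness described in the submodular maximization abstract above can be seen on a tiny maximum-coverage instance. The sketch below uses the classic greedy heuristic for utility alone and then reports coverage per group; it is not one of the paper's BSM methods, and treating per-group coverage as the fairness signal is an assumption made only for illustration.

```python
from typing import Dict, List, Set


def greedy_max_coverage(sets: Dict[str, Set[int]], k: int) -> List[str]:
    """Pick up to k sets, each time taking the one with the largest marginal gain."""
    chosen: List[str] = []
    covered: Set[int] = set()
    for _ in range(k):
        remaining = [name for name in sets if name not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda name: len(sets[name] - covered))
        if not sets[best] - covered:
            break  # no remaining set adds anything
        chosen.append(best)
        covered |= sets[best]
    return chosen


def group_coverage(
    sets: Dict[str, Set[int]], chosen: List[str], groups: Dict[str, Set[int]]
) -> Dict[str, float]:
    """Fraction of each user group covered by the chosen sets (illustrative fairness signal)."""
    covered: Set[int] = set()
    for name in chosen:
        covered |= sets[name]
    return {g: len(members & covered) / len(members) for g, members in groups.items()}


if __name__ == "__main__":
    candidate_sets = {"a": {1, 2, 3, 4}, "b": {4, 5}, "c": {6}}
    user_groups = {"g1": {1, 2, 3, 4, 5}, "g2": {6}}
    picked = greedy_max_coverage(candidate_sets, k=2)
    print(picked, group_coverage(candidate_sets, picked, user_groups))
```

In this instance the utility-only greedy choice covers group `g1` completely and group `g2` not at all, which is the kind of outcome a fairness threshold is meant to rule out.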
Integration of Skyline Queries into Spark SQL
Lukas Grasmann, R. Pichler, Alexander Selzer
Skyline queries are frequently used in data analytics and multi-criteria decision support applications to filter relevant information from large amounts of data. Apache Spark is a popular framework for processing big, distributed data. The framework even provides a convenient SQL-like interface via the Spark SQL module. However, skyline queries are not natively supported and require tedious rewriting to fit the SQL standard or Spark's SQL-like language. The goal of our work is to fill this gap. We thus provide a full-fledged integration of the skyline operator into Spark SQL. This allows for a simple, easy-to-use syntax for expressing skyline queries. Moreover, our empirical results show that this integrated solution for skyline queries by far outperforms a solution based on rewriting into standard SQL.
DOI: https://doi.org/10.48550/arXiv.2210.03718. Pages 337-350. Published 2022-10-07.
Citations: 0
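The "tedious rewriting" into standard SQL mentioned in the Spark SQL abstract above typically takes the form of a correlated `NOT EXISTS` subquery. The sketch below shows one such rewrite, run on SQLite from Python's standard library rather than on Spark so that it stays self-contained. The `hotels` table, its columns, and the data are hypothetical, and the paper's own skyline syntax is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotels (name TEXT, price REAL, distance REAL)")
conn.executemany(
    "INSERT INTO hotels VALUES (?, ?, ?)",
    [("A", 50, 8), ("B", 80, 2), ("C", 90, 5), ("D", 120, 1), ("E", 60, 9)],
)

# Skyline over (price MIN, distance MIN) written in plain SQL:
# keep a row only if no other row is at least as good in both
# dimensions and strictly better in at least one.
SKYLINE_QUERY = """
SELECT h.name
FROM hotels AS h
WHERE NOT EXISTS (
    SELECT 1
    FROM hotels AS o
    WHERE o.price <= h.price
      AND o.distance <= h.distance
      AND (o.price < h.price OR o.distance < h.distance)
)
ORDER BY h.name
"""

skyline = [row[0] for row in conn.execute(SKYLINE_QUERY)]
print(skyline)
```

The dominance condition has to be spelled out by hand and grows with every additional dimension, which gives a sense of why a dedicated operator is more convenient.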
Frequency Estimation of Evolving Data Under Local Differential Privacy
Héber H. Arcolezi, Carlos Pinzón, C. Palamidessi, S. Gambs
Collecting and analyzing evolving longitudinal data has become a common practice. One possible approach to protect the users' privacy in this context is to use local differential privacy (LDP) protocols, which ensure the privacy protection of all users even in the case of a breach or data misuse. Existing LDP data collection protocols such as Google's RAPPOR and Microsoft's dBitFlipPM can have longitudinal privacy linear in the domain size $k$, which is excessive for large domains, such as Internet domains. To solve this issue, in this paper we introduce a new LDP data collection protocol for longitudinal frequency monitoring named LOngitudinal LOcal HAshing (LOLOHA) with formal privacy guarantees. In addition, the privacy-utility trade-off of our protocol is only linear with respect to a reduced domain size $2 \leq g \ll k$. LOLOHA combines a domain reduction approach via local hashing with double randomization to minimize the privacy leakage incurred by data updates. As demonstrated by our theoretical analysis as well as our experimental evaluation, LOLOHA achieves utility competitive with current state-of-the-art protocols, while substantially minimizing the longitudinal privacy budget consumption by up to k/g orders of magnitude.
DOI: https://doi.org/10.48550/arXiv.2210.00262. Pages 512-525. Published 2022-10-01.
Citations: 4
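As background for the LDP abstract above, the sketch below shows generalized randomized response (GRR) over a small domain of size g, together with its standard unbiased frequency estimator. It is a single-round building block only: the local hashing step that maps the original domain of size k down to g and the double randomization used by LOLOHA are deliberately left out, and all function names are hypothetical.

```python
import math
import random
from collections import Counter
from typing import List, Tuple


def grr_probabilities(epsilon: float, g: int) -> Tuple[float, float]:
    """Return (p, q): probability of reporting the true value, and of each other value."""
    denom = math.exp(epsilon) + g - 1
    return math.exp(epsilon) / denom, 1.0 / denom


def grr_perturb(value: int, epsilon: float, g: int, rng: random.Random) -> int:
    """Report the true value with probability p, else a uniformly chosen different value."""
    p, _ = grr_probabilities(epsilon, g)
    if rng.random() < p:
        return value
    other = rng.randrange(g - 1)  # uniform over the g - 1 values that differ from `value`
    return other if other < value else other + 1


def grr_estimate(reports: List[int], epsilon: float, g: int) -> List[float]:
    """Unbiased frequency estimates: (observed fraction - q) / (p - q)."""
    p, q = grr_probabilities(epsilon, g)
    counts = Counter(reports)
    n = len(reports)
    return [(counts.get(v, 0) / n - q) / (p - q) for v in range(g)]


if __name__ == "__main__":
    rng = random.Random(42)
    true_values = [0] * 700 + [1] * 200 + [2] * 100
    noisy = [grr_perturb(v, epsilon=2.0, g=4, rng=rng) for v in true_values]
    print([round(f, 3) for f in grr_estimate(noisy, epsilon=2.0, g=4)])
```

Because p/q equals e^epsilon, a single report satisfies epsilon-LDP. Repeating such reports as the underlying value changes over time is what consumes the longitudinal privacy budget discussed in the abstract.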
Offset-value coding in database query processing
G. Graefe, Thanh Do
Recent work shows how offset-value coding speeds up database query execution, not only sorting but also duplicate removal and grouping (aggregation) in sorted streams, order-preserving exchange (shuffle), merge join, and more. It already saves thousands of CPUs in Google's Napa and F1 Query systems, e.g., in grouping algorithms and in log-structured merge-forests. In order to realize the full benefit of interesting orderings, however, query execution algorithms must not only consume and exploit offset-value codes but also produce offset-value codes for the next operator in the pipeline. Our research has sought ways to produce offset-value codes without comparing successive output rows one-by-one, column-by-column. This short paper introduces a new theorem and, based on its proof and a simple corollary, describes in detail how order-preserving algorithms (from filter to merge join and even shuffle) can compute offset-value codes for their outputs. These computations are surprisingly simple and very efficient.
DOI: https://doi.org/10.48550/arXiv.2210.00034. Pages 464-470. Published 2022-09-30.
Citations: 0
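For readers unfamiliar with the term in the offset-value coding abstract above: an offset-value code describes a row in a sorted stream relative to its predecessor, by the position of the first differing column (the offset) and the row's value in that column. The sketch below derives these pairs the naive way, by comparing successive rows column by column, which is exactly the work the paper aims to avoid. How the pair is packed into a single integer is not shown, and the function name is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Row = Sequence[int]


def offset_value_codes(sorted_rows: List[Row]) -> List[Tuple[int, Optional[int]]]:
    """Return (offset, value) for each row relative to the previous row.

    offset = number of leading columns shared with the predecessor
    value  = this row's value in the first differing column
             (None when the row duplicates its predecessor)
    The first row has no predecessor and is coded with offset 0.
    """
    codes: List[Tuple[int, Optional[int]]] = []
    prev: Optional[Row] = None
    for row in sorted_rows:
        offset = 0
        if prev is not None:
            while offset < len(row) and row[offset] == prev[offset]:
                offset += 1
        value = row[offset] if offset < len(row) else None
        codes.append((offset, value))
        prev = row
    return codes


if __name__ == "__main__":
    rows = [(5, 7, 3, 9), (5, 7, 3, 12), (5, 8, 4, 6), (5, 9, 2, 7), (5, 9, 2, 7)]
    for row, code in zip(rows, offset_value_codes(rows)):
        print(row, code)
```

A duplicate row shares all columns with its predecessor, so duplicate removal and grouping can be decided from the offset alone without looking at the column values again.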
bloomRF: On Performing Range-Queries in Bloom-Filters with Piecewise-Monotone Hash Functions and Prefix Hashing
B. Mößner, Christian Riegger, Arthur Bernhardt, Ilia Petrov
We introduce bloomRF as a unified method for approximate membership testing that supports both point- and range-queries. As a first core idea, bloomRF introduces novel prefix hashing to efficiently encode range information in the hash-code of the key itself. As a second key concept, bloomRF proposes novel piecewise-monotone hash-functions that preserve local order and support fast range-lookups with fewer memory accesses. bloomRF has near-optimal space complexity and constant query complexity. Although bloomRF is designed for integer domains, it supports floating-point values and can serve as a multi-attribute filter. The evaluation in RocksDB and in a standalone library shows that it is more efficient and outperforms existing point-range-filters by up to 4x across a range of settings and distributions, while keeping the false-positive rate low.
DOI: https://doi.org/10.48550/arXiv.2207.04789. Pages 131-143. Published 2022-07-11.
Citations: 0
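To illustrate how hashing key prefixes lets a Bloom filter answer range queries, as the bloomRF abstract above outlines, here is a deliberately simple sketch: every key inserts all of its binary prefixes, and a range query is decomposed into aligned dyadic intervals, each of which corresponds to one prefix probe. This is a generic prefix-filter construction written under my own assumptions (class name, parameters, BLAKE2b as the hash); it does not reproduce bloomRF's piecewise-monotone hash functions or its space and query bounds.

```python
import hashlib
from typing import Iterator


class PrefixBloomFilter:
    """Bloom filter over (level, prefix) pairs of unsigned integer keys."""

    def __init__(self, num_bits: int = 1 << 16, key_bits: int = 16, num_hashes: int = 3):
        self.num_bits = num_bits
        self.key_bits = key_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, level: int, prefix: int) -> Iterator[int]:
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{i}:{level}:{prefix}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def _set(self, level: int, prefix: int) -> None:
        for pos in self._positions(level, prefix):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def _test(self, level: int, prefix: int) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(level, prefix))

    def insert(self, key: int) -> None:
        # level = number of low-order bits dropped; level 0 is the key itself
        for level in range(self.key_bits + 1):
            self._set(level, key >> level)

    def may_contain(self, key: int) -> bool:
        return self._test(0, key)

    def may_contain_range(self, lo: int, hi: int) -> bool:
        """True if some inserted key may lie in [lo, hi]; never a false negative."""
        while lo <= hi:
            level = 0
            # grow the interval starting at lo while it stays aligned and inside [lo, hi]
            while (
                level < self.key_bits
                and lo % (1 << (level + 1)) == 0
                and lo + (1 << (level + 1)) - 1 <= hi
            ):
                level += 1
            if self._test(level, lo >> level):
                return True
            lo += 1 << level
        return False


if __name__ == "__main__":
    f = PrefixBloomFilter()
    for k in (1000, 1500):
        f.insert(k)
    print(f.may_contain(1000), f.may_contain_range(900, 1100), f.may_contain_range(2000, 3000))
```

As with any Bloom filter, a positive answer may be a false positive, while a negative answer is definite. In this naive version each insert sets bits for `key_bits + 1` prefixes.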
User Customizable and Robust Geo-Indistinguishability for Location Privacy
Primal Pappachan, Chenxi Qiu, A. Squicciarini, Vishnu Sharma Hunsur Manjunath
Location obfuscation functions generated by existing systems for ensuring location privacy are monolithic and do not allow users to customize their obfuscation range. This can lead to the user being reported to location-requesting services at undesirable locations (e.g., shady neighborhoods). Modifying the obfuscation function generated by a centralized server on the user side can result in poor privacy, as the original function is not robust against such updates. Users themselves might find it challenging to understand the parameters involved in obfuscation mechanisms (e.g., obfuscation range and granularity of location representation) and therefore struggle to set realistic trade-offs between privacy, utility, and customization. In this paper, we propose a new framework called CORGI (CustOmizable Robust Geo-Indistinguishability), which generates location obfuscation functions that are robust against user customization while providing strong privacy guarantees based on the Geo-Indistinguishability paradigm. CORGI utilizes a tree representation of a given region to assist users in specifying their privacy and customization requirements. The server side of CORGI takes these requirements as inputs and generates an obfuscation function that satisfies Geo-Indistinguishability requirements and is robust against customization on the user side. The obfuscation function is returned to the user, who can then choose to update the obfuscation function (e.g., obfuscation range, granularity of location representation). The experimental results on a real dataset demonstrate that CORGI can efficiently generate obfuscation matrices that are more robust to customization by users.
DOI: https://doi.org/10.48786/edbt.2023.55 (published 2022-06-16)
Citations: 1
Unsupervised Space Partitioning for Nearest Neighbor Search
Abrar Fahim, Mohammed Eunus Ali, M. A. Cheema
Approximate Nearest Neighbor Search (ANNS) in high-dimensional spaces is crucial for many real-life applications (e.g., e-commerce, web, and multimedia) that deal with an abundance of data. This paper proposes an end-to-end learning framework that couples the partitioning step (one critical step of ANNS) and the learning-to-search step using a custom loss function. A key advantage of our proposed solution is that it does not require any expensive pre-processing of the dataset, which is one of the critical limitations of the state-of-the-art approach. We achieve this advantage by formulating a multi-objective custom loss function that does not need ground-truth labels to quantify the quality of a given data-space partition, making it entirely unsupervised. We also propose an ensembling technique that adds varying input weights to the loss function to train an ensemble of models and enhance the search quality. On several standard benchmarks for ANNS, we show that our method beats the state-of-the-art space partitioning method and the ubiquitous K-means clustering method while using fewer parameters and shorter offline training times. We also show that incorporating our space-partitioning strategy into state-of-the-art ANNS techniques such as ScaNN can improve their performance significantly. Finally, we present our unsupervised partitioning approach as a promising alternative to many widely used clustering methods, such as K-means clustering and DBSCAN.
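For readers unfamiliar with partition-based ANNS, the following Python sketch shows the general pattern the abstract builds on: partition the dataset, then answer a query by scanning only the partitions closest to it. It uses plain K-means, which the abstract names as a baseline, rather than the paper's learned, loss-driven partitioning. All function names, the toy points, and parameters such as `n_probe` are assumptions made for illustration.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: dist2(p, centroids[j]))

def kmeans(points, k, iters=10):
    """Lloyd's algorithm with a deterministic init (first k points)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest_centroid(p, centroids)].append(p)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a bucket is empty
                centroids[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def build_index(points, centroids):
    """Inverted lists: partition id -> indices of the points assigned to it."""
    index = [[] for _ in centroids]
    for i, p in enumerate(points):
        index[nearest_centroid(p, centroids)].append(i)
    return index

def search(query, points, centroids, index, n_probe=1):
    """Scan only the n_probe partitions whose centroids are closest to query."""
    order = sorted(range(len(centroids)), key=lambda j: dist2(query, centroids[j]))
    candidates = [i for j in order[:n_probe] for i in index[j]]
    if not candidates:
        return None
    return min(candidates, key=lambda i: dist2(query, points[i]))

# Toy data (assumed): two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = kmeans(points, k=2)
index = build_index(points, centroids)

print(search((9, 9), points, centroids, index, n_probe=1))
print(search((0.2, 0.1), points, centroids, index, n_probe=1))
```

Probing fewer partitions makes a query cheaper but can miss the true nearest neighbor when it lies just across a partition boundary. The quality of the partition governs that trade-off, and that quality is what the paper's unsupervised loss is designed to improve.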
DOI: https://doi.org/10.48550/arXiv.2206.08091 (published 2022-06-16)
Citations: 0