Advances in database technology : proceedings. International Conference on Extending Database Technology: Latest Articles

Fair Spatial Indexing: A paradigm for Group Spatial Fairness.
Sina Shaham, Gabriel Ghinita, Cyrus Shahabi

Machine learning (ML) is playing an increasing role in decision-making tasks that directly affect individuals, e.g., loan approvals, or job applicant screening. Significant concerns arise that, without special provisions, individuals from under-privileged backgrounds may not get equitable access to services and opportunities. Existing research studies fairness with respect to protected attributes such as gender, race or income, but the impact of location data on fairness has been largely overlooked. With the widespread adoption of mobile apps, geospatial attributes are increasingly used in ML, and their potential to introduce unfair bias is significant, given their high correlation with protected attributes. We propose techniques to mitigate location bias in machine learning. Specifically, we consider the issue of miscalibration when dealing with geospatial attributes. We focus on spatial group fairness and we propose a spatial indexing algorithm that accounts for fairness. Our KD-tree inspired approach significantly improves fairness while maintaining high learning accuracy, as shown by extensive experimental results on real data.

DOI: 10.48786/edbt.2024.14 | Volume: 27 2 | Pages: 150-161 | Published: 2024-01-01 | Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11531788/pdf/
Citations: 0
Computing Generic Abstractions from Application Datasets
Nelly Barret, I. Manolescu, P. Upadhyay
Digital data plays a central role in sciences, journalism, environment, digital humanities, etc. Open Data sharing initiatives lead to many large, interesting datasets being shared online. Some of these are RDF graphs, but other formats like CSV, relational, property graphs, JSON or XML documents are also frequent. Practitioners need to understand a dataset to decide whether it is suited to their needs. Datasets may come with a schema and/or may be summarized; however, the former is not always provided and the latter is often too technical for non-IT users. To overcome these limitations, we present an end-to-end dataset abstraction approach, which (i) applies to any (semi)structured data model; (ii) computes a description meant for human users, in the form of an Entity-Relationship diagram; (iii) integrates Information Extraction and data profiling to classify dataset entities among a large set of intelligible categories. We implemented our approach in a system called Abstra, and detail its performance on various datasets.
DOI: 10.48786/edbt.2024.09 | Volume: 14 1 | Pages: 94-107 | Published: 2024-01-01
Citations: 1
Data Coverage for Detecting Representation Bias in Image Datasets: A Crowdsourcing Approach
Melika Mousavi, N. Shahbazi, Abolfazl Asudeh
Existing machine learning models have proven to fail when it comes to their performance for minority groups, mainly due to biases in data. In particular, datasets, especially social data, are often not representative of minorities. In this paper, we consider the problem of representation bias identification on image datasets without explicit attribute values. Using the notion of data coverage for detecting a lack of representation, we develop multiple crowdsourcing approaches. Our core approach, at a high level, is a divide and conquer algorithm that applies a search space pruning strategy to efficiently identify if a dataset misses proper coverage for a given group. We provide a different theoretical analysis of our algorithm, including a tight upper bound on its performance which guarantees its near-optimality. Using this algorithm as the core, we propose multiple heuristics to reduce the coverage detection cost across different cases with multiple intersectional/non-intersectional groups. We demonstrate how the pre-trained predictors are not reliable and hence not sufficient for detecting representation bias in the data. Finally, we adjust our core algorithm to utilize existing models for predicting image group(s) to minimize the coverage identification cost. We conduct extensive experiments, including live experiments on Amazon Mechanical Turk to validate our problem and evaluate our algorithms' performance.
DOI: 10.48550/arXiv.2306.13868 | Volume: 3 9 1 | Pages: 47-60 | Published: 2023-06-24
Citations: 0
Auditing for Spatial Fairness
Dimitris Sacharidis, G. Giannopoulos, George Papastefanatos, K. Stefanidis
This paper studies algorithmic fairness when the protected attribute is location. To handle protected attributes that are continuous, such as age or income, the standard approach is to discretize the domain into predefined groups, and compare algorithmic outcomes across groups. However, applying this idea to location raises concerns of gerrymandering and may introduce statistical bias. Prior work addresses these concerns but only for regularly spaced locations, while raising other issues, most notably its inability to discern regions that are likely to exhibit spatial unfairness. Similar to established notions of algorithmic fairness, we define spatial fairness as the statistical independence of outcomes from location. This translates into requiring that for each region of space, the distribution of outcomes is identical inside and outside the region. To allow for localized discrepancies in the distribution of outcomes, we compare how well two competing hypotheses explain the observed outcomes. The null hypothesis assumes spatial fairness, while the alternate allows different distributions inside and outside regions. Their goodness of fit is then assessed by a likelihood ratio test. If there is no significant difference in how well the two hypotheses explain the observed outcomes, we conclude that the algorithm is spatially fair.
DOI: 10.48550/arXiv.2302.12333 | Volume: 35 1 | Pages: 485-491 | Published: 2023-02-23
Citations: 1
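The likelihood ratio test mentioned in the "Auditing for Spatial Fairness" abstract can be illustrated for a single region with binary outcomes. The sketch below is not the authors' implementation: the function names, the Bernoulli outcome model, and the comparison against the chi-square critical value 3.841 (one degree of freedom, 5% level) are all assumptions made for illustration. Auditing many candidate regions at once would need its own significance calibration, which is not shown.

```python
import math


def _xlogy(x, y):
    """Return x * log(y), treating 0 * log(0) as 0."""
    return 0.0 if x == 0 else x * math.log(y)


def bernoulli_loglik(positives, total, p):
    """Log-likelihood of `positives` successes in `total` trials at rate p."""
    return _xlogy(positives, p) + _xlogy(total - positives, 1.0 - p)


def region_lrt(pos_in, n_in, pos_out, n_out):
    """Likelihood ratio statistic for one region (hypothetical helper).

    Null hypothesis: one positive rate everywhere (spatial fairness).
    Alternate hypothesis: separate rates inside and outside the region.
    Assumes n_in > 0 and n_out > 0.
    """
    p_all = (pos_in + pos_out) / (n_in + n_out)
    ll_null = (bernoulli_loglik(pos_in, n_in, p_all)
               + bernoulli_loglik(pos_out, n_out, p_all))
    ll_alt = (bernoulli_loglik(pos_in, n_in, pos_in / n_in)
              + bernoulli_loglik(pos_out, n_out, pos_out / n_out))
    return 2.0 * (ll_alt - ll_null)


# Equal rates inside and outside: the two hypotheses fit equally well.
fair_stat = region_lrt(30, 100, 300, 1000)
# 10% positives inside versus 50% outside: the alternate fits much better.
unfair_stat = region_lrt(10, 100, 500, 1000)
```

A statistic near zero means the alternate hypothesis explains the data no better than the null. A large value flags the region as a candidate for spatial unfairness.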
TransEdge: Supporting Efficient Read Queries Across Untrusted Edge Nodes
Abhishek A. Singh, Aasim Khan, S. Mehrotra, Faisal Nawab
We propose Transactional Edge (TransEdge), a distributed transaction processing system for untrusted environments such as edge computing systems. What distinguishes TransEdge is its focus on efficient support for read-only transactions. TransEdge allows reading from different partitions consistently using one round in most cases and no more than two rounds in the worst case. TransEdge's design is centered around this dependency tracking scheme, including the consensus and transaction processing protocols. Our performance evaluation shows that TransEdge's snapshot read-only transactions achieve a 9-24x speedup compared to current Byzantine systems.
DOI: 10.48550/arXiv.2302.08019 | Volume: 22 1 | Pages: 684-696 | Published: 2023-02-16
Citations: 0
Implementing and Evaluating E2LSH on Storage
Yuuichi Nakanishi, Kazuhiro Hiwada, Yosuke Bando, Tomoya Suzuki, H. Kajihara, Shintarou Sano, Tatsuro Endo, Tatsuo Shiozawa
Locality sensitive hashing (LSH) is one of the widely-used approaches to approximate nearest neighbor search (ANNS) in high-dimensional spaces. The first work on LSH for the Euclidean distance, E2LSH, showed how ANNS can be solved efficiently at a sublinear query time in the database size with theoretically-guaranteed accuracy, although it required a large hash index size. Since then, several LSH variants having much smaller index sizes have been proposed. Their query time is linear or superlinear, but they have been shown to run effectively faster because they require fewer I/Os when the index is stored on hard disk drives and because they also permit in-memory execution with modern DRAM capacity. In this paper, we show that E2LSH is regaining the advantage in query speed with the advent of modern flash storage devices such as solid-state drives (SSDs). We evaluate E2LSH on a modern single-node computing environment and analyze its computational cost and I/O cost, from which we derive storage performance requirements for its external memory execution. Our analysis indicates that E2LSH on a single consumer-grade SSD can run faster than the state-of-the-art small-index methods executed in-memory. It also indicates that E2LSH with emerging high-performance storage devices and interfaces can approach in-memory E2LSH speeds. We implement a simple adaptation of E2LSH to external memory, E2LSH-on-Storage (E2LSHoS), and evaluate it for practical large datasets of up to one billion objects using different combinations of modern storage devices and interfaces. We demonstrate that our E2LSHoS implementation runs much faster than small-index methods and can approach in-memory E2LSH speeds, and also that its query time scales sublinearly with the database size beyond the index size limit of in-memory E2LSH.
DOI: 10.48786/edbt.2023.35 | Volume: 22 1 | Pages: 437-449 | Published: 2023-01-01
Citations: 0
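For readers unfamiliar with E2LSH, the toy index below sketches the commonly described hash family for Euclidean distance, h(v) = floor((a·v + b) / w), with several hash tables that each concatenate a few such hashes. It is an in-memory illustration only and is not the E2LSHoS implementation from the abstract above. The class name, parameter defaults, and data layout are hypothetical. In an external-memory variant, each bucket lookup would correspond to a storage read.

```python
import math
import random
from collections import defaultdict


class E2LSHIndex:
    """Toy in-memory LSH index for Euclidean distance (illustrative only)."""

    def __init__(self, dim, num_tables=8, hashes_per_table=4, w=4.0, seed=0):
        rng = random.Random(seed)
        self.w = w
        # Per table: a list of (a, b) pairs with a ~ N(0, 1)^dim, b ~ U[0, w).
        self.funcs = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
             for _ in range(hashes_per_table)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.points = []

    def _key(self, table_funcs, v):
        # Concatenate the quantized projections into one bucket key.
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.w)
            for a, b in table_funcs
        )

    def add(self, v):
        """Insert a vector and return its integer id."""
        idx = len(self.points)
        self.points.append(v)
        for funcs, table in zip(self.funcs, self.tables):
            table[self._key(funcs, v)].append(idx)
        return idx

    def query(self, q):
        """Return ids of colliding points, nearest first by exact distance."""
        candidates = set()
        for funcs, table in zip(self.funcs, self.tables):
            candidates.update(table.get(self._key(funcs, q), ()))
        return sorted(candidates, key=lambda i: math.dist(self.points[i], q))


index = E2LSHIndex(dim=3)
for point in ([0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [50.0, 50.0, 50.0]):
    index.add(point)
nearest = index.query([0.0, 0.0, 0.0])
```

Only points that share a bucket with the query in at least one table are compared exactly. This is where the sublinear query time comes from, and the multiple tables are the reason for the large index size the abstract mentions.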
An Efficient Approach for Indoor Facility Location Selection
Yeasir Rayhan, T. Hashem, M. A. Cheema, Hua Lu, Mohammed Eunus Ali
The advancement of indoor location-aware technologies enables a wide range of location-based services in indoor spaces. In this paper, we formulate a novel Indoor Facility Location Selection (IFLS) query that finds the optimal location for placing a new facility (e.g., a coffee station) in an indoor venue (e.g., a university building) such that the maximum distance of all clients (e.g., staff/students) to their nearest facility is minimized. To the best of our knowledge, we are the first to address this problem in an indoor setting. We first adapt the state-of-the-art solution in road networks for indoor settings, which exposes the limitations of existing approaches to solve our problem in an indoor space. Therefore, we propose an efficient approach which prunes the search space in terms of the number of clients considered, and the total number of facilities retrieved from the database, thus reducing the total number of indoor distance calculations required. The key idea of our approach is to use a single pass on a state-of-the-art index for an indoor space, and reuse the nearest neighbor computation of clients to prune irrelevant facilities and clients. We evaluate the performance of both approaches on four indoor datasets. Our approach achieves a speedup from 2.84× to 71.29× for synthetic data and 97.74× for real data over the baseline.
DOI: 10.48786/edbt.2023.53 | Volume: 33 1 | Pages: 632-644 | Published: 2023-01-01
Citations: 0
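The objective of the IFLS query in the abstract above (minimize the maximum distance from any client to its nearest facility) can be stated as a brute-force baseline. This sketch is not the paper's pruning algorithm. It assumes plain Euclidean distance in place of indoor distance and a finite list of candidate locations, and the function names are hypothetical.

```python
import math


def minimax_cost(clients, facilities, candidate):
    """Largest client-to-nearest-facility distance if `candidate` is added."""
    sites = list(facilities) + [candidate]
    return max(min(math.dist(c, s) for s in sites) for c in clients)


def select_location(clients, facilities, candidates):
    """Return the candidate that minimizes the minimax cost (exhaustive scan)."""
    return min(candidates,
               key=lambda cand: minimax_cost(clients, facilities, cand))


clients = [(0, 0), (10, 0), (10, 2)]
facilities = [(0, 0)]
candidates = [(10, 1), (5, 0), (2, 0)]
best = select_location(clients, facilities, candidates)
```

The exhaustive scan computes one distance per client, site, and candidate. That is the cost the abstract's approach aims to reduce by pruning clients and facilities.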
A simplified Architecture for Fast, Adaptive Compilation and Execution of SQL Queries
Immanuel Haffner, J. Dittrich
Query compilation is crucial to efficiently execute query plans. In the past decade, we have witnessed considerable progress in this field, including compilation with LLVM, adaptively switching from interpretation to compiled code, as well as adaptively switching from non-optimized to optimized code. All of these ideas aim to reduce latency and/or increase throughput. However, these approaches require immense engineering effort, a considerable part of which includes reengineering very fundamental techniques from the compiler construction community, like register allocation or machine code generation – techniques studied in this field for decades. In this paper, we argue
DOI: 10.48786/edbt.2023.01 | Volume: 1 1 | Pages: 1-13 | Published: 2023-01-01
Citations: 6
Supporting Complex Query Time Enrichment For Analytics
Dhrubajyoti Ghosh, Peeyush Gupta, S. Mehrotra, Shantanu Sharma
Several application domains require data to be enriched prior to its use. Data enrichment is often performed using expensive machine learning models to interpret low-level data (e.g., models for face detection) into semantically meaningful observations. Collecting and enriching data offline before loading it to a database is infeasible if one desires online analysis on data as it arrives. Enriching data on the fly at insertion could result in redundant work (if applications require only a fraction of the data to be enriched) and could result in a bottleneck (if enrichment functions are expensive). Any scalable solution requires enrichment during query processing. This paper explores two different architectures for integrating enrichment into query processing – a loosely coupled approach wherein enrichment is performed outside of the DBMS and a tightly coupled approach wherein it is performed within the DBMS. The paper addresses the challenges of increased query latency due to query time enrichment.
DOI: https://doi.org/10.48786/edbt.2023.08; pages 92-104; published 2023-01-01
Citations: 1
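The idea of enriching only at query time, described in the abstract above, can be illustrated with a minimal sketch. This is not the paper's loosely or tightly coupled architecture. It only shows the underlying principle, assuming that cheap predicates run first and that the expensive enrichment function is invoked lazily and cached. The names `LazyEnricher`, `run_query`, and the row layout are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


class LazyEnricher:
    """Wraps an expensive enrichment function and caches results by row id.

    Assumption: every row carries a unique "id" key (hypothetical layout).
    """

    def __init__(self, enrich_fn: Callable[[Row], object]) -> None:
        self._enrich_fn = enrich_fn
        self._cache: Dict[object, object] = {}
        self.calls = 0  # number of times the expensive function actually ran

    def get(self, row: Row) -> object:
        key = row["id"]
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._enrich_fn(row)
        return self._cache[key]


def run_query(
    rows: Iterable[Row],
    cheap_pred: Callable[[Row], bool],
    enricher: LazyEnricher,
    enriched_pred: Callable[[object], bool],
) -> List[Row]:
    """Filter on raw attributes first, then enrich only the surviving rows."""
    result = []
    for row in rows:
        if not cheap_pred(row):
            continue  # never pay for enrichment on rows the query discards
        if enriched_pred(enricher.get(row)):
            result.append(row)
    return result
```

With ten rows of which five pass the cheap predicate, the enrichment function runs five times on the first query and not at all when the same query is repeated, because results are cached. Eager enrichment at insertion would have run it ten times regardless of what the query needs.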
FELIP: A local Differentially Private approach to frequency estimation on multidimensional datasets
José S. Costa Filho, Javam C. Machado
Local Differential Privacy (LDP) allows answering queries on users' data while maintaining their privacy. Queries are often issued on multidimensional datasets with categorical and numeric dimensions. In this paper, we tackle the problem of answering counting queries over such datasets under LDP. In the setting without a trusted central agent, the user's private dimensions are first perturbed locally to preserve privacy and then sent to an aggregator, who can estimate answers to queries. We build our approach on the existing idea of using grids: users' dimensions are mapped into grids, which are perturbed and sent to the aggregator so that it can estimate the real data distributions and answer different queries on the dimensions collected. Finer-grained grids lead to greater error due to noise, while coarser-grained ones result in greater error due to bias. We propose optimizing the construction of grids, taking into consideration a number of different factors, to obtain better accuracy. We also propose to adaptively select the LDP algorithm that, based on the grid characteristics, provides the better utility. We conduct experiments on real and synthetic datasets and compare our solution with existing approaches.
DOI: https://doi.org/10.48786/edbt.2023.56; pages 671-683; published 2023-01-01
Citations: 3
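The grid-based pipeline in the FELIP abstract above (map a user's values to a grid cell, perturb locally, estimate frequencies at the aggregator) can be sketched as follows. The abstract does not name the LDP protocols FELIP selects among, so this sketch assumes generalized randomized response (GRR), a standard LDP frequency oracle, over a uniform two-dimensional grid. The function names `cell_of`, `grr_perturb`, and `grr_estimate` are hypothetical, and the grid optimization the paper proposes is not modeled here.

```python
import math
import random
from collections import Counter
from typing import List


def cell_of(x: float, y: float, cells_per_dim: int) -> int:
    """Map a point in [0, 1) x [0, 1) to a flat grid cell index."""
    cx = min(int(x * cells_per_dim), cells_per_dim - 1)
    cy = min(int(y * cells_per_dim), cells_per_dim - 1)
    return cx * cells_per_dim + cy


def grr_perturb(cell: int, k: int, epsilon: float, rng: random.Random) -> int:
    """User side: report the true cell with probability p, else a uniform other cell."""
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return cell
    other = rng.randrange(k - 1)  # pick among the k - 1 cells that are not `cell`
    return other if other < cell else other + 1


def grr_estimate(reports: List[int], k: int, epsilon: float) -> List[float]:
    """Aggregator side: unbiased per-cell count estimates from perturbed reports."""
    n = len(reports)
    e = math.exp(epsilon)
    p = e / (e + k - 1)  # probability of reporting the true cell
    q = 1.0 / (e + k - 1)  # probability of reporting any one specific other cell
    counts = Counter(reports)
    # E[count_c] = n_c * p + (n - n_c) * q, solved for n_c.
    return [(counts.get(c, 0) - n * q) / (p - q) for c in range(k)]
```

A counting query over a rectangular range would then be answered by summing the estimates of the cells it covers. The trade-off stated in the abstract is visible in the formulas: more cells (larger `k`) lowers `p` and raises the variance of each estimate, while fewer cells force the aggregator to assume uniformity inside large cells, which introduces bias.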