On Finding Rank Regret Representatives

IF 1.7 2区计算机科学 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS ACM Transactions on Database Systems Pub Date : 2022-08-18 DOI:https://dl.acm.org/doi/10.1145/3531054

Abolfazl Asudeh, Gautam Das, H. V. Jagadish, Shangqi Lu, Azade Nazi, Yufei Tao, Nan Zhang, Jianwen Zhao

{"title":"On Finding Rank Regret Representatives","authors":"Abolfazl Asudeh, Gautam Das, H. V. Jagadish, Shangqi Lu, Azade Nazi, Yufei Tao, Nan Zhang, Jianwen Zhao","doi":"https://dl.acm.org/doi/10.1145/3531054","DOIUrl":null,"url":null,"abstract":"Selecting the best items in a dataset is a common task in data exploration. However, the concept of “best” lies in the eyes of the beholder: Different users may consider different attributes more important and, hence, arrive at different rankings. Nevertheless, one can remove “dominated” items and create a “representative” subset of the data, comprising the “best items” in it. A Pareto-optimal representative is guaranteed to contain the best item of each possible ranking, but it can be a large portion of data. A much smaller representative can be found if we relax the requirement of including the best item for each user and instead just limit the users’ “regret.” Existing work defines regret as the loss in score by limiting consideration to the representative instead of the full dataset, for any chosen ranking function.However, the score is often not a meaningful number, and users may not understand its absolute value. Sometimes small ranges in score can include large fractions of the dataset. In contrast, users do understand the notion of rank ordering. Therefore, we consider items’ positions in the ranked list in defining the regret and propose the rank-regret representative as the minimal subset of the data containing at least one of the top-k of any possible ranking function. This problem is polynomial time solvable in two-dimensional space but is NP-hard on three or more dimensions. We design a suite of algorithms to fulfill different purposes, such as whether relaxation is permitted on k, the result size, or both, whether a distribution is known, whether theoretical guarantees or practical efficiency is important, and so on. Experiments on real datasets demonstrate that we can efficiently find small subsets with small rank-regrets.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"9 1-2","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2022-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Database Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/https://dl.acm.org/doi/10.1145/3531054","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Selecting the best items in a dataset is a common task in data exploration. However, the concept of “best” lies in the eyes of the beholder: Different users may consider different attributes more important and, hence, arrive at different rankings. Nevertheless, one can remove “dominated” items and create a “representative” subset of the data, comprising the “best items” in it. A Pareto-optimal representative is guaranteed to contain the best item of each possible ranking, but it can be a large portion of data. A much smaller representative can be found if we relax the requirement of including the best item for each user and instead just limit the users’ “regret.” Existing work defines regret as the loss in score by limiting consideration to the representative instead of the full dataset, for any chosen ranking function.

However, the score is often not a meaningful number, and users may not understand its absolute value. Sometimes small ranges in score can include large fractions of the dataset. In contrast, users do understand the notion of rank ordering. Therefore, we consider items’ positions in the ranked list in defining the regret and propose the rank-regret representative as the minimal subset of the data containing at least one of the top-k of any possible ranking function. This problem is polynomial time solvable in two-dimensional space but is NP-hard on three or more dimensions. We design a suite of algorithms to fulfill different purposes, such as whether relaxation is permitted on k, the result size, or both, whether a distribution is known, whether theoretical guarantees or practical efficiency is important, and so on. Experiments on real datasets demonstrate that we can efficiently find small subsets with small rank-regrets.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

关于寻找等级遗憾代表

在数据集中选择最佳项是数据探索中的常见任务。然而，“最佳”的概念存在于观察者的眼中:不同的用户可能认为不同的属性更重要，因此得出不同的排名。然而，我们可以删除“主导”项目，并创建数据的“代表性”子集，包括其中的“最佳项目”。帕累托最优代表保证包含每个可能排名的最佳项，但它可能是数据的很大一部分。如果我们放宽为每个用户提供最佳商品的要求，而不是仅仅限制用户的“后悔”，那么代表性就会小得多。现有的工作将遗憾定义为对任何选择的排名函数，通过限制对代表性而不是完整数据集的考虑而导致的分数损失。然而，分数通常不是一个有意义的数字，用户可能不理解它的绝对值。有时分数的小范围可以包含数据集的很大一部分。相比之下，用户确实理解排名排序的概念。因此，我们在定义后悔时考虑了项目在排名列表中的位置，并提出了排名-后悔代表作为包含任何可能的排名函数的top-k中至少一个的数据的最小子集。这个问题在二维空间中是多项式时间可解的，但在三维或更多维度上是np困难的。我们设计了一套算法来满足不同的目的，例如是否允许k松弛，结果大小，或两者兼有，是否已知分布，理论保证或实际效率是重要的，等等。在真实数据集上的实验表明，我们可以有效地找到具有小秩遗憾的小子集。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

ACM Transactions on Database Systems 工程技术-计算机：软件工程

CiteScore

5.60

自引率

0.00%

发文量

审稿时长

>12 weeks

期刊介绍： Heavily used in both academic and corporate R&D settings, ACM Transactions on Database Systems (TODS) is a key publication for computer scientists working in data abstraction, data modeling, and designing data management systems. Topics include storage and retrieval, transaction management, distributed and federated databases, semantics of data, intelligent databases, and operations and algorithms relating to these areas. In this rapidly changing field, TODS provides insights into the thoughts of the best minds in database R&D.

期刊最新文献

Automated Category Tree Construction: Hardness Bounds and Algorithms Database Repairing with Soft Functional Dependencies Sharing Queries with Nonequivalent User-Defined Aggregate Functions A family of centrality measures for graph data based on subgraphs GraphZeppelin: How to Find Connected Components (Even When Graphs Are Dense, Dynamic, and Massive)