自动分类树构建：硬度界限和算法

IF 1.7 2区计算机科学 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS ACM Transactions on Database Systems Pub Date : 2024-05-09 DOI:10.1145/3664283

Shay Gershtein, Uri Avron, Ido Guy, Tova Milo, Slava Novgorodov

{"title":"自动分类树构建：硬度界限和算法","authors":"Shay Gershtein, Uri Avron, Ido Guy, Tova Milo, Slava Novgorodov","doi":"10.1145/3664283","DOIUrl":null,"url":null,"abstract":"Category trees, or taxonomies, are rooted trees where each node, called a category, corresponds to a set of related items. The construction of taxonomies has been studied in various domains, including e-commerce, document management, and question answering. Multiple algorithms for automating construction have been proposed, employing a variety of clustering approaches and crowdsourcing. However, no formal model to capture such categorization problems has been devised, and their complexity has not been studied. To address this, we propose in this work a combinatorial model that captures many practical settings and show that the aforementioned empirical approach has been warranted, as we prove strong inapproximability bounds for various problem variants and special cases when the goal is to produce a categorization of the maximum utility. In our model, the input is a set of n weighted item sets that the tree would ideally contain as categories. Each category, rather than perfectly match the corresponding input set, is allowed to exceed a given threshold for a given similarity function. The goal is to produce a tree that maximizes the total weight of the sets for which it contains a matching category. A key parameter is an upper bound on the number of categories an item may belong to, which produces the hardness of the problem, as initially each item may be contained in an arbitrary number of input sets. For this model, we prove inapproximability bounds, of order \\(\\tilde{\\Theta }(\\sqrt {n}) \\) or \\(\\tilde{\\Theta }(n) \\), for various problem variants and special cases, loosely justifying the aforementioned heuristic approach. Our work includes reductions based on parameterized randomized constructions that highlight how various problem parameters and properties of the input may affect the hardness. Moreover, for the special case where the category must be identical to the corresponding input set, we devise an algorithm whose approximation guarantee depends solely on a more granular parameter, allowing improved worst-case guarantees, as well as the application of practical exact solvers. We further provide efficient algorithms with much improved approximation guarantees for practical special cases where the cardinalities of the input sets or the number of input sets each items belongs to are not too large. Finally, we also generalize our results to DAG-based and non-hierarchical categorization.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"2016 1","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Automated Category Tree Construction: Hardness Bounds and Algorithms\",\"authors\":\"Shay Gershtein, Uri Avron, Ido Guy, Tova Milo, Slava Novgorodov\",\"doi\":\"10.1145/3664283\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Category trees, or taxonomies, are rooted trees where each node, called a category, corresponds to a set of related items. The construction of taxonomies has been studied in various domains, including e-commerce, document management, and question answering. Multiple algorithms for automating construction have been proposed, employing a variety of clustering approaches and crowdsourcing. However, no formal model to capture such categorization problems has been devised, and their complexity has not been studied. To address this, we propose in this work a combinatorial model that captures many practical settings and show that the aforementioned empirical approach has been warranted, as we prove strong inapproximability bounds for various problem variants and special cases when the goal is to produce a categorization of the maximum utility. In our model, the input is a set of n weighted item sets that the tree would ideally contain as categories. Each category, rather than perfectly match the corresponding input set, is allowed to exceed a given threshold for a given similarity function. The goal is to produce a tree that maximizes the total weight of the sets for which it contains a matching category. A key parameter is an upper bound on the number of categories an item may belong to, which produces the hardness of the problem, as initially each item may be contained in an arbitrary number of input sets. For this model, we prove inapproximability bounds, of order \\\\(\\\\tilde{\\\\Theta }(\\\\sqrt {n}) \\\\) or \\\\(\\\\tilde{\\\\Theta }(n) \\\\), for various problem variants and special cases, loosely justifying the aforementioned heuristic approach. Our work includes reductions based on parameterized randomized constructions that highlight how various problem parameters and properties of the input may affect the hardness. Moreover, for the special case where the category must be identical to the corresponding input set, we devise an algorithm whose approximation guarantee depends solely on a more granular parameter, allowing improved worst-case guarantees, as well as the application of practical exact solvers. We further provide efficient algorithms with much improved approximation guarantees for practical special cases where the cardinalities of the input sets or the number of input sets each items belongs to are not too large. Finally, we also generalize our results to DAG-based and non-hierarchical categorization.\",\"PeriodicalId\":50915,\"journal\":{\"name\":\"ACM Transactions on Database Systems\",\"volume\":\"2016 1\",\"pages\":\"\"},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2024-05-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Database Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3664283\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Database Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3664283","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

分类树或分类法是有根的树，其中每个节点（称为类别）对应一组相关项目。分类法的构建已在多个领域得到研究，包括电子商务、文档管理和问题解答。目前已经提出了多种自动构建算法，其中包括各种聚类方法和众包方法。然而，目前还没有设计出捕捉此类分类问题的正式模型，也没有对其复杂性进行过研究。为了解决这个问题，我们在这项工作中提出了一个组合模型，它能捕捉到许多实际情况，并证明上述经验方法是有道理的，因为当目标是产生最大效用的分类时，我们证明了各种问题变体和特例的强不可逼近性边界。在我们的模型中，输入是一组 n 个加权项目集，理想情况下，树会将这些项目集作为类别包含在内。每个类别不是完全匹配相应的输入集，而是允许超过给定相似度函数的给定阈值。我们的目标是生成一棵树，使其包含匹配类别的集合的总权重最大化。一个关键参数是一个项目可能属于的类别数量的上限，它决定了问题的难易程度，因为最初每个项目可能包含在任意数量的输入集合中。对于这个模型，我们证明了各种问题变体和特例的不可逼近性边界，阶数为\(\tilde/{Theta }(\sqrt {n}) \)或\(\tilde/{Theta }(n) \)，这在一定程度上证明了上述启发式方法的合理性。我们的工作包括基于参数化随机构造的还原，这些还原突出了各种问题参数和输入属性可能如何影响硬度。此外，对于类别必须与相应输入集完全相同的特殊情况，我们设计了一种算法，其近似保证仅取决于一个更细化的参数，从而改进了最坏情况保证，并应用了实用的精确求解器。我们还进一步提供了高效算法，在输入集的万有引力或每个项所属的输入集数量不太大的实际特殊情况下，其近似保证得到了极大改善。最后，我们还将结果推广到了基于 DAG 的非层次分类。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Automated Category Tree Construction: Hardness Bounds and Algorithms

Category trees, or taxonomies, are rooted trees where each node, called a category, corresponds to a set of related items. The construction of taxonomies has been studied in various domains, including e-commerce, document management, and question answering. Multiple algorithms for automating construction have been proposed, employing a variety of clustering approaches and crowdsourcing. However, no formal model to capture such categorization problems has been devised, and their complexity has not been studied. To address this, we propose in this work a combinatorial model that captures many practical settings and show that the aforementioned empirical approach has been warranted, as we prove strong inapproximability bounds for various problem variants and special cases when the goal is to produce a categorization of the maximum utility.

In our model, the input is a set of n weighted item sets that the tree would ideally contain as categories. Each category, rather than perfectly match the corresponding input set, is allowed to exceed a given threshold for a given similarity function. The goal is to produce a tree that maximizes the total weight of the sets for which it contains a matching category. A key parameter is an upper bound on the number of categories an item may belong to, which produces the hardness of the problem, as initially each item may be contained in an arbitrary number of input sets.

For this model, we prove inapproximability bounds, of order \(\tilde{\Theta }(\sqrt {n}) \) or \(\tilde{\Theta }(n) \), for various problem variants and special cases, loosely justifying the aforementioned heuristic approach. Our work includes reductions based on parameterized randomized constructions that highlight how various problem parameters and properties of the input may affect the hardness. Moreover, for the special case where the category must be identical to the corresponding input set, we devise an algorithm whose approximation guarantee depends solely on a more granular parameter, allowing improved worst-case guarantees, as well as the application of practical exact solvers. We further provide efficient algorithms with much improved approximation guarantees for practical special cases where the cardinalities of the input sets or the number of input sets each items belongs to are not too large. Finally, we also generalize our results to DAG-based and non-hierarchical categorization.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ACM Transactions on Database Systems 工程技术-计算机：软件工程

CiteScore

5.60

自引率

0.00%

发文量

审稿时长

>12 weeks

期刊介绍： Heavily used in both academic and corporate R&D settings, ACM Transactions on Database Systems (TODS) is a key publication for computer scientists working in data abstraction, data modeling, and designing data management systems. Topics include storage and retrieval, transaction management, distributed and federated databases, semantics of data, intelligent databases, and operations and algorithms relating to these areas. In this rapidly changing field, TODS provides insights into the thoughts of the best minds in database R&D.

期刊最新文献

Automated Category Tree Construction: Hardness Bounds and Algorithms Database Repairing with Soft Functional Dependencies Sharing Queries with Nonequivalent User-Defined Aggregate Functions A family of centrality measures for graph data based on subgraphs GraphZeppelin: How to Find Connected Components (Even When Graphs Are Dense, Dynamic, and Massive)