
The VLDB Journal: Latest Publications

Efficient algorithms for reachability and path queries on temporal bipartite graphs
Pub Date: 2024-05-23 DOI: 10.1007/s00778-024-00854-z
Kai Wang, Minghao Cai, Xiaoshuang Chen, Xuemin Lin, Wenjie Zhang, Lu Qin, Ying Zhang
Citations: 0
Discovering approximate implicit domain orders through order dependencies
Pub Date: 2024-05-21 DOI: 10.1007/s00778-024-00847-y
Reza Karegar, Melicaalsadat Mirsafian, P. Godfrey, Lukasz Golab, M. Kargar, Divesh Srivastava, Jaroslaw Szlichta
Citations: 0
Similarity-driven and task-driven models for diversity of opinion in crowdsourcing markets
Pub Date: 2024-05-17 DOI: 10.1007/s00778-024-00853-0
Chen Jason Zhang, Yunrui Liu, Pengcheng Zeng, Ting Wu, Lei Chen, Pan Hui, Fei Hao

The recent boom in crowdsourcing has opened up a new avenue for utilizing human intelligence in the realm of data analysis. This innovative approach provides a powerful means of connecting online workers to tasks that cannot be done effectively by machines alone, or by professional experts due to cost constraints. Within the field of social science, four elements are required to construct a sound crowd: Diversity of Opinion, Independence, Decentralization, and Aggregation. However, while the other three components have already been investigated and implemented in existing crowdsourcing platforms, ‘Diversity of Opinion’ has not been functionally enabled yet. From a computational point of view, constructing a wise crowd necessitates quantitatively modeling diversity and taking it into account. There are usually two paradigms for worker selection in a crowdsourcing marketplace: building a crowd to wait for tasks to come, and selecting workers for a given task. We propose similarity-driven and task-driven models for these two paradigms, respectively. We also develop efficient and effective algorithms for recruiting a limited number of workers with optimal diversity under both models. To validate our solutions, we conduct extensive experiments using both synthetic and real datasets.
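The "recruit a limited number of workers with optimal diversity" objective can be illustrated with a generic farthest-point greedy heuristic. This is a hedged sketch, not the paper's similarity-driven or task-driven model: the opinion-vector representation of workers, the Euclidean distance, and the max-min diversity objective are all illustrative assumptions.

```python
# Hypothetical farthest-point greedy heuristic for diverse worker
# selection. NOT the paper's algorithm: worker features, the distance
# function, and the max-min objective are assumptions for illustration.

def greedy_diverse_selection(workers, k, distance):
    """Pick k workers, each time choosing the candidate that maximizes
    the minimum distance to the workers already chosen."""
    if not workers or k <= 0:
        return []
    chosen = [workers[0]]  # arbitrary seed worker
    while len(chosen) < min(k, len(workers)):
        best = max(
            (w for w in workers if w not in chosen),
            key=lambda w: min(distance(w, c) for c in chosen),
        )
        chosen.append(best)
    return chosen

# Workers represented as 2-D opinion vectors, compared by Euclidean distance.
euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
workers = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(greedy_diverse_selection(workers, 3, euclid))
# → [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
```

The heuristic deliberately skips the near-duplicate worker (0.1, 0.0), which is the qualitative behavior a diversity-aware recruiter wants.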

Citations: 0
Efficient and effective algorithms for densest subgraph discovery and maintenance
Pub Date: 2024-05-08 DOI: 10.1007/s00778-024-00855-y
Yichen Xu, Chenhao Ma, Yixiang Fang, Zhifeng Bao

The densest subgraph problem (DSP) is of great significance due to its wide applications in different domains. Meanwhile, diverse requirements in various applications lead to different density variants for DSP. Unfortunately, existing DSP algorithms cannot be easily extended to handle those variants efficiently and accurately. To fill this gap, we first unify different density metrics into a generalized density definition. We further propose a new model, c-core, to locate the general densest subgraph and show its advantage in accelerating the search process. Extensive experiments show that our c-core-based optimization can provide up to three orders of magnitude speedup over baselines. We also design methods for maintaining the c-core location to accelerate updates on dynamic graphs. Moreover, we study an important variant of DSP under a size constraint, namely the densest-at-least-k-subgraph (DalkS) problem. We propose an algorithm based on graph decomposition; in our experiments it is likely to give a solution with density at least 0.8 of the optimum, whereas the state-of-the-art method can only guarantee a solution with density at least 0.5 of the optimum. Our experiments show that our DalkS algorithm can achieve at least 0.99 of the optimal density for over one-third of all possible size constraints. In addition, we develop an approximation algorithm for the DalkS problem that can be more efficient than the state-of-the-art algorithm while keeping the same approximation ratio of 1/3.
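For context on the baseline this abstract improves upon: the classic greedy peeling algorithm (Charikar's 1/2-approximation for the standard |E|/|V| density) repeatedly removes the minimum-degree vertex and keeps the densest intermediate subgraph. The sketch below shows that baseline, not the paper's c-core technique; the edge-list encoding and simple-graph assumption are illustrative.

```python
# Classic greedy peeling for the densest subgraph (average-degree
# density |E|/|V|), a 1/2-approximation. This is the textbook baseline,
# NOT the c-core method of the paper. Assumes a simple undirected graph
# given as an edge list.
from collections import defaultdict

def densest_subgraph_peeling(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    vertices = set(adj)
    m = len(edges)
    best_density, best_set = 0.0, set(vertices)
    while vertices:
        density = m / len(vertices)
        if density > best_density:
            best_density, best_set = density, set(vertices)
        # Peel the minimum-degree vertex and drop its incident edges.
        u = min(vertices, key=lambda x: len(adj[x]))
        m -= len(adj[u])
        for w in adj[u]:
            adj[w].discard(u)
        vertices.discard(u)
        del adj[u]
    return best_set, best_density
```

On a K4 with one pendant vertex attached, peeling correctly strips the pendant and returns the clique {a, b, c, d} at density 6/4 = 1.5.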

Citations: 0
Lero: applying learning-to-rank in query optimizer
Pub Date: 2024-04-25 DOI: 10.1007/s00778-024-00850-3
Xingguang Chen, Rong Zhu, Bolin Ding, Sibo Wang, Jingren Zhou
Citations: 0
Hyper-distance oracles in hypergraphs
Pub Date: 2024-04-19 DOI: 10.1007/s00778-024-00851-2
Giulia Preti, Gianmarco De Francisci Morales, Francesco Bonchi

We study point-to-point distance estimation in hypergraphs, where the query is parameterized by a positive integer s, which defines the required level of overlap for two hyperedges to be considered adjacent. To answer s-distance queries, we first explore an oracle based on the line graph of the given hypergraph and discuss its limitations: the line graph is typically orders of magnitude larger than the original hypergraph. We then introduce HypED, a landmark-based oracle with a predefined size, built directly on the hypergraph, thus avoiding the materialization of the line graph. Our framework approximately answers vertex-to-vertex, vertex-to-hyperedge, and hyperedge-to-hyperedge s-distance queries for any value of s. A key observation underlying our framework is that as s increases, the hypergraph becomes more fragmented. We show how this can be exploited to improve the placement of landmarks, by identifying the s-connected components of the hypergraph. For this latter task, we devise an efficient algorithm based on the union-find technique and a dynamic inverted index. We experimentally evaluate HypED on several real-world hypergraphs and demonstrate its versatility in answering s-distance queries for different values of s. Our framework answers such queries in fractions of a millisecond while allowing fine-grained control of the trade-off between index size and approximation error at creation time. Finally, we demonstrate the usefulness of the s-distance oracle in two applications, namely hypergraph-based recommendation and the approximation of the s-closeness centrality of vertices and hyperedges in the context of protein-protein interactions.
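The s-connected-component computation the abstract mentions can be sketched with a union-find structure: two hyperedges are s-adjacent when they share at least s vertices. This is a minimal illustration under simplifying assumptions (hyperedges as vertex sets, an O(|E|²) pairwise overlap check), not the paper's optimized algorithm with a dynamic inverted index.

```python
# Sketch: s-connected components of a hypergraph via union-find.
# Two hyperedges are merged when their vertex overlap is >= s.
# The quadratic pairwise check is a simplifying assumption; the paper
# uses a more efficient scheme with a dynamic inverted index.
from itertools import combinations

def s_connected_components(hyperedges, s):
    parent = list(range(len(hyperedges)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    sets = [set(e) for e in hyperedges]
    for i, j in combinations(range(len(sets)), 2):
        if len(sets[i] & sets[j]) >= s:
            union(i, j)

    comps = {}
    for i in range(len(sets)):
        comps.setdefault(find(i), []).append(i)
    return list(comps.values())

# With s = 2, the first two hyperedges overlap in {2, 3} and merge;
# the third shares only one vertex with its neighbor and stays apart.
print(s_connected_components([{1, 2, 3}, {2, 3, 4}, {4, 5}], 2))
# → [[0, 1], [2]]
```

Raising s fragments the hypergraph into more, smaller components, which is exactly the property the abstract exploits for landmark placement.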

Citations: 0
Special issue on “Machine learning and databases”
Pub Date: 2024-04-17 DOI: 10.1007/s00778-024-00848-x
Matthias Boehm, Nesime Tatbul
Citations: 0
Data distribution tailoring revisited: cost-efficient integration of representative data
Pub Date: 2024-04-12 DOI: 10.1007/s00778-024-00849-w
Jiwon Chang, Bohan Cui, Fatemeh Nargesian, Abolfazl Asudeh, H. V. Jagadish

Data scientists often develop data sets for analysis by drawing upon available data sources. A major challenge is ensuring that the data set used for analysis adequately represents relevant demographic groups or other variables. Whether data is obtained from an experiment or a data provider, a single data source may not meet the desired distribution requirements, so combining data from multiple sources is often necessary. The data distribution tailoring (DT) problem aims to cost-efficiently collect a unified data set from multiple sources. In this paper, we present major optimizations and generalizations of previous algorithms for this problem. When the group distributions of the sources are known, we present a novel algorithm, RatioColl, based on the coupon collector's problem, that outperforms the existing algorithm. When the distributions are unknown, we propose decaying-exploration-rate multi-armed bandit algorithms that, unlike the existing algorithm for unknown DT, do not require prior information. Through theoretical analysis and extensive experiments, we demonstrate the effectiveness of our proposed algorithms.
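The unknown-distribution setting can be illustrated with a simple decaying-ε multi-armed bandit over sources: explore with probability 1/t, otherwise query the source with the best empirical rate of yielding still-needed groups. This is a hedged stand-in for the idea, not the paper's algorithm; the source interface (a callable returning a group label), the quota encoding, and the 1/t schedule are illustrative assumptions.

```python
# Sketch: decaying-exploration-rate bandit for filling per-group quotas
# from data sources with unknown group distributions. NOT the paper's
# algorithm; the 1/t decay and the source/quota interfaces are
# assumptions for illustration.
import random

def collect_with_decaying_epsilon(sources, targets, budget, seed=0):
    rng = random.Random(seed)
    n = len(sources)
    pulls, hits = [0] * n, [0] * n
    needed = dict(targets)   # group label -> remaining quota
    cost = 0
    for t in range(1, budget + 1):
        if not any(v > 0 for v in needed.values()):
            break            # all quotas filled
        eps = 1.0 / t        # decaying exploration rate
        if rng.random() < eps or 0 in pulls:
            i = pulls.index(0) if 0 in pulls else rng.randrange(n)
        else:
            # Exploit: source with the best empirical useful-sample rate.
            i = max(range(n), key=lambda k: hits[k] / pulls[k])
        group = sources[i](rng)  # draw one sample; observe its group
        pulls[i] += 1
        cost += 1
        if needed.get(group, 0) > 0:
            needed[group] -= 1
            hits[i] += 1
    return cost, needed
```

With a single source that always yields group "g" and a quota of 5, the collector stops after exactly 5 unit-cost queries.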

Citations: 0
Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systems
Pub Date: 2024-04-12 DOI: 10.1007/s00778-024-00845-0
Lijie Xu, Shuang Qiu, Binhang Yuan, Jiawei Jiang, Cedric Renggli, Shaoduo Gan, Kaan Kara, Guoliang Li, Ji Liu, Wentao Wu, Jieping Ye, Ce Zhang

Modern machine learning (ML) systems commonly use stochastic gradient descent (SGD) to train ML models. However, SGD relies on random data order to converge, which usually requires a full data shuffle. For in-DB ML systems and deep learning systems with large datasets stored on block-addressable secondary storage such as HDD and SSD, this full data shuffle leads to low I/O performance—the data shuffling time can be even longer than the training itself, due to massive random data accesses. To balance the convergence rate of SGD (which favors data randomness) and its I/O performance (which favors sequential access), previous work has proposed several data shuffling strategies. In this paper, we first perform an empirical study on existing data shuffling strategies, showing that these strategies suffer from either low performance or low convergence rate. To solve this problem, we propose a simple but novel two-level data shuffling strategy named CorgiPile, which can avoid a full data shuffle while maintaining comparable convergence rate of SGD as if a full shuffle were performed. We further theoretically analyze the convergence behavior of CorgiPile and empirically evaluate its efficacy in both in-DB ML and deep learning systems. For in-DB ML systems, we integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. For deep learning systems, we extend single-process CorgiPile to multi-process CorgiPile for the parallel/distributed environment and integrate it into PyTorch. Our evaluation shows that CorgiPile can achieve comparable convergence rate with the full-shuffle-based SGD for both linear models and deep learning models. For in-DB ML with linear models, CorgiPile is 1.6× to 12.8× faster than two state-of-the-art systems, Apache MADlib and Bismarck, on both HDD and SSD. For deep learning models on ImageNet, CorgiPile is 1.5× faster than PyTorch with full data shuffle.
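The two-level shuffle idea can be sketched in a few lines: shuffle the order in which on-disk blocks are read (each block is still read sequentially), then shuffle tuples inside a bounded in-memory buffer before yielding them to SGD. This is a toy reading of the idea only, not the CorgiPile operators integrated into PostgreSQL or PyTorch; block and buffer sizing are illustrative assumptions.

```python
# Toy sketch of a two-level shuffle in the spirit of CorgiPile:
# level 1 shuffles block order, level 2 shuffles tuples within a small
# buffer, so I/O stays mostly sequential. Sizing is an assumption;
# this is NOT the paper's implementation.
import random

def two_level_shuffle(blocks, buffer_blocks, seed=0):
    """Yield tuples from `blocks` (equal-sized lists) in a partially
    randomized order using a buffer of `buffer_blocks` blocks."""
    rng = random.Random(seed)
    order = list(range(len(blocks)))
    rng.shuffle(order)                 # level 1: block-level shuffle
    buffer = []
    for idx in order:
        buffer.extend(blocks[idx])     # sequential read of one block
        if len(buffer) >= buffer_blocks * len(blocks[0]):
            rng.shuffle(buffer)        # level 2: tuple-level shuffle
            yield from buffer
            buffer.clear()
    if buffer:                         # flush any remainder
        rng.shuffle(buffer)
        yield from buffer
```

Every tuple is emitted exactly once per epoch, but only `buffer_blocks` blocks ever reside in memory at a time, which is the trade-off between randomness and sequential I/O the abstract describes.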

Citations: 0
Hilogx: noise-aware log-based anomaly detection with human feedback
Pub Date: 2024-03-28 DOI: 10.1007/s00778-024-00843-2

Abstract

Log-based anomaly detection is essential for maintaining system reliability. Although existing log-based anomaly detection approaches perform well in certain experimental systems, they are ineffective in real-world industrial systems with noisy log data. This paper focuses on mitigating the impact of noisy log data. To this aim, we first conduct an empirical study on the system logs of four large-scale industrial software systems. Through the study, we find five typical noise patterns that are the root causes of the unsatisfactory results of existing anomaly detection models. Based on the study, we propose HiLogx, a noise-aware log-based anomaly detection approach that integrates human knowledge to identify these noise patterns and further modifies the anomaly detection model with human feedback. Experimental results on four large-scale industrial software systems and two open datasets show that our approach improves precision by over 30% and recall by 15% on average.

Citations: 0