Similarity-driven and task-driven models for diversity of opinion in crowdsourcing markets
Pub Date: 2024-05-17 | DOI: 10.1007/s00778-024-00853-0
Chen Jason Zhang, Yunrui Liu, Pengcheng Zeng, Ting Wu, Lei Chen, Pan Hui, Fei Hao
The recent boom in crowdsourcing has opened up a new avenue for utilizing human intelligence in data analysis. This approach provides a powerful means of connecting online workers to tasks that cannot be done effectively by machines alone, or that professional experts cannot conduct due to cost constraints. Within social science, four elements are required to construct a sound crowd: Diversity of Opinion, Independence, Decentralization, and Aggregation. However, while the other three components have already been investigated and implemented in existing crowdsourcing platforms, ‘Diversity of Opinion’ has not yet been functionally enabled. From a computational point of view, constructing a wise crowd requires quantitatively modeling diversity and taking it into account. A crowdsourcing marketplace usually follows one of two paradigms for worker selection: building a crowd in advance to wait for incoming tasks, or selecting workers for a given task. We propose similarity-driven and task-driven models for these two paradigms, and we develop efficient and effective algorithms for recruiting a limited number of workers with optimal diversity under both models. To validate our solutions, we conduct extensive experiments on both synthetic and real datasets.
{"title":"Similarity-driven and task-driven models for diversity of opinion in crowdsourcing markets","authors":"Chen Jason Zhang, Yunrui Liu, Pengcheng Zeng, Ting Wu, Lei Chen, Pan Hui, Fei Hao","doi":"10.1007/s00778-024-00853-0","DOIUrl":"https://doi.org/10.1007/s00778-024-00853-0","url":null,"abstract":"<p>The recent boom in crowdsourcing has opened up a new avenue for utilizing human intelligence in the realm of data analysis. This innovative approach provides a powerful means for connecting online workers to tasks that cannot effectively be done solely by machines or conducted by professional experts due to cost constraints. Within the field of social science, four elements are required to construct a sound crowd—Diversity of Opinion, Independence, Decentralization and Aggregation. However, while the other three components have already been investigated and implemented in existing crowdsourcing platforms, ‘Diversity of Opinion’ has not been functionally enabled yet. From a computational point of view, constructing a wise crowd necessitates quantitatively modeling and taking diversity into account. There are usually two paradigms in a crowdsourcing marketplace for worker selection: building a crowd to wait for tasks to come and selecting workers for a given task. We propose similarity-driven and task-driven models for both paradigms. Also, we develop efficient and effective algorithms for recruiting a limited number of workers with optimal diversity in both models. To validate our solutions, we conduct extensive experiments using both synthetic datasets and real data sets.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"129 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141058780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient and effective algorithms for densest subgraph discovery and maintenance
Pub Date: 2024-05-08 | DOI: 10.1007/s00778-024-00855-y
Yichen Xu, Chenhao Ma, Yixiang Fang, Zhifeng Bao
The densest subgraph problem (DSP) is of great significance due to its wide applications in different domains. Meanwhile, diverse requirements in various applications lead to different density variants of DSP. Unfortunately, existing DSP algorithms cannot be easily extended to handle those variants efficiently and accurately. To fill this gap, we first unify different density metrics into a generalized density definition. We further propose a new model, c-core, to locate the general densest subgraph and show its advantage in accelerating the search process. Extensive experiments show that our c-core-based optimization can provide up to three orders of magnitude speedup over baselines. We also design c-core maintenance methods to accelerate updates on dynamic graphs. Moreover, we study an important variant of DSP under a size constraint, namely the densest-at-least-k-subgraph (DalkS) problem. We propose an algorithm based on graph decomposition; in our experiments it typically returns a solution with density at least 0.8 of the optimum, whereas the state-of-the-art method only guarantees a density of at least 0.5 of the optimum. Our experiments also show that our DalkS algorithm achieves at least 0.99 of the optimal density for over one-third of all possible size constraints. In addition, we develop an approximation algorithm for the DalkS problem that can be more efficient than the state-of-the-art algorithm while keeping the same approximation ratio of 1/3.
{"title":"Efficient and effective algorithms for densest subgraph discovery and maintenance","authors":"Yichen Xu, Chenhao Ma, Yixiang Fang, Zhifeng Bao","doi":"10.1007/s00778-024-00855-y","DOIUrl":"https://doi.org/10.1007/s00778-024-00855-y","url":null,"abstract":"<p>The densest subgraph problem (DSP) is of great significance due to its wide applications in different domains. Meanwhile, diverse requirements in various applications lead to different density variants for DSP. Unfortunately, existing DSP algorithms cannot be easily extended to handle those variants efficiently and accurately. To fill this gap, we first unify different density metrics into a generalized density definition. We further propose a new model, <i>c</i>-core, to locate the general densest subgraph and show its advantage in accelerating the search process. Extensive experiments show that our <i>c</i>-core-based optimization can provide up to three orders of magnitude speedup over baselines. Methods for maintenance of <i>c</i>-core location are designed to accelerate updates on dynamic graphs. Moreover, we study an important variant of DSP under a size constraint, namely the densest-at-least-k-subgraph (Dal<i>k</i>S) problem. We propose an algorithm based on graph decomposition, and it is likely to give a solution that is at least 0.8 of the optimal density in our experiments, while the state-of-the-art method can only ensure a solution with a density of at least 0.5 of the optimal density. Our experiments show that our Dal<i>k</i>S algorithm can achieve at least 0.99 of the optimal density for over one-third of all possible size constraints. In addition, we develop an approximation algorithm for the Dal<i>k</i>S problem that can be more efficient than the state-of-the-art algorithm while keeping the same approximation ratio of <span>(frac{1}{3})</span>.\u0000</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140925182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hyper-distance oracles in hypergraphs
Pub Date: 2024-04-19 | DOI: 10.1007/s00778-024-00851-2
Giulia Preti, Gianmarco De Francisci Morales, Francesco Bonchi
We study point-to-point distance estimation in hypergraphs, where the query is parameterized by a positive integer s that defines the level of overlap required for two hyperedges to be considered adjacent. To answer s-distance queries, we first explore an oracle based on the line graph of the given hypergraph and discuss its limitations: the line graph is typically orders of magnitude larger than the original hypergraph. We then introduce HypED, a landmark-based oracle with a predefined size, built directly on the hypergraph, thus avoiding materialization of the line graph. Our framework can approximately answer vertex-to-vertex, vertex-to-hyperedge, and hyperedge-to-hyperedge s-distance queries for any value of s. A key observation underlying our framework is that, as s increases, the hypergraph becomes more fragmented. We show how this can be exploited to improve the placement of landmarks by identifying the s-connected components of the hypergraph. For this latter task, we devise an efficient algorithm based on the union-find technique and a dynamic inverted index. We experimentally evaluate HypED on several real-world hypergraphs and demonstrate its versatility in answering s-distance queries for different values of s. Our framework answers such queries in fractions of a millisecond while allowing fine-grained control of the trade-off between index size and approximation error at creation time. Finally, we demonstrate the usefulness of the s-distance oracle in two applications, namely hypergraph-based recommendation and the approximation of the s-closeness centrality of vertices and hyperedges in the context of protein-protein interactions.
{"title":"Hyper-distance oracles in hypergraphs","authors":"Giulia Preti, Gianmarco De Francisci Morales, Francesco Bonchi","doi":"10.1007/s00778-024-00851-2","DOIUrl":"https://doi.org/10.1007/s00778-024-00851-2","url":null,"abstract":"<p>We study point-to-point distance estimation in hypergraphs, where the query is parameterized by a positive integer <i>s</i>, which defines the required level of overlap for two hyperedges to be considered adjacent. To answer <i>s</i>-distance queries, we first explore an oracle based on the line graph of the given hypergraph and discuss its limitations: The line graph is typically orders of magnitude larger than the original hypergraph. We then introduce <span>HypED</span>, a landmark-based oracle with a predefined size, built directly on the hypergraph, thus avoiding the materialization of the line graph. Our framework allows to approximately answer vertex-to-vertex, vertex-to-hyperedge, and hyperedge-to-hyperedge <i>s</i>-distance queries for any value of <i>s</i>. A key observation at the basis of our framework is that as <i>s</i> increases, the hypergraph becomes more fragmented. We show how this can be exploited to improve the placement of landmarks, by identifying the <i>s</i>-connected components of the hypergraph. For this latter task, we devise an efficient algorithm based on the union-find technique and a dynamic inverted index. We experimentally evaluate <span>HypED</span> on several real-world hypergraphs and prove its versatility in answering <i>s</i>-distance queries for different values of <i>s</i>. Our framework allows answering such queries in fractions of a millisecond while allowing fine-grained control of the trade-off between index size and approximation error at creation time. Finally, we prove the usefulness of the <i>s</i>-distance oracle in two applications, namely hypergraph-based recommendation and the approximation of the <i>s</i>-closeness centrality of vertices and hyperedges in the context of protein-protein interactions.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"38 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140631145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data distribution tailoring revisited: cost-efficient integration of representative data
Pub Date: 2024-04-12 | DOI: 10.1007/s00778-024-00849-w
Jiwon Chang, Bohan Cui, Fatemeh Nargesian, Abolfazl Asudeh, H. V. Jagadish
Data scientists often develop data sets for analysis by drawing upon available data sources. A major challenge is ensuring that the data set used for analysis adequately represents relevant demographic groups or other variables. Whether data is obtained from an experiment or a data provider, a single data source may not meet the desired distribution requirements, so combining data from multiple sources is often necessary. The data distribution tailoring (DT) problem aims to cost-efficiently collect a unified data set from multiple sources. In this paper, we present major optimizations and generalizations of previous algorithms for this problem. When the group distributions of the sources are known, we present RatioColl, a novel algorithm based on the coupon collector's problem that outperforms the existing algorithm. When the distributions are unknown, we propose multi-armed-bandit algorithms with a decaying exploration rate that, unlike the existing algorithm for unknown DT, do not require prior information. Through theoretical analysis and extensive experiments, we demonstrate the effectiveness of our proposed algorithms.
{"title":"Data distribution tailoring revisited: cost-efficient integration of representative data","authors":"Jiwon Chang, Bohan Cui, Fatemeh Nargesian, Abolfazl Asudeh, H. V. Jagadish","doi":"10.1007/s00778-024-00849-w","DOIUrl":"https://doi.org/10.1007/s00778-024-00849-w","url":null,"abstract":"<p>Data scientists often develop data sets for analysis by drawing upon available data sources. A major challenge is ensuring that the data set used for analysis adequately represents relevant demographic groups or other variables. Whether data is obtained from an experiment or a data provider, a single data source may not meet the desired distribution requirements. Therefore, combining data from multiple sources is often necessary. The data distribution tailoring (DT) problem aims to cost-efficiently collect a unified data set from multiple sources. In this paper, we present major optimizations and generalizations to previous algorithms for this problem. In situations when group distributions are known in sources, we present a novel algorithm <span>RatioColl</span> that outperforms the existing algorithm, based on the coupon collector’s problem. If distributions are unknown, we propose decaying exploration rate multi-armed-bandit algorithms that, unlike the existing algorithm used for unknown DT, does not require prior information. Through theoretical analysis and extensive experiments, we demonstrate the effectiveness of our proposed algorithms.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"53 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140592973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systems
Pub Date: 2024-04-12 | DOI: 10.1007/s00778-024-00845-0
Lijie Xu, Shuang Qiu, Binhang Yuan, Jiawei Jiang, Cedric Renggli, Shaoduo Gan, Kaan Kara, Guoliang Li, Ji Liu, Wentao Wu, Jieping Ye, Ce Zhang
Modern machine learning (ML) systems commonly use stochastic gradient descent (SGD) to train ML models. However, SGD relies on random data order to converge, which usually requires a full data shuffle. For in-DB ML systems and deep learning systems with large datasets stored on block-addressable secondary storage such as HDD and SSD, this full data shuffle leads to low I/O performance; the data shuffling time can even exceed the training time itself, due to massive random data accesses. To balance the convergence rate of SGD (which favors data randomness) and its I/O performance (which favors sequential access), previous work has proposed several data shuffling strategies. In this paper, we first perform an empirical study of existing data shuffling strategies, showing that they suffer from either low performance or a low convergence rate. To solve this problem, we propose a simple but novel two-level data shuffling strategy named CorgiPile, which avoids a full data shuffle while maintaining a convergence rate comparable to SGD with a full shuffle. We further theoretically analyze the convergence behavior of CorgiPile and empirically evaluate its efficacy in both in-DB ML and deep learning systems. For in-DB ML systems, we integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. For deep learning systems, we extend single-process CorgiPile to multi-process CorgiPile for parallel/distributed environments and integrate it into PyTorch. Our evaluation shows that CorgiPile achieves a convergence rate comparable to full-shuffle-based SGD for both linear models and deep learning models. For in-DB ML with linear models, CorgiPile is 1.6× to 12.8× faster than two state-of-the-art systems, Apache MADlib and Bismarck, on both HDD and SSD. For deep learning models on ImageNet, CorgiPile is 1.5× faster than PyTorch with a full data shuffle.
{"title":"Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systems","authors":"Lijie Xu, Shuang Qiu, Binhang Yuan, Jiawei Jiang, Cedric Renggli, Shaoduo Gan, Kaan Kara, Guoliang Li, Ji Liu, Wentao Wu, Jieping Ye, Ce Zhang","doi":"10.1007/s00778-024-00845-0","DOIUrl":"https://doi.org/10.1007/s00778-024-00845-0","url":null,"abstract":"<p>Modern machine learning (ML) systems commonly use stochastic gradient descent (SGD) to train ML models. However, SGD relies on random data order to converge, which usually requires a full data shuffle. For in-DB ML systems and deep learning systems with large datasets stored on <i>block-addressable secondary storage</i> such as HDD and SSD, this full data shuffle leads to low I/O performance—the data shuffling time can be even longer than the training itself, due to massive random data accesses. To balance the convergence rate of SGD (which favors data randomness) and its I/O performance (which favors sequential access), previous work has proposed several data shuffling strategies. In this paper, we first perform an empirical study on existing data shuffling strategies, showing that these strategies suffer from either low performance or low convergence rate. To solve this problem, we propose a simple but novel <i>two-level</i> data shuffling strategy named <span>CorgiPile</span>, which can <i>avoid</i> a full data shuffle while maintaining <i>comparable</i> convergence rate of SGD as if a full shuffle were performed. We further theoretically analyze the convergence behavior of <span>CorgiPile</span> and empirically evaluate its efficacy in both in-DB ML and deep learning systems. For in-DB ML systems, we integrate <span>CorgiPile</span> into PostgreSQL by introducing three new <i>physical</i> operators with optimizations. For deep learning systems, we extend single-process <span>CorgiPile</span> to multi-process <span>CorgiPile</span> for the parallel/distributed environment and integrate it into PyTorch. Our evaluation shows that <span>CorgiPile</span> can achieve comparable convergence rate with the full-shuffle-based SGD for both linear models and deep learning models. For in-DB ML with linear models, <span>CorgiPile</span> is 1.6<span>(times )</span> <span>(-)</span>12.8<span>(times )</span> faster than two state-of-the-art systems, Apache MADlib and Bismarck, on both HDD and SSD. For deep learning models on ImageNet, <span>CorgiPile</span> is 1.5<span>(times )</span> faster than PyTorch with full data shuffle.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140593052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HiLogx: noise-aware log-based anomaly detection with human feedback
Pub Date: 2024-03-28 | DOI: 10.1007/s00778-024-00843-2
Log-based anomaly detection is essential for maintaining system reliability. Although existing log-based anomaly detection approaches perform well in certain experimental systems, they are ineffective in real-world industrial systems with noisy log data. This paper focuses on mitigating the impact of noisy log data. To this end, we first conduct an empirical study on the system logs of four large-scale industrial software systems. Through the study, we find five typical noise patterns that are the root causes of the unsatisfactory results of existing anomaly detection models. Based on the study, we propose HiLogx, a noise-aware log-based anomaly detection approach that integrates human knowledge to identify these noise patterns and further modifies the anomaly detection model with human feedback. Experimental results on four large-scale industrial software systems and two open datasets show that our approach improves precision by over 30% and recall by 15% on average.
{"title":"Hilogx: noise-aware log-based anomaly detection with human feedback","authors":"","doi":"10.1007/s00778-024-00843-2","DOIUrl":"https://doi.org/10.1007/s00778-024-00843-2","url":null,"abstract":"<h3>Abstract</h3> <p>Log-based anomaly detection is essential for maintaining system reliability. Although existing log-based anomaly detection approaches perform well in certain experimental systems, they are ineffective in real-world industrial systems with noisy log data. This paper focuses on mitigating the impact of noisy log data. To this aim, we first conduct an empirical study on the system logs of four large-scale industrial software systems. Through the study, we find five typical noise patterns that are the root causes of unsatisfactory results of existing anomaly detection models. Based on the study, we propose HiLogx, a noise-aware log-based anomaly detection approach that integrates human knowledge to identify these noise patterns and further modify the anomaly detection model with human feedback. Experimental results on four large-scale industrial software systems and two open datasets show that our approach improves over 30% precision and 15% recall on average. </p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"220 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140325588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}