BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach
Zhen Zheng, Zaifeng Pan, Dalin Wang, Kai Zhu, Wenyi Zhao, Tianyou Guo, Xiafei Qiu, Minmin Sun, Junjie Bai, Feng Zhang, Xiaoyong Du, Jidong Zhai, Wei Lin
Compiler optimization plays an increasingly important role in boosting the performance of machine learning models for data processing and management. With increasingly complex data, dynamic tensor shapes have become common in ML models. However, existing ML compilers either handle only static-shape models or suffer from a series of performance problems in both operator fusion optimization and code generation under dynamic shapes. This paper tackles the main challenges of dynamic shape optimization: fusion optimization without shape values, and code generation that supports arbitrary shapes. To tackle the fundamental challenge of absent shape values, it systematically abstracts and extracts shape information and designs a cross-level symbolic shape representation. Based on the insight that fusion optimization relies on the tensor shape relationships between adjacent operators rather than exact shape values, it proposes a dynamic shape fusion approach built on shape information propagation. To generate code that efficiently adapts to arbitrary shapes, it proposes a combined compile-time and runtime code generation approach. Finally, it presents a complete optimization pipeline for dynamic shape models and implements an industrial-grade ML compiler named BladeDISC. The extensive evaluation demonstrates that BladeDISC outperforms PyTorch, TorchScript, TVM, ONNX Runtime, XLA, Torch Inductor (dynamic shape), and TensorRT by up to 6.95×, 6.25×, 4.08×, 2.04×, 2.06×, 7.92×, and 4.16× (3.54×, 3.12×, 1.95×, 1.47×, 1.24×, 2.93×, and 1.46× on average) in end-to-end inference speedup on A10 and T4 GPUs. BladeDISC's source code is publicly available at https://github.com/alibaba/BladeDISC.
{"title":"BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach","authors":"Zhen Zheng, Zaifeng Pan, Dalin Wang, Kai Zhu, Wenyi Zhao, Tianyou Guo, Xiafei Qiu, Minmin Sun, Junjie Bai, Feng Zhang, Xiaoyong Du, Jidong Zhai, Wei Lin","doi":"10.1145/3617327","DOIUrl":"https://doi.org/10.1145/3617327","url":null,"abstract":"Compiler optimization plays an increasingly important role to boost the performance of machine learning models for data processing and management. With increasingly complex data, the dynamic tensor shape phenomenon emerges for ML models. However, existing ML compilers either can only handle static shape models or expose a series of performance problems for both operator fusion optimization and code generation in dynamic shape scenes. This paper tackles the main challenges of dynamic shape optimization: the fusion optimization without shape value, and code generation supporting arbitrary shapes. To tackle the fundamental challenge of the absence of shape values, it systematically abstracts and excavates the shape information and designs a cross-level symbolic shape representation. With the insight that what fusion optimization relies upon is tensor shape relationships between adjacent operators rather than exact shape values, it proposes the dynamic shape fusion approach based on shape information propagation. To generate code that adapts to arbitrary shapes efficiently, it proposes a compile-time and runtime combined code generation approach. Finally, it presents a complete optimization pipeline for dynamic shape models and implements an industrial-grade ML compiler, named BladeDISC. The extensive evaluation demonstrates that BladeDISC outperforms PyTorch, TorchScript, TVM, ONNX Runtime, XLA, Torch Inductor (dynamic shape), and TensorRT by up to 6.95×, 6.25×, 4.08×, 2.04×, 2.06×, 7.92×, and 4.16× (3.54×, 3.12×, 1.95×, 1.47×, 1.24×, 2.93×, and 1.46× on average) in terms of end-to-end inference speedup on the A10 and T4 GPU, respectively. BladeDISC's source code is publicly available at https://github.com/alibaba/BladeDISC.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":"36 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136281618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PACMMOD Volume 1, Issue 3: Editorial
Divyakant Agrawal, Alexandra Meliou, S. Sudarshan
We are excited to introduce this new issue of PACMMOD (Proceedings of the ACM on Management of Data). PACMMOD is a new journal concerned with the principles, algorithms, techniques, systems, and applications of database management systems, data management technology, and the science and engineering of data. It includes articles reporting cutting-edge data management, data engineering, and data science research. Articles published in PACMMOD address data challenges at various stages of the data lifecycle, including modeling, acquisition, cleaning, integration, indexing, querying, analysis, exploration, visualization, interpretation, and explanation. They focus on data-intensive components of data pipelines and solve problems in areas of interest to our community (e.g., data curation, optimization, performance, storage, systems), operating within accuracy, privacy, fairness, and diversity constraints. Articles reporting deployed systems and solutions to data science pipelines and/or fundamental experiences and insights from evaluating real-world data engineering problems are especially encouraged.
{"title":"PACMMOD Volume 1, Issue 3: Editorial","authors":"Divyakant Agrawal, Alexandra Meliou, S. Sudarshan","doi":"10.1145/3617307","DOIUrl":"https://doi.org/10.1145/3617307","url":null,"abstract":"We are excited to introduce this new issue of PACMMOD (Proceedings of the ACM on Management of Data). PACMMOD is a new journal, concerned with the principles, algorithms, techniques, systems, and applications of database management systems, data management technology, and science and engineering of data. It includes articles reporting cutting-edge data management, data engineering, and data science research. Articles published at PACMMOD address data challenges at various stages of the data lifecycle, from modeling, acquisition, cleaning, integration, indexing, querying, analysis, exploration, visualization, interpretation, and explanation. They focus on dataintensive components of data pipelines, and solve problems in areas of interest to our community (e.g., data curation, optimization, performance, storage, systems), operating within accuracy, privacy, fairness, and diversity constraints. Articles reporting deployed systems and solutions to data science pipelines and/or fundamental experiences and insights from evaluating real-world data engineering problems are especially encouraged.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136281619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs
Laxman Dhulipala, Jakub Łącki, Jason Lee, Vahab Mirrokni
We introduce TeraHAC, a (1+ε)-approximate hierarchical agglomerative clustering (HAC) algorithm that scales to trillion-edge graphs. Our algorithm is based on a new approach to computing (1+ε)-approximate HAC that combines the nearest-neighbor chain algorithm with the notion of (1+ε)-approximation. Our approach allows us to partition the graph among multiple machines and make significant progress in computing the clustering within each partition before any communication with other partitions is needed. We evaluate TeraHAC on a number of real-world and synthetic graphs of up to 8 trillion edges. We show that TeraHAC requires over 100x fewer rounds than previously known approaches for computing HAC. It is up to 8.3x faster than SCC, the state-of-the-art distributed algorithm for hierarchical clustering, while achieving 1.16x higher quality. In fact, TeraHAC essentially retains the quality of the celebrated HAC algorithm while significantly improving the running time.
{"title":"TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs","authors":"Laxman Dhulipala, Jakub Łącki, Jason Lee, Vahab Mirrokni","doi":"10.1145/3617341","DOIUrl":"https://doi.org/10.1145/3617341","url":null,"abstract":"We introduce TeraHAC, a (1+ε)-approximate hierarchical agglomerative clustering (HAC) algorithm which scales to trillion-edge graphs. Our algorithm is based on a new approach to computing (1+ε)-approximate HAC, which is a novel combination of the nearest-neighbor chain algorithm and the notion of (1+ε)-approximate HAC. Our approach allows us to partition the graph among multiple machines and make significant progress in computing the clustering within each partition before any communication with other partitions is needed. We evaluate TeraHAC on a number of real-world and synthetic graphs of up to 8 trillion edges. We show that TeraHAC requires over 100x fewer rounds compared to previously known approaches for computing HAC. It is up to 8.3x faster than SCC, the state-of-the-art distributed algorithm for hierarchical clustering, while achieving 1.16x higher quality. In fact, TeraHAC essentially retains the quality of the celebrated HAC algorithm while significantly improving the running time.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":"34 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136281461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications
Shafaq Siddiqi, Roman Kern, Matthias Boehm
In the exploratory data science lifecycle, data scientists often spend the majority of their time finding, integrating, validating, and cleaning relevant datasets. Despite recent work on data validation and numerous error detection and correction algorithms, in practice, data cleaning for ML remains largely a manual, unpleasant, and labor-intensive trial-and-error process, especially in large-scale, distributed computation. The target ML application, such as a classification or regression model, can however be used as a valuable feedback signal for selecting effective data cleaning strategies. In this paper, we introduce SAGA, a framework for automatically generating the top-K most effective data cleaning pipelines. SAGA adopts ideas from Auto-ML, feature selection, and hyper-parameter tuning. Our framework is extensible to user-provided constraints, new data cleaning primitives, and ML applications; automatically generates hybrid runtime plans of local and distributed operations; and performs pruning by interesting properties (e.g., monotonicity). Instead of full automation, which is rather unrealistic, SAGA simplifies the mechanical aspects of data cleaning. Our experiments show that SAGA yields robust accuracy improvements over the state-of-the-art and good scalability with increasing data sizes and numbers of evaluated pipelines.
{"title":"SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications","authors":"Shafaq Siddiqi, Roman Kern, Matthias Boehm","doi":"10.1145/3617338","DOIUrl":"https://doi.org/10.1145/3617338","url":null,"abstract":"In the exploratory data science lifecycle, data scientists often spent the majority of their time finding, integrating, validating and cleaning relevant datasets. Despite recent work on data validation, and numerous error detection and correction algorithms, in practice, data cleaning for ML remains largely a manual, unpleasant, and labor-intensive trial and error process, especially in large-scale, distributed computation. The target ML application---such as classification or regression models---can be used as a signal of valuable feedback though, for selecting effective data cleaning strategies. In this paper, we introduce SAGA, a framework for automatically generating the top-K most effective data cleaning pipelines. SAGA adopts ideas from Auto-ML, feature selection, and hyper-parameter tuning. Our framework is extensible for user-provided constraints, new data cleaning primitives, and ML applications; automatically generates hybrid runtime plans of local and distributed operations; and performs pruning by interesting properties (e.g., monotonicity). Instead of full automation---which is rather unrealistic---SAGA simplifies the mechanical aspects of data cleaning. Our experiments show that SAGA yields robust accuracy improvements over state-of-the-art, and good scalability regarding increasing data sizes and number of evaluated pipelines.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":"36 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136281616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Closest Pairs Search Over Data Stream
Rui Zhu, Bin Wang, Xiaochun Yang, Baihua Zheng
k-closest pair (KCP for short) search is a fundamental problem in database research. Given a set of d-dimensional streaming data S, KCP search aims to retrieve the k pairs with the shortest distances between them. While existing works have studied continuous 1-closest pair queries (i.e., k=1) over dynamic data environments, which allow object insertions and deletions, they incur high computational costs and cannot easily support KCP search with k>1. This paper investigates the problem of KCP search over data streams, aiming to incrementally maintain as few pairs as possible to support KCP search with arbitrary k. To achieve this, we introduce the concept of the NNS (short for Nearest Neighbour pair-Set), which consists of all nearest-neighbour pairs and allows us to answer KCP search by accessing only O(k) objects. We further observe that in most cases only a small portion of the NNS is needed to answer KCP search, as typically k ≪ n. Based on this observation, we propose the TNNS (short for Threshold-based NN-pair Set), which contains a small number of high-quality NN pairs, and a partition named τ-DLBP (short for τ-Distance Lower-Bound based Partition) to organize objects, with τ being an integer significantly smaller than n. τ-DLBP organizes objects using up to O(log n / τ) partitions and is able to support the construction and update of TNNS efficiently.
{"title":"Closest Pairs Search Over Data Stream","authors":"Rui Zhu, Bin Wang, Xiaochun Yang, Baihua Zheng","doi":"10.1145/3617326","DOIUrl":"https://doi.org/10.1145/3617326","url":null,"abstract":"k-closest pair (KCP for short) search is a fundamental problem in database research. Given a set of d-dimensional streaming data S, KCP search aims to retrieve k pairs with the shortest distances between them. While existing works have studied continuous 1-closest pair query (i.e., k=1) over dynamic data environments, which allow for object insertions/deletions, they require high computational costs and cannot easily support KCP search with k>1. This paper investigates the problem of KCP search over data stream, aiming to incrementally maintain as few pairs as possible to support KCP search with arbitrarily k. To achieve this, we introduce the concept of NNS (short for <u>N</u>earest <u>N</u>eighbour pair-<u>S</u>et), which consists of all the nearest neighbour pairs and allows us to support KCP search via only accessing O(k) objects. We further observe that in most cases, we only need to use a small portion of NNS to answer KCP search as typically kłl n. Based on this observation, we propose TNNS (short for <u>T</u>hreshold-based <u>NN</u>pair <u>S</u>et), which contains a small number of high-quality NN pairs, and a partition named τ-DLBP (short for τ-<u>D</u>istance <u>L</u>ower-<u>B</u>ound based <u>P</u>artition) to organize objects, with τ being an integer significantly smaller than n. τ-DLBP organizes objects using up to O(łog n / τ) partitions and is able to support the construction and update of TNNS efficiently.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":"34 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136282516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memory-Efficient and Flexible Detection of Heavy Hitters in High-Speed Networks
He Huang, Jiakun Yu, Yang Du, Jia Liu, Haipeng Dai, Yu-E Sun
Heavy-hitter detection is a fundamental task in network traffic measurement and security. Existing work faces a dilemma: it either suffers from dynamic and imbalanced traffic characteristics or sacrifices detection efficiency and flexibility. In this paper, we propose a flexible sketch called SwitchSketch that embraces dynamic and skewed traffic for efficient and accurate heavy-hitter detection. The key idea of SwitchSketch is to allow the sketch to dynamically switch among different modes and make full use of each bit of memory. We present an encoding-based switching scheme together with a flexible bucket structure that jointly achieve this goal through a combination of design features, including variable-length cells, shrunk counters, embedded metadata, and switchable modes. We further implement SwitchSketch on the NetFPGA-1G-CML board. Experimental results based on real Internet traces show that SwitchSketch achieves a high Fβ-score for threshold-t detection (consistently higher than 0.938) and an over-99% precision rate for top-k detection under a tight memory size (e.g., 100KB). Besides, it outperforms the state-of-the-art by reducing the ARE by 30.77% to 99.96%. All related implementations are open-sourced.
{"title":"Memory-Efficient and Flexible Detection of Heavy Hitters in High-Speed Networks","authors":"He Huang, Jiakun Yu, Yang Du, Jia Liu, Haipeng Dai, Yu-E Sun","doi":"10.1145/3617334","DOIUrl":"https://doi.org/10.1145/3617334","url":null,"abstract":"Heavy-hitter detection is a fundamental task in network traffic measurement and security. Existing work faces the dilemma of suffering dynamic and imbalanced traffic characteristics or lowering the detection efficiency and flexibility. In this paper, we propose a flexible sketch called SwitchSketch that embraces dynamic and skewed traffic for efficient and accurate heavy-hitter detection. The key idea of SwitchSketch is allowing the sketch to dynamically switch among different modes and take full use of each bit of the memory. We present an encoding-based switching scheme together with a flexible bucket structure to jointly achieve this goal by using a combination of design features, including variable-length cells, shrunk counters, embedded metadata, and switchable modes. We further implement SwitchSketch on the NetFPGA-1G-CML board. Experimental results based on real Internet traces show that SwitchSketch achieves a high Fβ-Score of threshold-t detection (consistently higher than 0.938) and over 99% precision rate of top-k detection under a tight memory size (e.g., 100KB). Besides, it outperforms the state-of-the-art by reducing the ARE by 30.77%sim99.96%. All related implementations are open-sourced.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":"34 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136282513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
OptiQL: Robust Optimistic Locking for Memory-Optimized Indexes
Ge Shi, Ziyi Yan, Tianzheng Wang
Modern memory-optimized indexes often use optimistic locks for concurrent accesses. Read operations can proceed optimistically without taking the lock, greatly improving performance on multicore CPUs. But this comes at the cost of robustness under contention: when many threads contend on a small set of locks, excessive cacheline invalidation and interconnect traffic eventually cause performance collapse. Yet existing solutions often sacrifice desired properties such as a compact 8-byte lock size and fairness among lock requesters. This paper presents the optimistic queuing lock (OptiQL), a new optimistic lock for database indexing that solves this problem. OptiQL extends the classic MCS lock, a fair, compact, and robust mutual exclusion lock, with optimistic read capabilities for index workloads to achieve both robustness and high performance while maintaining various desirable properties. Evaluation using memory-optimized B+-trees on a 40-core, dual-socket server shows that OptiQL matches existing optimistic locks for read operations while avoiding performance collapse under high contention.
{"title":"OptiQL: Robust Optimistic Locking for Memory-Optimized Indexes","authors":"Ge Shi, Ziyi Yan, Tianzheng Wang","doi":"10.1145/3617336","DOIUrl":"https://doi.org/10.1145/3617336","url":null,"abstract":"Modern memory-optimized indexes often use optimistic locks for concurrent accesses. Read operations can proceed optimistically without taking the lock, greatly improving performance on multicore CPUs. But this is at the cost of robustness against contention where many threads contend on a small set of locks, causing excessive cacheline invalidation, interconnect traffic and eventually performance collapse. Yet existing solutions often sacrifice desired properties such as compact 8-byte lock size and fairness among lock requesters. This paper presents optimistic queuing lock (OptiQL), a new optimistic lock for database indexing to solve this problem. OptiQL extends the classic MCS lock---a fair, compact and robust mutual exclusion lock---with optimistic read capabilities for index workloads to achieve both robustness and high performance while maintaining various desirable properties. Evaluation using memory-optimized B+-trees on a 40-core, dual-socket server shows that OptiQL matches existing optimistic locks for read operations, while avoiding performance collapse under high contention.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":"34 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136282518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FedCSS: Joint Client-and-Sample Selection for Hard Sample-Aware Noise-Robust Federated Learning
Anran Li, Yue Cao, Jiabao Guo, Hongyi Peng, Qing Guo, Han Yu
Federated Learning (FL) enables a large number of data owners (a.k.a. FL clients) to jointly train a machine learning model without disclosing private local data. The importance of local data samples to the FL model varies widely. This is exacerbated by the presence of noisy data, which exhibit large losses similar to important (hard) samples. Currently, there is no FL approach that can effectively distinguish hard samples (which are beneficial) from noisy samples (which are harmful). To bridge this gap, we propose the Federated Client and Sample Selection (FedCSS) approach, a bilevel optimization approach for FL client-and-sample selection that achieves hard-sample-aware, noise-robust learning in a privacy-preserving manner. It performs meta-learning-based online approximation to iteratively update global FL models, select the most positively influential samples, and handle training data noise. Theoretical analysis shows that it is guaranteed to converge efficiently. Experimental comparison against six state-of-the-art baselines on five real-world datasets, in the presence of data noise and heterogeneity, shows that it achieves up to 26.4% higher test accuracy while saving communication and computation costs by at least 41.5% and 1.2%, respectively.
{"title":"FedCSS: Joint Client-and-Sample Selection for Hard Sample-Aware Noise-Robust Federated Learning","authors":"Anran Li, Yue Cao, Jiabao Guo, Hongyi Peng, Qing Guo, Han Yu","doi":"10.1145/3617332","DOIUrl":"https://doi.org/10.1145/3617332","url":null,"abstract":"Federated Learning (FL) enables a large number of data owners (a.k.a. FL clients) to jointly train a machine learning model without disclosing private local data. The importance of local data samples to the FL model vary widely. This is exacerbated by the presence of noisy data, which exhibit large losses similar to important (hard) samples. Currently, there lacks an FL approach that can effectively distinguish hard samples (which are beneficial) from noisy samples (which are harmful). To bridge this gap, we propose the Federated Client and Sample Selection (FedCSS) approach. It is a bilevel optimization approach for FL client-and-sample selection to achieve hard sample-aware noise-robust learning in a privacy preserving manner. It performs meta-learning based online approximation to iteratively update global FL models, select the most positively influential samples and deal with training data noise. Theoretical analysis shows that it is guaranteed to converge in an efficient manner. Experimental comparison against six state-of-the-art baselines on five real-world datasets in the presence of data noise and heterogeneity shows that it achieves up to 26.4% higher test accuracy, while saving communication and computation costs by at least 41.5% and 1.2%, respectively.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":"34 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136282519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads
Dingheng Mo, Fanchao Chen, Siqiang Luo, Caihua Shan
LSM-trees are widely adopted as the storage backend of key-value stores. However, optimizing system performance under dynamic workloads has not been sufficiently studied or evaluated in previous work. To fill this gap, we present RusKey, a key-value store with the following new features: (1) RusKey is a first attempt to orchestrate LSM-tree structures online to enable robust performance under dynamic workloads; (2) RusKey is the first study to use Reinforcement Learning (RL) to guide LSM-tree transformations; (3) RusKey includes a new LSM-tree design, named FLSM-tree, for an efficient transition between different compaction policies, which is the bottleneck of dynamic key-value stores. We justify the superiority of the new design with theoretical analysis; (4) RusKey requires no prior workload knowledge for system adjustment, in contrast to state-of-the-art techniques. Experiments show that RusKey exhibits strong performance robustness across diverse workloads, achieving up to 4x better end-to-end performance than the RocksDB system under various settings.
{"title":"Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads","authors":"Dingheng Mo, Fanchao Chen, Siqiang Luo, Caihua Shan","doi":"10.1145/3617333","DOIUrl":"https://doi.org/10.1145/3617333","url":null,"abstract":"LSM-trees are widely adopted as the storage backend of key-value stores. However, optimizing the system performance under dynamic workloads has not been sufficiently studied or evaluated in previous work. To fill the gap, we present RusKey, a key-value store with the following new features: (1) RusKey is a first attempt to orchestrate LSM-tree structures online to enable robust performance under the context of dynamic workloads; (2) RusKey is the first study to use Reinforcement Learning (RL) to guide LSM-tree transformations; (3) RusKey includes a new LSM-tree design, named FLSM-tree, for an efficient transition between different compaction policies -- the bottleneck of dynamic key-value stores. We justify the superiority of the new design with theoretical analysis; (4) RusKey requires no prior workload knowledge for system adjustment, in contrast to state-of-the-art techniques. Experiments show that RusKey exhibits strong performance robustness in diverse workloads, achieving up to 4x better end-to-end performance than the RocksDB system under various settings.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":"33 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136282523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enriching Recommendation Models with Logic Conditions
Lihang Fan, Wenfei Fan, Ping Lu, Chao Tian, Qiang Yin
This paper proposes RecLogic, a framework for improving the accuracy of machine learning (ML) models for recommendation. It aims to enhance existing ML models with logic conditions to reduce false positives and false negatives, without training a new model. Underlying RecLogic are (a) a class of prediction rules on graphs, denoted TIEs, (b) a new approach to learning TIEs, and (c) a new paradigm for recommendation with TIEs. TIEs may embed ML recommendation models as predicates; in contrast to prior graph rules, it is tractable to decide whether a graph satisfies a set of TIEs. To enrich ML models, RecLogic iteratively trains a generator with feedback from each round to learn TIEs with a probabilistic bound. RecLogic also provides a PTIME parallel algorithm for making recommendations with the learned TIEs. Using real-life data, we empirically verify that RecLogic improves the accuracy of ML predictions by 22.89% on average, and by up to 33.10%, in regions where the prediction strength is neither sufficiently large nor sufficiently small.
{"title":"Enriching Recommendation Models with Logic Conditions","authors":"Lihang Fan, Wenfei Fan, Ping Lu, Chao Tian, Qiang Yin","doi":"10.1145/3617330","DOIUrl":"https://doi.org/10.1145/3617330","url":null,"abstract":"This paper proposes RecLogic, a framework for improving the accuracy of machine learning (ML) models for recommendation. It aims to enhance existing ML models with logic conditions to reduce false positives and false negatives, without training a new model. Underlying RecLogic are (a) a class of prediction rules on graphs, denoted by TIEs, (b) a new approach to learning TIEs, and (c) a new paradigm for recommendation with TIEs. TIEs may embed ML recommendation models as predicates; as opposed to prior graph rules, it is tractable to decide whether a graph satisfies a set of TIEs. To enrich ML models, RecLogic iteratively trains a generator with feedback from each round, to learn TIEs with a probabilistic bound. RecLogic also provides a PTIME parallel algorithm for making recommendations with the learned TIEs. Using real-life data, we empirically verify that RecLogic improves the accuracy of ML predictions by 22.89% on average in an area where the prediction strength is neither sufficiently large nor sufficiently small, up to 33.10%.","PeriodicalId":498157,"journal":{"name":"Proceedings of the ACM on Management of Data","volume":"33 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136282526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}