Gecko: Efficient Sliding Window Aggregation With Granular-Based Bulk Eviction Over Big Data Streams
Jianjun Li; Yuhui Deng; Jiande Huang; Yi Zhou; Qifen Yang; Geyong Min
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 2, pp. 698-709. Published 2024-12-04.
DOI: 10.1109/TKDE.2024.3511334
https://ieeexplore.ieee.org/document/10777062/
Abstract
Sliding window aggregation, which extracts summaries from data streams, is a core operation in streaming analysis. Existing sliding window algorithms that perform single eviction and insertion operations can achieve a worst-case time complexity of $O(1)$ for in-order streams, but real-world data streams often contain out-of-order data and arrive in bursts, which poses performance challenges for these algorithms. To address this issue, we propose Gecko, a novel sliding window aggregation algorithm that supports bulk eviction. Gecko applies a granular-based eviction strategy across bulk sizes, enabling efficient bulk eviction while keeping performance close to that of in-order stream algorithms for single evictions. For large data bulks, Gecko performs coarse-grained eviction at the chunk level, followed by fine-grained eviction using leftward binary tree aggregation (LTA) as a complementary method. Moreover, Gecko partitions data into chunks so that out-of-order data does not affect other chunks, enabling efficient handling of out-of-order data streams. We conduct extensive experiments to evaluate Gecko. The results show that Gecko outperforms other solutions, consistent with theoretical expectations. In real-world data scenarios, Gecko improves the average throughput of the state-of-the-art algorithm b_FiBA by 1.7 times, with a maximum improvement of up to 3.5 times. Gecko also achieves the best latency among all compared schemes.
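The two-level eviction idea described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the class name `ChunkedWindow`, its methods, and the fixed-width time chunking are assumptions made for illustration, and the fine-grained step here simply recomputes the boundary chunk's aggregate instead of using LTA. It also assumes a commutative, associative operator (such as max or sum), so that an out-of-order insert can be folded into its chunk's partial aggregate without touching other chunks.

```python
from bisect import bisect_left, insort
from functools import reduce


class ChunkedWindow:
    """Toy sliding window over (timestamp, value) pairs, split into time chunks.

    Hypothetical illustration only; not the Gecko data structure.
    """

    def __init__(self, chunk_width, combine=max, identity=float("-inf")):
        self.chunk_width = chunk_width
        self.combine = combine      # assumed commutative and associative
        self.identity = identity
        self.ids = []               # sorted ids of chunks currently in the window
        self.items = {}             # chunk id -> sorted list of (timestamp, value)
        self.partial = {}           # chunk id -> aggregate of that chunk's values

    def insert(self, ts, value):
        # An out-of-order item only touches the chunk its timestamp falls in.
        cid = ts // self.chunk_width
        if cid not in self.items:
            insort(self.ids, cid)
            self.items[cid] = []
            self.partial[cid] = self.identity
        insort(self.items[cid], (ts, value))
        self.partial[cid] = self.combine(self.partial[cid], value)

    def evict_before(self, ts):
        boundary = ts // self.chunk_width
        # Coarse-grained step: drop every chunk that lies entirely before ts.
        cut = bisect_left(self.ids, boundary)
        for cid in self.ids[:cut]:
            del self.items[cid]
            del self.partial[cid]
        del self.ids[:cut]
        # Fine-grained step: trim the boundary chunk and recompute its aggregate
        # (plain recomputation stands in for the paper's LTA here).
        if self.ids and self.ids[0] == boundary:
            chunk = self.items[boundary]
            kept = chunk[bisect_left(chunk, (ts,)):]
            if kept:
                self.items[boundary] = kept
                self.partial[boundary] = reduce(
                    self.combine, (v for _, v in kept), self.identity
                )
            else:
                del self.items[boundary]
                del self.partial[boundary]
                del self.ids[0]

    def query(self):
        # Combine the per-chunk partial aggregates.
        return reduce(self.combine, (self.partial[c] for c in self.ids), self.identity)


window = ChunkedWindow(chunk_width=10)
for ts, value in [(1, 5), (12, 3), (25, 9), (7, 8), (31, 2), (27, 1)]:
    window.insert(ts, value)    # (7, 8) and (27, 1) arrive out of order
window.evict_before(26)         # drops chunks 0 and 1 whole, trims chunk 2
print(window.query())
```

In this sketch a bulk eviction costs one dictionary deletion per fully expired chunk plus work proportional to the single boundary chunk, which is the intuition behind pairing a coarse step with a fine one. The query still scans all chunk aggregates, which a real design would avoid.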
About the journal:
The IEEE Transactions on Knowledge and Data Engineering encompasses knowledge and data engineering aspects within computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.