Keyed watermarks: A fine-grained watermark generation for Apache Flink
Authors: Tawfik Yasser, Tamer Arafa, Mohamed ElHelw, Ahmed Awad
Journal: Future Generation Computer Systems-The International Journal of Escience, volume 169, Article 107796 (impact factor 6.2; JCR Q1, Computer Science, Theory & Methods)
Publication date: 2025-03-11
Publication type: Journal Article
DOI: 10.1016/j.future.2025.107796
URL: https://www.sciencedirect.com/science/article/pii/S0167739X25000913
Citation count: 0
Abstract
Big data stream processing engines such as Apache Flink use windowing to manage unbounded streams of events. In event-time windowing, aggregating the relevant data within each window matters because it directly affects result accuracy. Watermarks, special timestamps that signal the progress of event time, play a pivotal role in this process. However, Apache Flink's existing watermark generation method operates at the level of the whole input stream and is therefore biased towards faster sub-streams, which causes events from slower sub-streams to be dropped. Our analysis shows that Apache Flink's standard watermark generation approach loses approximately 33% of the data when 50% of the median-proximate keys experience delays, and more than 37% when 50% of randomly selected keys are delayed. In this paper, we introduce keyed watermarks, an approach that addresses this data loss and improves data processing accuracy to at least 99% in most scenarios. The approach tracks progress separately for each sub-stream (key) by generating an individual watermark per key. We describe the architectural and API modifications required to integrate keyed watermarks and report on our experience extending Apache Flink's extensive codebase. We also compare our approach with the conventional watermark generation technique in terms of the accuracy of event-time tracking, the latency of watermark processing, and the growth of Flink's maintained state.
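The difference between a single stream-level watermark and per-key watermarks can be illustrated with a small simulation. The Python sketch below is not the authors' Flink implementation and does not use any Flink API. It assumes a simplified model in which the watermark trails the largest event time seen so far by a fixed bound, tumbling windows close once the watermark reaches their end, and events arriving for a closed window are dropped. The function name `count_dropped`, its parameters, and the sample events are all hypothetical.

```python
from collections import defaultdict


def count_dropped(events, window_size, max_out_of_orderness, keyed):
    """Count events dropped as late under a simplified watermark model.

    events: list of (key, event_time) tuples in arrival order.
    keyed:  if True, keep one watermark per key; otherwise keep a single
            watermark shared by the whole stream.
    """
    # One watermark per tracking scope; no progress has been seen initially.
    watermarks = defaultdict(lambda: float("-inf"))
    dropped = 0
    for key, event_time in events:
        scope = key if keyed else "__global__"
        # End (exclusive) of the tumbling window this event belongs to.
        window_end = (event_time // window_size + 1) * window_size
        # The window is treated as closed once the watermark reaches its end.
        if window_end <= watermarks[scope]:
            dropped += 1
            continue
        # The watermark trails the maximum event time by a fixed bound.
        watermarks[scope] = max(
            watermarks[scope], event_time - max_out_of_orderness
        )
    return dropped


# A fast key races ahead while a slow key's events arrive with a delay.
events = [
    ("fast", 1), ("fast", 5), ("fast", 12), ("fast", 14),
    ("slow", 3), ("slow", 7),
    ("fast", 21), ("slow", 13), ("slow", 22),
]

global_dropped = count_dropped(events, 10, 2, keyed=False)
keyed_dropped = count_dropped(events, 10, 2, keyed=True)
print(f"stream-level watermark: {global_dropped} dropped")
print(f"per-key watermarks:     {keyed_dropped} dropped")
```

In this toy input, the fast key pushes the shared watermark past the end of the first window before the slow key's first two events arrive, so those events are discarded. With one watermark per key, the slow key's own progress decides when its windows close and nothing is dropped. The sketch ignores costs the abstract mentions, such as the extra state kept per key.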
About the journal:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.