A Worst-Case Latency and Age Analysis of Coded Distributed Computing With Unreliable Workers and Periodic Tasks

IEEE Open Journal of the Communications Society (IF 6.3; JCR Q1, Engineering, Electrical & Electronic) | Pub Date: 2024-09-11 | DOI: 10.1109/OJCOMS.2024.3458802
Federico Chiariotti;Beatriz Soret;Petar Popovski
IEEE Open Journal of the Communications Society, vol. 5, pp. 5874-5889, 2024. https://ieeexplore.ieee.org/document/10677483/

Abstract

Over the past decade, the deep learning revolution has led to ever-increasing demands for computing power and working memory to support larger and larger neural networks. As this coincided with the end of Moore’s law, distributed solutions have emerged as a natural answer: in particular, the novel Coded Distributed Computing (CDC) paradigm exploits results from coding theory to divide large tasks into redundant sets of smaller subtasks to be processed across multiple workers, making the computation more robust to stragglers and malicious worker nodes. Optimizing the use of these distributed computing resources is critical, as excessive redundancy might degrade performance and increase energy consumption. This work considers a CDC system receiving periodic tasks and derives the full distribution of the latency, the reliability, and the Peak Age of Information (PAoI) under worker diversity and random failures. The CDC system is modeled as a fork-join $D/M/(K, N)/L$ queue, where only $K$ of the $N$ coded subtasks are necessary to solve the overall task, and workers can hold up to $L$ subtasks in their queues. Our results are useful for resource optimization, showing the relationship between system load, redundancy, and latency, as well as the trade-off between latency, reliability, and age performance.
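To make the fork-join $D/M/(K, N)/L$ description concrete, the following Python sketch simulates such a queue by Monte Carlo. It is only an illustration under assumptions that are not taken from the paper: the function name `simulate_cdc` and all of its parameters are hypothetical, each worker serves subtasks first-in first-out, the capacity $L$ counts the subtask in service, a subtask arriving at a full worker is dropped, and leftover subtasks are not cancelled once $K$ have finished. The paper derives these distributions analytically; this code does not reproduce that analysis.

```python
import random


def simulate_cdc(num_tasks, period, rates, k, capacity, seed=0):
    """Simulate periodic tasks, each split into len(rates) coded subtasks.

    rates[w] is the exponential service rate of worker w (hypothetical
    parameter). Returns one entry per task: the latency, i.e. the k-th
    subtask completion time minus the task arrival time, or None if
    fewer than k subtasks were accepted by the workers.
    """
    rng = random.Random(seed)
    # Per worker: departure times of accepted subtasks, in FIFO order.
    departures = [[] for _ in rates]
    latencies = []
    for j in range(num_tasks):
        arrival = j * period  # deterministic (periodic) arrivals
        done_times = []
        for w, mu in enumerate(rates):
            # Keep only subtasks still in the worker's system at arrival.
            pending = [d for d in departures[w] if d > arrival]
            departures[w] = pending
            if len(pending) >= capacity:
                continue  # assumed behavior: full queue drops the subtask
            # Service starts when the last queued subtask departs.
            start = pending[-1] if pending else arrival
            finish = start + rng.expovariate(mu)
            departures[w].append(finish)
            done_times.append(finish)
        if len(done_times) >= k:
            done_times.sort()
            latencies.append(done_times[k - 1] - arrival)
        else:
            latencies.append(None)
    return latencies


if __name__ == "__main__":
    # Example values chosen arbitrarily for illustration.
    lat = simulate_cdc(num_tasks=10000, period=1.0,
                       rates=[1.5, 1.5, 1.0, 1.0, 0.5], k=3, capacity=2)
    ok = [x for x in lat if x is not None]
    print("empirical reliability:", len(ok) / len(lat))
    print("mean latency of solved tasks:", sum(ok) / len(ok))
```

Sweeping `period`, `k`, or the number of workers in a sketch like this gives a rough empirical view of how load and redundancy interact with latency, which is the relationship the abstract says the analysis characterizes.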
Source journal: IEEE Open Journal of the Communications Society
CiteScore: 13.70
Self-citation rate: 3.80%
Articles published: 94
Review time: 10 weeks
Journal description: The IEEE Open Journal of the Communications Society (OJ-COMS) is an open access, all-electronic journal that publishes original high-quality manuscripts on advances in the state of the art of telecommunications systems and networks. The papers in IEEE OJ-COMS are included in Scopus. Submissions reporting new theoretical findings (including novel methods, concepts, and studies) and practical contributions (including experiments and development of prototypes) are welcome. Additionally, survey and tutorial articles are considered. The IEEE OJ-COMS received its debut impact factor of 7.9 according to the Journal Citation Reports (JCR) 2023. The journal covers science, technology, applications, and standards for information organization, collection, and transfer using electronic, optical, and wireless channels and networks. Some specific areas covered include: systems and network architecture, control and management; protocols, software, and middleware; quality of service, reliability, and security; modulation, detection, coding, and signaling; switching and routing; mobile and portable communications; terminals and other end-user devices; networks for content distribution and distributed computing; communications-based distributed resources control.