Cover Your Bases: How to Minimize the Sequencing Coverage in DNA Storage Systems

IEEE Transactions on Information Theory · Impact factor: 2.2 · CAS Zone 3 (Computer Science) · JCR Q3 (Computer Science, Information Systems) · Pub Date: 2024-11-12 · DOI: 10.1109/TIT.2024.3496587

Daniella Bar-Lev, Omer Sabary, Ryan Gabrys, Eitan Yaakobi

Published in IEEE Transactions on Information Theory, vol. 71, no. 1, pp. 192-218. Full text: https://ieeexplore.ieee.org/document/10750859/


Abstract

Although the expenses associated with DNA sequencing have been rapidly decreasing, the current cost of sequencing information stands at roughly $\$120$/GB, which is dramatically more expensive than reading from existing archival storage solutions today. In this work, we aim to reduce not only the cost but also the latency of DNA storage by initiating the study of the DNA coverage depth problem, which aims to reduce the required number of reads to retrieve information from the storage system. Under this framework, our main goal is to understand the effect of error-correcting codes and retrieval algorithms on the required sequencing coverage depth. We establish that the expected number of reads that are required for information retrieval is minimized when the channel follows a uniform distribution. We also derive upper and lower bounds on the probability distribution of this number of required reads and provide a comprehensive upper and lower bound on its expected value. We further prove that for a noiseless channel and uniform distribution, MDS codes are optimal in terms of minimizing the expected number of reads. Additionally, we study the DNA coverage depth problem under the random-access setup, in which the user aims to retrieve just a specific information unit from the entire DNA storage system. We prove that the expected retrieval time is at least $k$ for $[n,k]$ MDS codes as well as for other families of codes. Furthermore, we present explicit code constructions that achieve expected retrieval times below $k$ and evaluate their performance through analytical methods and simulations. Lastly, we provide lower bounds on the maximum expected retrieval time. Our findings offer valuable insights for reducing the cost and latency of DNA storage.
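
To make the coverage depth question concrete, the following is a minimal sketch rather than the paper's own method. It assumes a simplified model: a noiseless channel, each read drawn uniformly at random (with replacement) from the $n$ encoded strands, and an $[n,k]$ MDS code, so that any $k$ distinct strands suffice to decode. Under these assumptions the number of reads is a partial coupon-collector quantity with expectation $\sum_{i=0}^{k-1} n/(n-i)$. The function names `expected_reads` and `simulate_reads` are hypothetical and chosen only for illustration.

```python
import random
from fractions import Fraction


def expected_reads(n: int, k: int) -> Fraction:
    """Expected number of uniform reads (with replacement) over n strands
    until k distinct strands have been seen.

    Assumed model: noiseless channel, uniform sampling, and an [n, k] MDS
    code for which any k distinct strands are enough to decode.
    """
    if not 0 <= k <= n:
        raise ValueError("require 0 <= k <= n")
    # After i distinct strands are seen, a new one appears with probability
    # (n - i) / n, so the wait for the next one is geometric with mean n / (n - i).
    return sum((Fraction(n, n - i) for i in range(k)), Fraction(0))


def simulate_reads(n: int, k: int, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same expectation under the same assumptions."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen = set()
        reads = 0
        while len(seen) < k:
            seen.add(rng.randrange(n))  # one read = one uniformly chosen strand
            reads += 1
        total += reads
    return total / trials


if __name__ == "__main__":
    k = 100
    # Uncoded baseline (n = k): every one of the k strands must be read.
    print("uncoded:", float(expected_reads(k, k)))
    # Adding redundancy (n > k) lowers the expected number of reads.
    print("n = 120:", float(expected_reads(120, k)))
    print("simulated:", simulate_reads(120, k, trials=2000))
```

Comparing `expected_reads(k, k)` with `expected_reads(n, k)` for `n > k` shows how redundancy trades storage overhead for fewer reads in this toy model; it does not account for sequencing errors or non-uniform read distributions.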


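Source journal: IEEE Transactions on Information Theory (Engineering and Technology: Electrical and Electronic Engineering)

CiteScore: 5.70

Self-citation rate: 20.00%

Articles published: 514

Review time: 12 months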

About the journal: The IEEE Transactions on Information Theory is a journal that publishes theoretical and experimental papers concerned with the transmission, processing, and utilization of information. The boundaries of acceptable subject matter are intentionally not sharply delimited. Rather, it is hoped that as the focus of research activity changes, a flexible policy will permit this Transactions to follow suit. Current appropriate topics are best reflected by recent Tables of Contents; they are summarized in the titles of editorial areas that appear on the inside front cover.
