Authors: Barış Ekim, Bonnie Berger, Yaron Orenstein
Venue: Research in computational molecular biology : ... Annual International Conference, RECOMB ... : proceedings. RECOMB (Conference : 2005- ), vol. 12074, pp. 37-53
Published: 2020-05-01 (Epub 2020-04-21)
DOI: https://doi.org/10.1007/978-3-030-45257-5_3
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11148856/pdf/
A Randomized Parallel Algorithm for Efficiently Finding Near-Optimal Universal Hitting Sets.
As the volume of next-generation sequencing data increases, there is an urgent need for algorithms that process the data efficiently. Universal hitting sets (UHS) were recently introduced as an alternative to the central idea of minimizers in sequence analysis, with the hope that they could more efficiently address common tasks such as computing hash functions for read overlap, sparse suffix arrays, and Bloom filters. A UHS is a set of k-mers that hits every sequence of length L, and can thus serve as an index to L-long sequences. Unfortunately, methods for computing small UHSs are not yet practical for real-world sequencing instances because they are serial and deterministic, which leads to long runtimes and high memory demands when handling typical values of k (e.g. k > 13). To address this bottleneck, we present two algorithmic innovations that significantly decrease runtime while keeping memory usage low: (i) we leverage advanced theoretical and architectural techniques to parallelize the calculation of k-mer hitting numbers and decrease its memory usage; and (ii) we build upon techniques from randomized Set Cover to select universal k-mers much faster. We implemented these innovations in PASHA, the first randomized parallel algorithm for generating near-optimal UHSs, which newly handles k > 13. We demonstrate empirically that PASHA produces sets only slightly larger than those of serial deterministic algorithms; moreover, the set size is provably guaranteed to be within a small constant factor of the optimal size. PASHA improves runtime and memory usage by orders of magnitude over the current best algorithms. We expect our newly practical construction of UHSs to be adopted in many high-throughput sequence analysis pipelines.
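To make the definition of a UHS concrete, the following is a minimal brute-force sketch in Python. It is not PASHA and not the authors' method: it simply enumerates every length-L sequence and checks that each contains at least one k-mer from the candidate set. The function name `is_universal_hitting_set`, its parameters, and the two-letter toy alphabet in the usage lines are assumptions made for illustration. The enumeration is exponential in L, so it is only usable for very small parameters.

```python
from itertools import product


def is_universal_hitting_set(kmers, k, L, alphabet="ACGT"):
    """Brute-force check of the UHS definition (illustrative only).

    Returns True if every sequence of length L over `alphabet`
    contains at least one k-mer from `kmers` as a substring.
    """
    if k > L:
        raise ValueError("k must not exceed L")
    kmers = set(kmers)
    for letters in product(alphabet, repeat=L):
        seq = "".join(letters)
        # A sequence is "hit" if any of its L - k + 1 windows is in the set.
        hit = any(seq[i:i + k] in kmers for i in range(L - k + 1))
        if not hit:
            return False  # found an L-long sequence that avoids the set
    return True


# Toy example over a hypothetical two-letter alphabet, k = 2, L = 3.
# "ABA" contains only the 2-mers AB and BA, so {AA, BB} misses it.
print(is_universal_hitting_set({"AA", "BB"}, k=2, L=3, alphabet="AB"))
print(is_universal_hitting_set({"AA", "BB", "AB"}, k=2, L=3, alphabet="AB"))
```

The first call prints `False` and the second prints `True`. The check visits all |alphabet|^L sequences, which illustrates why exhaustive approaches do not scale to realistic values of k and L.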