High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism
Authors: Yifan Li, Giulia Guidi
Journal: arXiv - QuanBio - Genomics
Publication date: 2024-07-10
DOI: arxiv-2407.07718 (https://doi.org/arxiv-2407.07718)
Citations: 0
Abstract
In generating large quantities of DNA data, high-throughput sequencing
technologies require advanced bioinformatics infrastructures for efficient data
analysis. k-mer counting, the process of quantifying the frequency of
DNA subsequences of fixed length k, is a fundamental step in various
bioinformatics pipelines, including genome assembly and protein prediction. Due
to the growing volume of data, the scaling of the counting process is critical.
In the literature, distributed memory software uses hash tables, which exhibit
poor cache friendliness and consume excessive memory. They often also lack
support for flexible parallelism, which makes integration into existing
bioinformatics pipelines difficult. In this work, we propose HySortK, a highly
efficient sorting-based distributed memory k-mer counter. HySortK reduces the
communication volume through a carefully designed communication scheme and
domain-specific optimization strategies. Furthermore, we introduce an abstract
task layer for flexible hybrid parallelism to address load imbalances in
different scenarios. HySortK achieves a 2-10x speedup compared to the GPU
baseline on 4 and 8 nodes. Compared to state-of-the-art CPU software, HySortK
achieves up to 2x speedup while reducing peak memory usage by 30% on 16 nodes.
Finally, we integrated HySortK into an existing genome assembly pipeline and
achieved up to 1.8x speedup, proving its flexibility and practicality in
real-world scenarios.
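
To make the contrast between hash-table and sorting-based counting concrete, the following is a minimal single-process sketch of the sorting idea: pack every k-mer into an integer, sort the integers, and count runs of equal values. It is an illustration only and is not HySortK's implementation. The function names (`encode_kmers`, `decode_kmer`, `count_kmers_by_sorting`), the 2-bit encoding, and the choice to restart the window at non-ACGT characters are all assumptions made for this example. Distributed communication, hybrid parallelism, and canonical (reverse-complement) k-mers are deliberately left out.

```python
from itertools import groupby

# Assumed 2-bit encoding of the four DNA bases (illustrative choice).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmers(read, k):
    """Yield each k-mer of `read` packed into an integer, 2 bits per base."""
    mask = (1 << (2 * k)) - 1
    value = 0
    valid = 0  # number of consecutive valid bases in the current window
    for base in read:
        code = ENCODE.get(base)
        if code is None:
            # Non-ACGT character (e.g. 'N'): restart the window.
            value, valid = 0, 0
            continue
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            yield value


def decode_kmer(value, k):
    """Unpack an integer produced by encode_kmers back into a string."""
    return "".join("ACGT"[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))


def count_kmers_by_sorting(reads, k):
    """Count k-mers by sorting packed integers and measuring run lengths."""
    kmers = sorted(km for read in reads for km in encode_kmers(read, k))
    # After sorting, equal k-mers are adjacent, so one linear scan
    # over the sorted array is enough to count them.
    return {
        decode_kmer(km, k): sum(1 for _ in group)
        for km, group in groupby(kmers)
    }


if __name__ == "__main__":
    for kmer, count in count_kmers_by_sorting(["ACGTACGT", "ACNGT"], 3).items():
        print(kmer, count)
```

The counting pass reads the sorted array sequentially, which is the kind of cache-friendly access pattern the abstract contrasts with hash-table lookups. A distributed counter would additionally have to decide which process owns each k-mer before sorting locally, which this sketch does not attempt.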