MUSET: set of utilities for constructing abundance unitig matrices from sequencing data.

IF 5.4 Bioinformatics (Oxford, England) Pub Date : 2025-03-04 DOI:10.1093/bioinformatics/btaf054

Riccardo Vicedomini, Francesco Andreace, Yoann Dufresne, Rayan Chikhi, Camila Duitama González


Abstract

Summary: MUSET is a novel set of utilities designed to efficiently construct abundance unitig matrices from sequencing data. Unitig matrices extend the concept of k-mer matrices by merging overlapping k-mers that unambiguously belong to the same sequence. MUSET addresses the limitations of current software by integrating k-mer counting and unitig extraction to generate unitig matrices that contain abundance values, whereas previous tools record only presence-absence. These matrices preserve variations between samples while requiring less disk space and fewer rows than k-mer matrices. We evaluated MUSET's performance using datasets derived from a 618-GB collection of ancient oral sequencing samples, producing a filtered unitig matrix that records abundances in <10 h with 20 GB of memory.

Availability and implementation: MUSET is open source and publicly available under the AGPL-3.0 licence on GitHub at https://github.com/CamilaDuitama/muset. The source code is implemented in C++ and is provided with kmat_tools, a collection of tools for processing k-mer matrices. Version v0.5.1 is available on Zenodo with DOI 10.5281/zenodo.14164801.


Supplementary information: Supplementary data are available at Bioinformatics online.
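The idea of a unitig matrix described in the summary can be illustrated with a toy sketch. The code below is not MUSET's implementation: the function names (`compact_unitigs`, `unitig_matrix`) are hypothetical, only the forward strand is considered (no reverse complements), and taking the mean k-mer count per sample as the unitig abundance is an assumption made for illustration. It merges k-mers along non-branching paths and collapses the corresponding rows of a k-mer abundance matrix into one row per unitig.

```python
def compact_unitigs(kmers):
    """Group k-mers into maximal non-branching paths (unitigs).

    Toy assumption: forward strand only, no reverse complements.
    Returns a list of paths, each a list of consecutive k-mers.
    """
    kmers = set(kmers)

    def successors(km):
        return [km[1:] + c for c in "ACGT" if km[1:] + c in kmers]

    def predecessors(km):
        return [c + km[:-1] for c in "ACGT" if c + km[:-1] in kmers]

    def starts_unitig(km):
        # A k-mer continues a unitig only if it has a single predecessor
        # and that predecessor has a single successor.
        preds = predecessors(km)
        return len(preds) != 1 or len(successors(preds[0])) != 1

    def walk(start, visited):
        path = [start]
        visited.add(start)
        while True:
            nxt = successors(path[-1])
            if (len(nxt) != 1 or nxt[0] in visited
                    or len(predecessors(nxt[0])) != 1):
                return path
            path.append(nxt[0])
            visited.add(nxt[0])

    visited, paths = set(), []
    for km in sorted(kmers):  # paths with a well-defined start
        if km not in visited and starts_unitig(km):
            paths.append(walk(km, visited))
    for km in sorted(kmers):  # leftovers are isolated cycles
        if km not in visited:
            paths.append(walk(km, visited))
    return paths


def unitig_matrix(kmer_counts):
    """Collapse a k-mer abundance matrix into a unitig abundance matrix.

    kmer_counts maps each k-mer to a list of counts, one per sample.
    Assumption: unitig abundance = mean count of its k-mers per sample.
    """
    matrix = {}
    for path in compact_unitigs(kmer_counts):
        sequence = path[0] + "".join(km[-1] for km in path[1:])
        columns = zip(*(kmer_counts[km] for km in path))
        matrix[sequence] = [sum(col) / len(path) for col in columns]
    return matrix


# Five k-mer rows (k = 3) over two samples collapse into two unitig rows.
kmer_counts = {
    "ACG": [2, 0], "CGT": [4, 0], "GTT": [3, 0],
    "TGC": [0, 5], "GCA": [0, 7],
}
print(unitig_matrix(kmer_counts))
```

In this example the five k-mer rows reduce to two unitig rows, `ACGTT` and `TGCA`, while the difference between the two samples remains visible in the abundance columns.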
