k 名义空间:用反向补数绘制草图

Guillaume Marçais, C S Elder, Carl Kingsford
{"title":"k 名义空间:用反向补数绘制草图","authors":"Guillaume Marçais, C S Elder, Carl Kingsford","doi":"10.1093/bioinformatics/btae629","DOIUrl":null,"url":null,"abstract":"<p><strong>Motivation: </strong>Sequences equivalent to their reverse complements (i.e. double-stranded DNA) have no analogue in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g. sketching algorithms) are designed and tested in the same way as classical string algorithms. Then, as a post-processing step, these algorithms are adapted to work with genomic sequences by folding a k-mer and its reverse complement into a single sequence: The canonical representation (k-nonical space).</p><p><strong>Results: </strong>The effect of using the canonical representation with sketching methods is understudied and not understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms not designed to accommodate for them. In particular, we show that large stretches of the genome (\"sketching deserts\") are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to subsequent algorithms using these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accommodate for these effects: (i) a new procedure that adapts existing sketching methods to k-nonical space and (ii) an optimization procedure to directly design new sketching methods for k-nonical space.</p><p><strong>Availability and implementation: </strong>The code used in this analysis is available under a permissive license at https://github.com/Kingsford-Group/mdsscope.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11549021/pdf/","citationCount":"0","resultStr":"{\"title\":\"k-nonical space: sketching with reverse complements.\",\"authors\":\"Guillaume Marçais, C S Elder, Carl Kingsford\",\"doi\":\"10.1093/bioinformatics/btae629\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Motivation: </strong>Sequences equivalent to their reverse complements (i.e. double-stranded DNA) have no analogue in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g. sketching algorithms) are designed and tested in the same way as classical string algorithms. Then, as a post-processing step, these algorithms are adapted to work with genomic sequences by folding a k-mer and its reverse complement into a single sequence: The canonical representation (k-nonical space).</p><p><strong>Results: </strong>The effect of using the canonical representation with sketching methods is understudied and not understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms not designed to accommodate for them. In particular, we show that large stretches of the genome (\\\"sketching deserts\\\") are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to subsequent algorithms using these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accommodate for these effects: (i) a new procedure that adapts existing sketching methods to k-nonical space and (ii) an optimization procedure to directly design new sketching methods for k-nonical space.</p><p><strong>Availability and implementation: </strong>The code used in this analysis is available under a permissive license at https://github.com/Kingsford-Group/mdsscope.</p>\",\"PeriodicalId\":93899,\"journal\":{\"name\":\"Bioinformatics (Oxford, England)\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11549021/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Bioinformatics (Oxford, England)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1093/bioinformatics/btae629\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioinformatics (Oxford, England)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/bioinformatics/btae629","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

动机与反向互补序列(即双链 DNA)等价的序列在文本分析和非生物字符串算法中并不存在。尽管存在这种显著差异,但为计算生物学设计的算法(如草图算法)的设计和测试方法与经典字符串算法相同。然后,作为后处理步骤,通过将 k-mer 及其反向补码折叠成单一序列,使这些算法适用于基因组序列:结果:对草图绘制方法使用规范表示法的效果研究不足,也不了解。作为第一步,我们使用无上下文草图方法来说明使用非标准 k-mers 的字符串算法可能产生的不利影响。特别是,我们展示了基因组的大片段("草图沙漠")被无上下文草图方法采样不足或完全跳过,从而有效地使使用这些草图的后续算法看不到这些基因组区域。我们提供了显示这些影响的经验数据,并建立了解释草图沙漠出现的理论框架。最后,我们提出了两种方案来适应这些效应:(1)一种新的程序,将现有的草图绘制方法适应于 k-nonical 空间;(2)一种优化程序,直接为 k-nonical 空间设计新的草图绘制方法:本分析中使用的代码可在 https://github.com/Kingsford-Group/mdsscope.Supplementary 信息网站的许可下获取:补充数据可在牛津生物信息学网站获取。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
k-nonical space: sketching with reverse complements.

Motivation: Sequences equivalent to their reverse complements (i.e. double-stranded DNA) have no analogue in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g. sketching algorithms) are designed and tested in the same way as classical string algorithms. Then, as a post-processing step, these algorithms are adapted to work with genomic sequences by folding a k-mer and its reverse complement into a single sequence: The canonical representation (k-nonical space).

Results: The effect of using the canonical representation with sketching methods is understudied and not understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms not designed to accommodate for them. In particular, we show that large stretches of the genome ("sketching deserts") are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to subsequent algorithms using these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accommodate for these effects: (i) a new procedure that adapts existing sketching methods to k-nonical space and (ii) an optimization procedure to directly design new sketching methods for k-nonical space.

Availability and implementation: The code used in this analysis is available under a permissive license at https://github.com/Kingsford-Group/mdsscope.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Damsel: analysis and visualisation of DamID sequencing in R. BWT construction and search at the terabase scale. HAlign 4: A New Strategy for Rapidly Aligning Millions of Sequences. Phasing Nanopore genome assembly by integrating heterozygous variations and Hi-C data. STRprofiler: efficient comparisons of short tandem repeat profiles for biomedical model authentication.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1