Efficient k-mer based curation of raw sequence data: application in Drosophila suzukii

Mathieu Gautier
Peer community journal. Published 2023-09-01. DOI: 10.24072/pcjournal.309 (https://doi.org/10.24072/pcjournal.309)

Abstract

Several studies have highlighted the presence of contaminated entries in public sequence repositories, calling for special attention to the associated metadata. Here, we propose and evaluate a fast and efficient k-mer-based approach to assess the degree of mislabeling or contamination. We applied it to high-throughput whole-genome raw sequence data for 236 Ind-Seq and 22 Pool-Seq samples of the invasive species Drosophila suzukii. We first used the CLARK software to build a dictionary of species-discriminating k-mers from the curated assemblies of 29 target drosophilid species (including D. melanogaster, D. simulans, D. subpulchrella, and D. biarmipes) and 12 common Drosophila pathogens and commensals (including Wolbachia). For each query sequence, counting the k-mers that matched a discriminating k-mer from the dictionary provided a simple criterion for assigning the sequence to a target species and for evaluating the sample as a whole. Analyses of a wide range of samples, representative of both target and other drosophilid species, demonstrated very good performance of the proposed approach in terms of both run time and accuracy of sequence assignment. Of the 236 D. suzukii individuals, five were reassigned to D. simulans and eleven to D. subpulchrella. Another four showed moderate to substantial microbial contamination. Similarly, among the 22 Pool-Seq samples analyzed, two from the native range were found to be contaminated with 1 and 7 D. subpulchrella individuals, respectively (out of 50), and one from Europe was found to be contaminated with 5 to 6 D. immigrans individuals (out of 100). Overall, the present analysis allowed the definition of a large curated dataset consisting of more than 60 population samples representative of the worldwide genetic diversity, which may be valuable for further population genetics studies on D. suzukii. More generally, while we advocate careful sample identification and verification prior to sequencing, the proposed framework is simple and computationally efficient enough to be included as a routine post-hoc quality check prior to any data analysis and prior to data submission to public repositories.
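
The counting criterion described in the abstract can be illustrated with a minimal Python sketch. This is not the CLARK implementation and not the author's pipeline: the function names (`build_dictionary`, `assign_read`, `summarize_sample`), the toy references, and the tie-handling rule are all assumptions made for illustration, and reverse-complement strands are ignored for brevity. The sketch keeps only k-mers that occur in a single target, assigns each read to the target with the most matching k-mers, and reports per-target fractions for a sample.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq (uppercased), skipping k-mers with non-ACGT bases."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


def build_dictionary(references, k):
    """Map each k-mer seen in exactly one target to that target's name.

    `references` is a hypothetical dict: target name -> list of sequences.
    """
    owners = defaultdict(set)
    for target, sequences in references.items():
        for seq in sequences:
            for kmer in kmers(seq, k):
                owners[kmer].add(target)
    # k-mers shared by two or more targets are not discriminating: drop them.
    return {km: next(iter(ts)) for km, ts in owners.items() if len(ts) == 1}


def assign_read(read, dictionary, k):
    """Return the target with the most discriminating k-mer hits, or None."""
    hits = Counter(dictionary[km] for km in kmers(read, k) if km in dictionary)
    ranked = hits.most_common(2)
    if not ranked:
        return None  # no discriminating k-mer matched
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between targets: leave unassigned (assumed rule)
    return ranked[0][0]


def summarize_sample(reads, dictionary, k):
    """Return the fraction of assigned reads attributed to each target."""
    calls = Counter(assign_read(read, dictionary, k) for read in reads)
    calls.pop(None, None)  # ignore unassigned reads
    total = sum(calls.values())
    return {target: n / total for target, n in calls.items()} if total else {}


# Toy usage with made-up sequences (real k values are much larger).
toy_refs = {"species_A": ["ACGTAAAA"], "species_B": ["ACGTCCCC"]}
toy_dict = build_dictionary(toy_refs, k=4)
toy_summary = summarize_sample(["TAAAA", "TAAAA", "TCCCC", "ACGT"], toy_dict, k=4)
```

In this toy case the k-mer `ACGT` occurs in both references, so it is excluded from the dictionary and a read consisting only of `ACGT` stays unassigned. A sample-level summary of this kind is what would flag a mislabeled individual (most reads assigned to another species) or a contaminated pool (a minority fraction assigned to another species).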