TagDigger:用户友好的从GBS和RAD-seq数据中提取读取计数。

Q2 Decision Sciences Source Code for Biology and Medicine Pub Date : 2016-07-11 eCollection Date: 2016-01-01 DOI:10.1186/s13029-016-0057-7
Lindsay V Clark, Erik J Sacks
{"title":"TagDigger:用户友好的从GBS和RAD-seq数据中提取读取计数。","authors":"Lindsay V Clark,&nbsp;Erik J Sacks","doi":"10.1186/s13029-016-0057-7","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>In genotyping-by-sequencing (GBS) and restriction site-associated DNA sequencing (RAD-seq), read depth is important for assessing the quality of genotype calls and estimating allele dosage in polyploids. However, existing pipelines for GBS and RAD-seq do not provide read counts in formats that are both accurate and easy to access. Additionally, although existing pipelines allow previously-mined SNPs to be genotyped on new samples, they do not allow the user to manually specify a subset of loci to examine. Pipelines that do not use a reference genome assign arbitrary names to SNPs, making meta-analysis across projects difficult.</p><p><strong>Results: </strong>We created the software TagDigger, which includes three programs for analyzing GBS and RAD-seq data. The first script, tagdigger_interactive.py, rapidly extracts read counts and genotypes from FASTQ files using user-supplied sets of barcodes and tags. Input and output is in CSV format so that it can be opened by spreadsheet software. Tag sequences can also be imported from the Stacks, TASSEL-GBSv2, TASSEL-UNEAK, or pyRAD pipelines, and a separate file can be imported listing the names of markers to retain. A second script, tag_manager.py, consolidates marker names and sequences across multiple projects. A third script, barcode_splitter.py, assists with preparing FASTQ data for deposit in a public archive by splitting FASTQ files by barcode and generating MD5 checksums for the resulting files.</p><p><strong>Conclusions: </strong>TagDigger is open-source and freely available software written in Python 3. It uses a scalable, rapid search algorithm that can process over 100 million FASTQ reads per hour. TagDigger will run on a laptop with any operating system, does not consume hard drive space with intermediate files, and does not require programming skill to use.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":" ","pages":"11"},"PeriodicalIF":0.0000,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13029-016-0057-7","citationCount":"11","resultStr":"{\"title\":\"TagDigger: user-friendly extraction of read counts from GBS and RAD-seq data.\",\"authors\":\"Lindsay V Clark,&nbsp;Erik J Sacks\",\"doi\":\"10.1186/s13029-016-0057-7\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>In genotyping-by-sequencing (GBS) and restriction site-associated DNA sequencing (RAD-seq), read depth is important for assessing the quality of genotype calls and estimating allele dosage in polyploids. However, existing pipelines for GBS and RAD-seq do not provide read counts in formats that are both accurate and easy to access. Additionally, although existing pipelines allow previously-mined SNPs to be genotyped on new samples, they do not allow the user to manually specify a subset of loci to examine. Pipelines that do not use a reference genome assign arbitrary names to SNPs, making meta-analysis across projects difficult.</p><p><strong>Results: </strong>We created the software TagDigger, which includes three programs for analyzing GBS and RAD-seq data. The first script, tagdigger_interactive.py, rapidly extracts read counts and genotypes from FASTQ files using user-supplied sets of barcodes and tags. Input and output is in CSV format so that it can be opened by spreadsheet software. Tag sequences can also be imported from the Stacks, TASSEL-GBSv2, TASSEL-UNEAK, or pyRAD pipelines, and a separate file can be imported listing the names of markers to retain. A second script, tag_manager.py, consolidates marker names and sequences across multiple projects. A third script, barcode_splitter.py, assists with preparing FASTQ data for deposit in a public archive by splitting FASTQ files by barcode and generating MD5 checksums for the resulting files.</p><p><strong>Conclusions: </strong>TagDigger is open-source and freely available software written in Python 3. It uses a scalable, rapid search algorithm that can process over 100 million FASTQ reads per hour. TagDigger will run on a laptop with any operating system, does not consume hard drive space with intermediate files, and does not require programming skill to use.</p>\",\"PeriodicalId\":35052,\"journal\":{\"name\":\"Source Code for Biology and Medicine\",\"volume\":\" \",\"pages\":\"11\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-07-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1186/s13029-016-0057-7\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Source Code for Biology and Medicine\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1186/s13029-016-0057-7\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2016/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q2\",\"JCRName\":\"Decision Sciences\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Source Code for Biology and Medicine","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1186/s13029-016-0057-7","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2016/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"Decision Sciences","Score":null,"Total":0}
引用次数: 11

摘要

背景:在测序基因分型(GBS)和限制性内切位点相关DNA测序(RAD-seq)中,读取深度对于评估多倍体基因型的质量和估计等位基因的剂量是重要的。然而,GBS和RAD-seq的现有管道没有以既准确又易于访问的格式提供读计数。此外,尽管现有的管道允许先前挖掘的snp在新样品上进行基因分型,但它们不允许用户手动指定要检查的位点子集。不使用参考基因组的管道为snp指定任意名称,使得跨项目的元分析变得困难。结果:我们创建了TagDigger软件,该软件包含三个用于分析GBS和RAD-seq数据的程序。第一个脚本tagdigger_interactive.py使用用户提供的条形码和标签集从FASTQ文件中快速提取读取计数和基因型。输入输出为CSV格式,可通过电子表格软件打开。标签序列也可以从Stacks、TASSEL-GBSv2、TASSEL-UNEAK或pyRAD管道中导入,并且可以导入一个单独的文件,列出要保留的标记的名称。第二个脚本tag_manager.py整合了跨多个项目的标记名和序列。第三个脚本barcode_splitter.py通过按条形码分割FASTQ文件并为生成的文件生成MD5校验和,帮助准备FASTQ数据以存放到公共存档中。结论:TagDigger是一个开源的免费软件,使用Python 3编写。它使用可扩展的快速搜索算法,每小时可以处理超过1亿次FASTQ读取。TagDigger可以在任何操作系统的笔记本电脑上运行,不需要中间文件占用硬盘空间,也不需要编程技能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。

摘要图片

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
TagDigger: user-friendly extraction of read counts from GBS and RAD-seq data.

Background: In genotyping-by-sequencing (GBS) and restriction site-associated DNA sequencing (RAD-seq), read depth is important for assessing the quality of genotype calls and estimating allele dosage in polyploids. However, existing pipelines for GBS and RAD-seq do not provide read counts in formats that are both accurate and easy to access. Additionally, although existing pipelines allow previously-mined SNPs to be genotyped on new samples, they do not allow the user to manually specify a subset of loci to examine. Pipelines that do not use a reference genome assign arbitrary names to SNPs, making meta-analysis across projects difficult.

Results: We created the software TagDigger, which includes three programs for analyzing GBS and RAD-seq data. The first script, tagdigger_interactive.py, rapidly extracts read counts and genotypes from FASTQ files using user-supplied sets of barcodes and tags. Input and output is in CSV format so that it can be opened by spreadsheet software. Tag sequences can also be imported from the Stacks, TASSEL-GBSv2, TASSEL-UNEAK, or pyRAD pipelines, and a separate file can be imported listing the names of markers to retain. A second script, tag_manager.py, consolidates marker names and sequences across multiple projects. A third script, barcode_splitter.py, assists with preparing FASTQ data for deposit in a public archive by splitting FASTQ files by barcode and generating MD5 checksums for the resulting files.

Conclusions: TagDigger is open-source and freely available software written in Python 3. It uses a scalable, rapid search algorithm that can process over 100 million FASTQ reads per hour. TagDigger will run on a laptop with any operating system, does not consume hard drive space with intermediate files, and does not require programming skill to use.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Source Code for Biology and Medicine
Source Code for Biology and Medicine Decision Sciences-Information Systems and Management
自引率
0.00%
发文量
0
期刊介绍: Source Code for Biology and Medicine is a peer-reviewed open access, online journal that publishes articles on source code employed over a wide range of applications in biology and medicine. The journal"s aim is to publish source code for distribution and use in the public domain in order to advance biological and medical research. Through this dissemination, it may be possible to shorten the time required for solving certain computational problems for which there is limited source code availability or resources.
期刊最新文献
2DKD: a toolkit for content-based local image search. Computing and graphing probability values of pearson distributions: a SAS/IML macro. iPBAvizu: a PyMOL plugin for an efficient 3D protein structure superimposition approach Social support for collaboration and group awareness in life science research teams. MZPAQ: a FASTQ data compression tool.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1