TagDigger: user-friendly extraction of read counts from GBS and RAD-seq data.

Source Code for Biology and Medicine (Q2, Decision Sciences). Pub Date: 2016-07-11. eCollection Date: 2016-01-01. DOI: 10.1186/s13029-016-0057-7

Lindsay V Clark, Erik J Sacks

Citations: 11

Abstract

Background: In genotyping-by-sequencing (GBS) and restriction site-associated DNA sequencing (RAD-seq), read depth is important for assessing the quality of genotype calls and estimating allele dosage in polyploids. However, existing pipelines for GBS and RAD-seq do not provide read counts in formats that are both accurate and easy to access. Additionally, although existing pipelines allow previously mined SNPs to be genotyped on new samples, they do not allow the user to manually specify a subset of loci to examine. Pipelines that do not use a reference genome assign arbitrary names to SNPs, making meta-analysis across projects difficult.

Results: We created the software TagDigger, which includes three programs for analyzing GBS and RAD-seq data. The first script, tagdigger_interactive.py, rapidly extracts read counts and genotypes from FASTQ files using user-supplied sets of barcodes and tags. Input and output are in CSV format so that they can be opened in spreadsheet software. Tag sequences can also be imported from the Stacks, TASSEL-GBSv2, TASSEL-UNEAK, or pyRAD pipelines, and a separate file listing the names of markers to retain can be imported. A second script, tag_manager.py, consolidates marker names and sequences across multiple projects. A third script, barcode_splitter.py, assists with preparing FASTQ data for deposit in a public archive by splitting FASTQ files by barcode and generating MD5 checksums for the resulting files.
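To make the idea of extracting read counts from barcodes and tags concrete, here is a minimal Python 3 sketch. It is not TagDigger's code or its search algorithm: the function names `count_tags` and `counts_to_csv` are hypothetical, matching is a naive exact prefix test, and it assumes that each read starts with a barcode immediately followed by the tag sequence and that no barcode is a prefix of another.

```python
import csv
import io


def count_tags(fastq_lines, barcodes, tags):
    """Count reads per (sample, tag) by exact prefix matching.

    fastq_lines: iterable of FASTQ lines (4 lines per record).
    barcodes:    dict mapping barcode sequence -> sample name.
    tags:        dict mapping tag name -> tag sequence.
    Assumption (illustrative only): barcode comes first in the read,
    followed directly by the tag; no barcode is a prefix of another.
    """
    counts = {sample: dict.fromkeys(tags, 0) for sample in barcodes.values()}
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:  # the sequence is the 2nd line of each record
            continue
        read = line.strip().upper()
        for barcode, sample in barcodes.items():
            if not read.startswith(barcode):
                continue
            rest = read[len(barcode):]
            for name, seq in tags.items():
                if rest.startswith(seq):
                    counts[sample][name] += 1
                    break  # at most one tag counted per read
            break  # barcode found; stop checking other barcodes
    return counts


def counts_to_csv(counts, tag_names):
    """Render the counts as CSV text: one row per sample, one column per tag."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sample"] + list(tag_names))
    for sample in sorted(counts):
        writer.writerow([sample] + [counts[sample][t] for t in tag_names])
    return buf.getvalue()


if __name__ == "__main__":
    # Made-up barcodes, tags, and reads, for illustration only.
    demo_barcodes = {"ACGT": "sample1", "TTAG": "sample2"}
    demo_tags = {"locus1_A": "CAGCAA", "locus1_B": "CAGCAT"}
    demo_fastq = [
        "@r1", "ACGTCAGCAAGG", "+", "IIIIIIIIIIII",
        "@r2", "TTAGCAGCATGG", "+", "IIIIIIIIIIII",
    ]
    print(counts_to_csv(count_tags(demo_fastq, demo_barcodes, demo_tags),
                        list(demo_tags)))
```

The resulting table has one row per sample and one column per tag, the kind of layout that opens directly in spreadsheet software. A real tool would also need to handle sequencing errors, adapter sequence, and compressed input, none of which this sketch attempts.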

Conclusions: TagDigger is open-source and freely available software written in Python 3. It uses a scalable, rapid search algorithm that can process over 100 million FASTQ reads per hour. TagDigger will run on a laptop with any operating system, does not consume hard drive space with intermediate files, and does not require programming skill to use.

About the journal: Source Code for Biology and Medicine is a peer-reviewed, open access, online journal that publishes articles on source code employed over a wide range of applications in biology and medicine. The journal's aim is to publish source code for distribution and use in the public domain in order to advance biological and medical research. Through this dissemination, it may be possible to shorten the time required for solving certain computational problems for which there is limited source code availability or resources.
