Konstantinos A. Kyritsis, Nikolaos Pechlivanis, Fotis Psomopoulos
Software pipelines for RNA-Seq, ChIP-Seq and germline variant calling analyses in common workflow language (CWL)
Frontiers in Bioinformatics, published 2023-11-07. DOI: 10.3389/fbinf.2023.1275593 (https://doi.org/10.3389/fbinf.2023.1275593)
Abstract
Background: Automating data analysis pipelines is a key requirement for ensuring reproducibility of results, especially when dealing with large volumes of data. Here we assembled automated pipelines for the analysis of high-throughput sequencing (HTS) data originating from RNA-Seq, ChIP-Seq and germline variant calling experiments. We implemented these workflows in the Common Workflow Language (CWL) and evaluated their performance by: i) reproducing the results of two previously published studies on chronic lymphocytic leukemia (CLL), and ii) analyzing whole-genome sequencing data from four Genome in a Bottle Consortium (GIAB) samples and comparing the detected variants against their respective gold-standard truth sets. Findings: The CWL-implemented workflows achieved high accuracy in reproducing previously published results, discovering significant biomarkers and detecting germline SNP and small INDEL variants. Conclusion: CWL pipelines are reproducible and reusable; combined with containerization, they can overcome issues of software incompatibility and laborious configuration requirements. In addition, they are flexible and can be used as-is or adapted to the specific needs of an experiment or study. The CWL-based workflows developed in this study, along with version information for all software tools, are publicly available on GitHub (https://github.com/BiodataAnalysisGroup/CWL_HTS_pipelines) under the MIT License. They are suitable for the analysis of short-read (such as Illumina-based) data and constitute an open resource that can facilitate automation, reproducibility and cross-platform compatibility for standard bioinformatic analyses.
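The abstract does not say how accuracy against the GIAB truth sets was quantified. A comparison of this kind is commonly summarized with precision, recall and F1, and the following is only a minimal sketch of that idea, not the authors' method. It assumes variants are represented as (chrom, pos, ref, alt) records and matched exactly; dedicated benchmarking tools also normalize variant representation and restrict the comparison to high-confidence regions, which this sketch omits. The names `Variant` and `benchmark_calls` and the toy records are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not a full VCF entry)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def benchmark_calls(called: Iterable[Variant],
                    truth: Iterable[Variant]) -> Dict[str, float]:
    """Compare called variants with a truth set by exact record matching."""
    called_set = set(called)
    truth_set = set(truth)

    tp = len(called_set & truth_set)   # called and present in the truth set
    fp = len(called_set - truth_set)   # called but absent from the truth set
    fn = len(truth_set - called_set)   # in the truth set but not called

    # Guard against division by zero when either set is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy records for illustration only; they are not data from the study.
    truth = [
        Variant("chr1", 100, "A", "G"),
        Variant("chr1", 250, "C", "T"),
        Variant("chr2", 75, "G", "GA"),
        Variant("chr3", 900, "TC", "T"),
    ]
    called = truth[:3] + [Variant("chr4", 10, "A", "C")]
    print(benchmark_calls(called, truth))
```

Exact tuple matching keeps the sketch short, but it would count the same indel written in two equivalent ways as one false positive plus one false negative, which is why real comparisons normalize variants first.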