An automated infrastructure to support high-throughput bioinformatics

2014 International Conference on High Performance Computing & Simulation (HPCS), pp. 600-607. Published: 2014-07-21. DOI: 10.1109/HPCSim.2014.6903742

G. Cuccuru, Simone Leo, L. Lianas, Michele Muggiri, Andrea Pinna, L. Pireddu, P. Uva, A. Angius, G. Fotia, G. Zanetti


Citations: 10

Abstract


The number of domains affected by the big data phenomenon is constantly increasing, both in science and industry, with high-throughput DNA sequencers being among the most massive data producers. Building analysis frameworks that can keep up with such a high production rate, however, is only part of the problem: current challenges include dealing with articulated data repositories where objects are connected by multiple relationships, managing complex processing pipelines where each step depends on a large number of configuration parameters and ensuring reproducibility, error control and usability by non-technical staff. Here we describe an automated infrastructure built to address the above issues in the context of the analysis of the data produced by the CRS4 next-generation sequencing facility. The system integrates open source tools, either written by us or publicly available, into a framework that can handle the whole data transformation process, from raw sequencer output to primary analysis results.

