{"title":"工作流编排工作流:使用组态Web资源api的数千个查询及其容错性","authors":"Yassene Mohammed","doi":"10.1109/eScience.2018.00061","DOIUrl":null,"url":null,"abstract":"High throughput -omics like proteomics and genomics allow detailed molecular studies of organisms. Such studies are inherently on the Big Data side regarding volume and complexity. Following the FAIR principles and reaching for transparency in publication, raw data and results are often shared in public repositories. However, despite the steadily increased amount of shared omics data, it is still challenging to compare, correlate, and integrate it to answer new questions. Here we report on our experience in reusing and repurposing publically available proteomics and genomics data to design new targeted proteomics experiments. We have developed a scientific workflow to retrieve and integrate information from various repositories and domain knowledge-bases including UniPortKB [1], GPMDB [2], PRIDE [3], PeptideAtlas [4], ProteomicsDB [5], MassIVE [6], ExPASy [7], NCBI’s dbSNP [8], and PeptideTracker [9]. Following a “Map-Reduce” approach [10] the workflow select best proteotypic peptides for Multiple Reaction Monitoring (MRM) experiment. In an attempt to gain insights into the human proteome, we have designed a second workflow to orchestrate the selection workflow. 100,000s of queries were sent to online repositories to determine if peptides were seen in previous experiments. Fault tolerance ranged from dealing with no-reply to wrong annotations. Three months run of the workflow generated a comprehensive list of 165k+ suitable proteotypic peptides covering most human proteins. The main challenge has been the evolving APIs of the resources which continuously affects the components of our integrative bioinformatic solutions.","PeriodicalId":6476,"journal":{"name":"2018 IEEE 14th International Conference on e-Science (e-Science)","volume":"145 1","pages":"299-300"},"PeriodicalIF":0.0000,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Workflows Orchestrating Workflows: Thousands of Queries and Their Fault Tolerance Using APIs of Omics Web Resources\",\"authors\":\"Yassene Mohammed\",\"doi\":\"10.1109/eScience.2018.00061\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"High throughput -omics like proteomics and genomics allow detailed molecular studies of organisms. Such studies are inherently on the Big Data side regarding volume and complexity. Following the FAIR principles and reaching for transparency in publication, raw data and results are often shared in public repositories. However, despite the steadily increased amount of shared omics data, it is still challenging to compare, correlate, and integrate it to answer new questions. Here we report on our experience in reusing and repurposing publically available proteomics and genomics data to design new targeted proteomics experiments. We have developed a scientific workflow to retrieve and integrate information from various repositories and domain knowledge-bases including UniPortKB [1], GPMDB [2], PRIDE [3], PeptideAtlas [4], ProteomicsDB [5], MassIVE [6], ExPASy [7], NCBI’s dbSNP [8], and PeptideTracker [9]. Following a “Map-Reduce” approach [10] the workflow select best proteotypic peptides for Multiple Reaction Monitoring (MRM) experiment. In an attempt to gain insights into the human proteome, we have designed a second workflow to orchestrate the selection workflow. 
100,000s of queries were sent to online repositories to determine if peptides were seen in previous experiments. Fault tolerance ranged from dealing with no-reply to wrong annotations. Three months run of the workflow generated a comprehensive list of 165k+ suitable proteotypic peptides covering most human proteins. The main challenge has been the evolving APIs of the resources which continuously affects the components of our integrative bioinformatic solutions.\",\"PeriodicalId\":6476,\"journal\":{\"name\":\"2018 IEEE 14th International Conference on e-Science (e-Science)\",\"volume\":\"145 1\",\"pages\":\"299-300\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE 14th International Conference on e-Science (e-Science)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/eScience.2018.00061\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE 14th International Conference on e-Science (e-Science)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/eScience.2018.00061","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Workflows Orchestrating Workflows: Thousands of Queries and Their Fault Tolerance Using APIs of Omics Web Resources
High-throughput omics such as proteomics and genomics allow detailed molecular studies of organisms. In both volume and complexity, such studies are inherently Big Data. Following the FAIR principles and aiming for transparency in publication, raw data and results are often shared in public repositories. However, despite the steadily increasing amount of shared omics data, it remains challenging to compare, correlate, and integrate these data to answer new questions. Here we report on our experience in reusing and repurposing publicly available proteomics and genomics data to design new targeted proteomics experiments. We have developed a scientific workflow to retrieve and integrate information from various repositories and domain knowledge bases, including UniProtKB [1], GPMDB [2], PRIDE [3], PeptideAtlas [4], ProteomicsDB [5], MassIVE [6], ExPASy [7], NCBI’s dbSNP [8], and PeptideTracker [9]. Following a “Map-Reduce” approach [10], the workflow selects the best proteotypic peptides for Multiple Reaction Monitoring (MRM) experiments. To gain insight into the human proteome, we designed a second workflow to orchestrate the selection workflow. Hundreds of thousands of queries were sent to online repositories to determine whether peptides had been observed in previous experiments. Fault tolerance ranged from handling missing replies to correcting wrong annotations. A three-month run of the workflow generated a comprehensive list of more than 165,000 suitable proteotypic peptides covering most human proteins. The main challenge has been the evolving APIs of the resources, which continuously affect the components of our integrative bioinformatics solutions.
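The “Map-Reduce” selection step can be illustrated with a minimal sketch. The code below is our illustration, not the published workflow: it assumes each repository reply carries a per-peptide observation count and ranks peptides by summed evidence across resources, whereas the actual selection criteria follow [10] and weigh more properties than observation counts alone.

```python
from collections import defaultdict

def map_step(resource_records):
    """Map: emit (peptide, observation_count) pairs from one resource's reply."""
    for rec in resource_records:
        yield rec["peptide"], rec.get("observations", 0)

def reduce_step(pairs, top_n=5):
    """Reduce: sum evidence across resources and keep the top candidates."""
    totals = defaultdict(int)
    for peptide, count in pairs:
        totals[peptide] += count
    return sorted(totals, key=totals.get, reverse=True)[:top_n]

# Illustrative replies from two resources (sequences and counts are made up).
replies = [
    [{"peptide": "LVNEVTEFAK", "observations": 120},
     {"peptide": "TESTPEPTIDER", "observations": 3}],
    [{"peptide": "LVNEVTEFAK", "observations": 80}],
]
pairs = (pair for reply in replies for pair in map_step(reply))
print(reduce_step(pairs, top_n=1))  # -> ['LVNEVTEFAK']
```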
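The fault-tolerance theme of the title can likewise be sketched. The snippet below shows one common pattern for tolerating no-reply faults when sending many queries to a web API: retry with exponential backoff, and distinguish a definitive “not found” from a transient failure. The endpoint URL and function name are hypothetical placeholders; none of the listed resources’ real API paths are reproduced here.

```python
import time
import requests  # third-party HTTP library: pip install requests

# Hypothetical endpoint standing in for any of the REST APIs above
# (PRIDE, PeptideAtlas, ProteomicsDB, ...); real paths differ per
# resource and evolve over time, which is exactly what the
# orchestrating workflow must tolerate.
EXAMPLE_URL = "https://example.org/api/peptides/{sequence}"

def query_with_retries(sequence, retries=5, backoff=2.0, timeout=30):
    """Query one repository for one peptide, tolerating no-reply faults.

    Returns the parsed JSON record, or None if the peptide is unknown
    to the resource or the resource stayed unreachable after all retries.
    """
    url = EXAMPLE_URL.format(sequence=sequence)
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=timeout)
            if resp.status_code == 404:
                return None                # definitive: peptide not seen here
            resp.raise_for_status()        # treat 5xx as a transient fault
            return resp.json()
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            time.sleep(backoff * 2 ** attempt)  # exponential backoff
    return None                            # give up after all retries; log and move on

if __name__ == "__main__":
    # LVNEVTEFAK is a well-known proteotypic peptide of human serum albumin.
    print(query_with_retries("LVNEVTEFAK") or "no usable reply")
```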