Workflows Orchestrating Workflows: Thousands of Queries and Their Fault Tolerance Using APIs of Omics Web Resources

Yassene Mohammed
2018 IEEE 14th International Conference on e-Science (e-Science), pp. 299-300, published 2018-10-01. DOI: 10.1109/eScience.2018.00061 (https://doi.org/10.1109/eScience.2018.00061)

Abstract

High-throughput omics technologies such as proteomics and genomics allow detailed molecular studies of organisms. Such studies are inherently Big Data problems in both volume and complexity. In keeping with the FAIR principles and the push for transparency in publication, raw data and results are often shared in public repositories. However, despite the steadily increasing amount of shared omics data, it is still challenging to compare, correlate, and integrate it to answer new questions. Here we report on our experience in reusing and repurposing publicly available proteomics and genomics data to design new targeted proteomics experiments. We have developed a scientific workflow to retrieve and integrate information from various repositories and domain knowledge bases, including UniProtKB [1], GPMDB [2], PRIDE [3], PeptideAtlas [4], ProteomicsDB [5], MassIVE [6], ExPASy [7], NCBI's dbSNP [8], and PeptideTracker [9]. Following a "Map-Reduce" approach [10], the workflow selects the best proteotypic peptides for Multiple Reaction Monitoring (MRM) experiments. In an attempt to gain insights into the human proteome, we have designed a second workflow to orchestrate the selection workflow. Hundreds of thousands of queries were sent to online repositories to determine whether peptides had been seen in previous experiments. Fault tolerance ranged from handling missing replies to handling wrong annotations. A three-month run of the workflow generated a comprehensive list of 165k+ suitable proteotypic peptides covering most human proteins. The main challenge has been the evolving APIs of the resources, which continuously affect the components of our integrative bioinformatics solutions.
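The pattern described above (fan a peptide out to several repositories, tolerate repositories that do not reply, then combine the answers) can be illustrated with a minimal Python sketch. This is not the authors' workflow: the function names, the retry count, the backoff schedule, and the `min_support` selection rule are all assumptions made for illustration, and the repository clients are passed in as plain callables rather than calling any real API.

```python
import time
from typing import Callable, Dict, List, Optional

# Hypothetical interface: a fetcher takes a peptide sequence and returns
# True/False for "seen in previous experiments", or raises on failure.
Fetcher = Callable[[str], bool]


def query_with_retry(fetch: Fetcher, peptide: str, retries: int = 3,
                     base_delay: float = 0.0) -> Optional[bool]:
    """Query one repository; return None if it never gives a usable reply."""
    for attempt in range(retries):
        try:
            return fetch(peptide)
        except Exception:
            # No reply or unusable reply: wait (exponential backoff), retry.
            time.sleep(base_delay * (2 ** attempt))
    return None


def map_peptide(peptide: str,
                repositories: Dict[str, Fetcher]) -> Dict[str, Optional[bool]]:
    """Map step: ask every repository about a single peptide."""
    return {name: query_with_retry(fetch, peptide)
            for name, fetch in repositories.items()}


def reduce_observations(mapped: Dict[str, Dict[str, Optional[bool]]],
                        min_support: int = 2) -> List[str]:
    """Reduce step: keep peptides confirmed by at least min_support
    repositories. A None (no reply) counts as neither for nor against."""
    selected = []
    for peptide, replies in mapped.items():
        support = sum(1 for seen in replies.values() if seen is True)
        if support >= min_support:
            selected.append(peptide)
    return sorted(selected)
```

Keeping a failed query as `None` instead of `False` separates "the repository did not answer" from "the peptide was not observed", which is one plausible way to keep an outage from being read as negative evidence. In a long run a non-zero `base_delay` would normally be used so that retries do not hammer a struggling service.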