ERMrest: A Collaborative Data Catalog with Fine Grain Access Control
Karl Czajkowski, Carl Kesselman, Robert Schuler
Pub Date: 2017-10-01 | Epub Date: 2017-11-16 | DOI: 10.1109/eScience.2017.83
Creating and maintaining an accurate description of data assets and the relationships between assets is a critical aspect of making data findable, accessible, interoperable, and reusable (FAIR). Typically, such metadata are created and maintained in a data catalog by a curator as part of data publication. However, allowing metadata to be created and maintained by data producers as the data is generated, rather than waiting for publication, can have significant advantages in terms of productivity and repeatability. The responsibilities for metadata management need not fall on any one individual, but rather may be delegated to appropriate members of a collaboration, enabling participants to edit or maintain specific attributes, to describe relationships between data elements, or to correct errors. To support such collaborative data editing, we have created ERMrest, a relational data service for the Web that enables the creation, evolution, and navigation of complex models used to describe and structure diverse file or relational data objects. A key capability of ERMrest is its ability to control operations down to the level of individual data elements, i.e., fine-grained access control, so that many different modes of data-oriented collaboration can be supported. In this paper we introduce ERMrest and describe its fine-grained access control capabilities that support collaborative editing. ERMrest is in daily use in many data-driven collaborations, and we describe a sample policy based on a common biocuration pattern.
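Fine-grained access control of the kind the abstract describes can be pictured as per-row ACLs consulted on every operation. The sketch below is a generic illustration of that pattern, not ERMrest's actual policy language; the row layout, operation names, and `check` helper are hypothetical.

```python
# Generic sketch of row-level (fine-grained) access control: each row
# carries per-operation ACLs, and a request succeeds only if the user
# or one of their groups appears in the ACL for that operation.
# Hypothetical structure for illustration, not ERMrest's policy model.

def check(row, operation, user, groups):
    """Return True if `user` (or any group in `groups`) may perform
    `operation` on `row`; '*' marks an operation open to everyone."""
    acl = row["acls"].get(operation, set())
    return "*" in acl or user in acl or bool(acl & set(groups))

row = {
    "id": 17,
    "value": "specimen-0042",
    "acls": {
        "select": {"*"},                  # anyone may read this row
        "update": {"curators", "alice"},  # only curators or alice may edit
    },
}
```

A member of the `curators` group could then edit the row while a user outside it could only read it, which is the delegation pattern the abstract describes: different collaborators maintain different data elements.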
Experiences with Deriva: An Asset Management Platform for Accelerating eScience
Alejandro Bugacov, Karl Czajkowski, Carl Kesselman, Anoop Kumar, Robert E Schuler, Hongsuda Tangmunarunkit
Pub Date: 2017-10-01 | Epub Date: 2017-11-16 | DOI: 10.1109/eScience.2017.20
The pace of discovery in eScience is increasingly dependent on a scientist's ability to acquire, curate, integrate, analyze, and share large and diverse collections of data. It is all too common for investigators to spend inordinate amounts of time developing ad hoc procedures to manage their data. In previous work, we presented Deriva, a Scientific Asset Management System designed to accelerate data-driven discovery. In this paper, we report on the use of Deriva in a number of substantial and diverse eScience applications. We describe the lessons we have learned, both from the perspective of the Deriva technology and from the ability and willingness of scientists to incorporate Scientific Asset Management into their daily workflows.
Using Hidden Markov Models to Determine Changes in Subject Data over Time, Studying the Immunoregulatory Effect of Mesenchymal Stem Cells
Edgar F Black, Luigi Marini, Ashwini Vaidya, Dora Berman, Melissa Willman, Dan Salomon, Amelia Bartholomew, Norma Kenyon, Kenton McHenry
Pub Date: 2014-10-01 | DOI: 10.1109/eScience.2014.29
A novel application of Hidden Markov Models is used to support research testing the immunoregulatory effects of mesenchymal stem cells in a cynomolgus monkey model of islet transplantation. The Hidden Markov Model, an unsupervised learning data mining technique, is used to automatically determine the postoperative day (POD) corresponding to a decrease of graft function, a possible sign of transplant rejection, in nonhuman primates after isolated islet cell transplant. Currently, decrease of graft function is determined solely by expert judgment. Further, information gathered from the evaluation of the constructed Hidden Markov Models is used as part of a clustering method to aggregate the nonhuman subjects into groups or clusters, with the objective of finding similarities that could potentially help predict the health outcome of subjects undergoing postoperative care. Results on expert-labeled data show the HMM to be accurate 60% of the time. Clusters based on the HMMs further suggest a possible correspondence between donor haplotype matching and loss-of-function outcomes.
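The change-point task in the abstract, finding the postoperative day on which graft function drops, can be sketched as Viterbi decoding of a two-state Gaussian HMM over a daily measurement. This is a minimal sketch under assumed parameters (two emission means, a shared standard deviation, symmetric transitions) and synthetic data, not the authors' actual model.

```python
import math

def viterbi_two_state(obs, mu0, mu1, sd=1.0, p_stay=0.9):
    """Decode a two-state Gaussian HMM: state 0 = normal graft function,
    state 1 = reduced function. Returns the most likely state path."""
    def logpdf(x, mu):
        return -((x - mu) ** 2) / (2 * sd * sd) - math.log(sd * math.sqrt(2 * math.pi))

    log_t = [[math.log(p_stay), math.log(1 - p_stay)],
             [math.log(1 - p_stay), math.log(p_stay)]]
    mus = (mu0, mu1)
    # Uniform prior over the two starting states.
    v = [math.log(0.5) + logpdf(obs[0], mus[s]) for s in (0, 1)]
    back = []
    for x in obs[1:]:
        scores = [[v[p] + log_t[p][s] for p in (0, 1)] for s in (0, 1)]
        bp = [0 if scores[s][0] >= scores[s][1] else 1 for s in (0, 1)]
        v = [scores[s][bp[s]] + logpdf(x, mus[s]) for s in (0, 1)]
        back.append(bp)
    # Backtrack from the best final state through the stored pointers.
    s = 0 if v[0] >= v[1] else 1
    path = [s]
    for bp in reversed(back):
        s = bp[s]
        path.append(s)
    return path[::-1]

# Synthetic daily measurements: function drops after the fifth sample.
obs = [9.8, 10.1, 9.9, 10.2, 10.0, 4.1, 3.9, 4.2, 4.0, 3.8]
path = viterbi_two_state(obs, mu0=10.0, mu1=4.0)
pod = path.index(1)  # first day decoded in the reduced-function state
```

The `p_stay` transition probability plays the role of a smoothness prior: the larger it is, the more evidence is needed before the decoder commits to a state switch, which is what makes the HMM more robust than simple thresholding of noisy daily values.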
Folding Proteins at 500 ns/hour with Work Queue
Badi' Abdul-Wahid, Li Yu, Dinesh Rajan, Haoyun Feng, Eric Darve, Douglas Thain, Jesús A Izaguirre
Pub Date: 2012-10-01 | DOI: 10.1109/eScience.2012.6404429
Molecular modeling is a field that traditionally has large computational costs. Until recently, most simulation techniques relied on long trajectories, which inherently have poor scalability. A new class of methods has been proposed that requires only a large number of short calculations and minimal communication between compute nodes. We considered one of the more accurate variants, called Accelerated Weighted Ensemble Dynamics (AWE), for which distributed computing can be made efficient. We implemented AWE using the Work Queue framework for task management and applied it to an all-atom protein model (the Fip35 WW domain). We can run with excellent scalability by simultaneously utilizing heterogeneous resources from multiple computing platforms such as clouds (Amazon EC2, Microsoft Azure), dedicated clusters, and grids, on multiple architectures (CPU/GPU, 32/64-bit), and in a dynamic environment in which processes are regularly added to or removed from the pool. This has allowed us to achieve an aggregate sampling rate of over 500 ns/hour; by comparison, a single process typically achieves 0.1 ns/hour.
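The scaling argument in the abstract rests on the workload being many short, independent tasks fanned out by a master and aggregated as results return. The sketch below illustrates that master-worker pattern with Python's standard library only; the seeded random walk is a stand-in for a short simulation segment, and none of this is the actual Work Queue API.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def short_segment(seed, steps=100):
    """Stand-in for one short simulation segment: a seeded 1-D random
    walk whose endpoint plays the role of a sampled conformation."""
    rng = random.Random(seed)
    x = 0.0
    for _ in range(steps):
        x += rng.gauss(0.0, 1.0)
    return x

def run_round(n_walkers=50, max_workers=8):
    # Each walker is independent, so tasks need no communication with
    # each other -- only results flow back to the master for aggregation.
    # Because workers are stateless, they can live on any mix of
    # machines and can join or leave between rounds.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        endpoints = list(pool.map(short_segment, range(n_walkers)))
    return endpoints

endpoints = run_round()
```

In the real system the pool of workers spans clouds, clusters, and grids rather than local threads, but the contract is the same: the master only submits tasks and collects results, so throughput grows with whatever resources happen to be available.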
rCAD: A Novel Database Schema for the Comparative Analysis of RNA
Stuart Ozer, Kishore J Doshi, Weijia Xu, Robin R Gutell
Pub Date: 2011-12-31 | DOI: 10.1109/eScience.2011.11
Beyond its direct involvement in protein synthesis with mRNA, tRNA, and rRNA, RNA is now being appreciated for its significance in the overall metabolism and regulation of the cell. Comparative analysis has been very effective in the identification and characterization of RNA molecules, including the accurate prediction of their secondary structure. We are developing an integrative, scalable data management and analysis system, the RNA Comparative Analysis Database (rCAD), implemented with SQL Server to support RNA comparative analysis. The platform-agnostic database schema of rCAD captures the essential relationships between the different dimensions of information for RNA comparative analysis datasets. The rCAD implementation enables a variety of comparative analysis manipulations with multiple integrated data dimensions for advanced RNA comparative analysis workflows. In this paper, we describe details of the rCAD schema design and illustrate its usefulness with two usage scenarios.
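The kind of schema the abstract describes, relating sequences to alignment columns so that comparative queries become joins, can be sketched in miniature. The tables and toy "conservation" query below are an illustrative simplification in SQLite, not the actual rCAD SQL Server schema.

```python
import sqlite3

# Toy relational schema for comparative analysis: sequences, alignments,
# and a cell table mapping alignment columns to sequence positions.
# An illustrative simplification, not the actual rCAD schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sequence (
    seq_id   INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    residues TEXT NOT NULL
);
CREATE TABLE alignment (
    aln_id INTEGER PRIMARY KEY,
    name   TEXT NOT NULL
);
CREATE TABLE aln_cell (              -- one row per (alignment, sequence, column)
    aln_id  INTEGER REFERENCES alignment(aln_id),
    seq_id  INTEGER REFERENCES sequence(seq_id),
    col     INTEGER NOT NULL,        -- alignment column
    seq_pos INTEGER,                 -- position in the sequence (NULL = gap)
    base    TEXT,
    PRIMARY KEY (aln_id, seq_id, col)
);
""")

conn.execute("INSERT INTO alignment VALUES (1, 'toy')")
for seq_id, name, residues in [(1, 'seqA', 'ACGU'), (2, 'seqB', 'AGGU')]:
    conn.execute("INSERT INTO sequence VALUES (?, ?, ?)", (seq_id, name, residues))
    for col, base in enumerate(residues):
        conn.execute("INSERT INTO aln_cell VALUES (1, ?, ?, ?, ?)",
                     (seq_id, col, col, base))

# Columns where every sequence shows the same base (a toy conservation query).
conserved = [c for (c,) in conn.execute(
    "SELECT col FROM aln_cell WHERE aln_id = 1 "
    "GROUP BY col HAVING COUNT(DISTINCT base) = 1 ORDER BY col")]
```

Normalizing alignment membership into a cell table is what lets per-column statistics, structure annotations, or phylogenetic labels be layered on as further joins rather than bespoke file parsing, which is the general design point the abstract makes.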