Mining Common Spatial-Temporal Periodic Patterns of Animal Movement
Pub Date : 2013-10-22 | DOI: 10.1109/eScience.2013.11
Yuwei Wang, Ze Luo, Gang Qin, Yuanchun Zhou, Danhuai Guo, Baoping Yan
Advanced satellite tracking technologies enable biologists to track animal movements at finer spatial and temporal scales. The resulting long-term movement data are valuable for understanding animal activities. Periodic pattern analysis offers an insightful approach to revealing animal activity patterns. However, individual GPS data are usually incomplete and cover only a limited lifespan. In addition, individual periodic behaviors are inherently complicated, with many uncertainties. In this paper, we address the problem of mining periodic patterns of animal movements by combining multiple individuals with similar periodicities. We formally define the problem of mining common periodicity and propose a novel periodicity measure. We introduce information entropy into the proposed measure to detect the common period. Data incompleteness, noise, and the ambiguity of individual periodicity are all considered in our method. Furthermore, we mine multiple common periodic patterns by grouping periodic segments with respect to the detected period, and we provide a visualization method for common periodic patterns by designing a cyclical filled line chart. To assess the effectiveness of the proposed method, we present an experimental study using a real GPS dataset collected from 29 birds at Qinghai Lake, China.
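The abstract does not spell out the measure, but the core idea of entropy-based period detection can be sketched: pool the individuals' observation times, fold them by each candidate period, and score how concentrated the resulting phase histogram is. A minimal Python sketch under that reading; the function names, binning, and pooling scheme are illustrative assumptions, not the paper's definitions:

```python
import numpy as np

def entropy_period_score(timestamps, candidate_period, n_bins=24):
    """Score a candidate period by the Shannon entropy of the folded
    observation distribution: the more the phase histogram concentrates
    in a few bins, the lower the entropy and the stronger the evidence
    for that period. (Hypothetical illustration; the paper's measure
    additionally handles noise and missing data.)"""
    phases = np.mod(timestamps, candidate_period) / candidate_period
    hist, _ = np.histogram(phases, bins=n_bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]                      # empty bins contribute nothing
    return -np.sum(p * np.log2(p))

def detect_common_period(individuals, candidates):
    """Pool timestamps (e.g. hours at which birds were observed in a
    region) from several individuals, then pick the candidate period
    with the lowest pooled phase entropy."""
    pooled = np.concatenate(individuals)
    return min(candidates, key=lambda T: entropy_period_score(pooled, T))
```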
{"title":"Mining Common Spatial-Temporal Periodic Patterns of Animal Movement","authors":"Yuwei Wang, Ze Luo, Gang Qin, Yuanchun Zhou, Danhuai Guo, Baoping Yan","doi":"10.1109/eScience.2013.11","DOIUrl":"https://doi.org/10.1109/eScience.2013.11","url":null,"abstract":"Advanced satellite tracking technologies enable biologists to track animal movements at finer spatial and temporal scales. The resulting long-term movement data is very meaningful for understanding animal activities. Periodic pattern analysis can provide insightful approach to reveal animal activity patterns. However, individual GPS data is usually incomplete and in limited lifespan. In addition, individual periodic behaviors are inherently complicated with many uncertainties. In this paper, we address the problem of mining periodic patterns of animal movements by combining multiple individuals with similar periodicities. We formally define the problem of mining common periodicity and propose a novel periodicity measure. We introduce the information entropy in the proposed measure to detect common period. Data incompleteness, noises, and ambiguity of individual periodicity are considered in our method. Furthermore, we mine multiple common periodic patterns by grouping periodic segments w.r.t. the detected period, and provide a visualization method of common periodic patterns by designing a cyclical filled line chart. To assess effectiveness of our proposed method, we provide an experimental study using a real GPS dataset collected on 29 birds in Qinghai Lake, China.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126564506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sharing Australia's Nationally Significant Terrestrial Ecosystem Data: A Collaboration between TERN and ANDS
Pub Date : 2013-10-22 | DOI: 10.1109/ESCIENCE.2013.28
S. Guru, Xiaobin Shen, C. Love, A. Treloar, S. Phinn, Ross Wilkinson, Cathrine Brady, P. Isaac, T. Clancy
Collection-based approaches are commonly used in libraries for collections of physical and electronic resources. Nationally significant collections of research data, however, are a newer development, and one of increasing importance to researchers. Bringing together datasets of national significance and making them openly accessible will enable us to address some of the critical questions facing our society. In the ecosystem domain, it will enable us to understand more about the causes and effects of changes in ecosystems. An implementation case study is presented, based on TERN's OzFlux data collections as part of a national collections program initiated by ANDS. This paper demonstrates how the Terrestrial Ecosystem Research Network (TERN) and the Australian National Data Service (ANDS) are working together to identify and publish nationally significant ecosystem data collections to enhance discoverability, accessibility, and re-use.
{"title":"Sharing Australia's Nationally Significant Terrestrial Ecosystem Data: A Collaboration between TERN and ANDS","authors":"S. Guru, Xiaobin Shen, C. Love, A. Treloar, S. Phinn, Ross Wilkinson, Cathrine Brady, P. Isaac, T. Clancy","doi":"10.1109/ESCIENCE.2013.28","DOIUrl":"https://doi.org/10.1109/ESCIENCE.2013.28","url":null,"abstract":"Collection based approaches are commonly used in libraries for collections of physical and electronic resources. However nationally significant collections of research data are new development, one that is increasing importance to researchers. Bringing together datasets of national significance and making them openly accessible will enable to address some of the critical questions facing our society. In ecosystem domain, this will enable us to understand more about causes and effects of changes in the ecosystem. An implementation case study based on Tern's OzFlux data collections as a national collection program initiated by ANDS is presented. This paper demonstrates how the Terrestrial Ecosystem Research Network (TERN) and the Australian National Data Service (ANDS) are working together to identify and publish nationally significant ecosystem data collections to enhance discoverability, accessibility and re-use.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125358786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Algorithm for Cost-Effectively Storing Scientific Datasets with Multiple Service Providers in the Cloud
Pub Date : 2013-10-22 | DOI: 10.1109/eScience.2013.34
Dong Yuan, X. Liu, Li-zhen Cui, Tiantian Zhang, Wenhao Li, Dahai Cao, Yun Yang
The proliferation of cloud computing allows scientists to deploy computation- and data-intensive applications without infrastructure investment, where large generated datasets can be flexibly stored with multiple cloud service providers. Under the pay-as-you-go model, the total application cost largely depends on the usage of computation, storage and bandwidth resources, so cutting the cost of cloud-based data storage becomes a major concern when deploying scientific applications in the cloud. In this paper, we propose a novel algorithm that automatically decides whether a generated dataset should be 1) stored in the current cloud, 2) deleted and re-generated whenever reused, or 3) transferred to a cheaper cloud service for storage. The algorithm finds the trade-off among computation, storage and bandwidth costs in the cloud, which are the three key factors in the cost of storing generated application datasets with multiple cloud service providers. Simulations conducted with popular cloud service providers' pricing models show that the proposed algorithm is highly cost-effective for use in the cloud.
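As a rough illustration of the three-way decision, the sketch below compares the cost of each option under a deliberately simplified pricing model (flat per-GB-month storage rates, a fixed regeneration cost per reuse, and a one-off transfer fee). All parameter names are hypothetical; the paper's algorithm works with real providers' pricing models and richer usage information:

```python
def best_strategy(storage_rate_per_gb_month, size_gb, months,
                  regen_compute_cost, expected_reuses,
                  cheaper_rate_per_gb_month, transfer_cost_per_gb):
    """Return the cheapest of the three options the abstract lists,
    under a toy cost model: keep the dataset where it is, delete and
    regenerate it on each reuse, or move it to a cheaper provider."""
    keep = storage_rate_per_gb_month * size_gb * months
    regenerate = regen_compute_cost * expected_reuses
    move = (transfer_cost_per_gb * size_gb
            + cheaper_rate_per_gb_month * size_gb * months)
    return min([("store", keep), ("regenerate", regenerate), ("transfer", move)],
               key=lambda option: option[1])
```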
{"title":"An Algorithm for Cost-Effectively Storing Scientific Datasets with Multiple Service Providers in the Cloud","authors":"Dong Yuan, X. Liu, Li-zhen Cui, Tiantian Zhang, Wenhao Li, Dahai Cao, Yun Yang","doi":"10.1109/eScience.2013.34","DOIUrl":"https://doi.org/10.1109/eScience.2013.34","url":null,"abstract":"The proliferation of cloud computing allows scientists to deploy computation and data intensive applications without infrastructure investment, where large generated datasets can be flexibly stored with multiple cloud service providers. Due to the pay-as-you-go model, the total application cost largely depends on the usage of computation, storage and bandwidth resources, and cutting the cost of cloud-based data storage becomes a big concern for deploying scientific applications in the cloud. In this paper, we propose a novel algorithm that can automatically decide whether a generated dataset should be 1) stored in the current cloud, 2) deleted and re-generated whenever reused or 3) transferred to cheaper cloud service for storage. The algorithm finds the trade-off among computation, storage and bandwidth costs in the cloud, which are three key factors for the cost of storing generated application datasets with multiple cloud service providers. Simulations conducted with popular cloud service providers' pricing models show that the proposed algorithm is highly cost-effective to be utilised in the cloud.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"2015 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127797984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Decentralized Prioritization-Based Management Systems for Distributed Computing
Pub Date : 2013-10-22 | DOI: 10.1109/eScience.2013.44
Per-Olov Östberg, E. Elmroth
Fairshare scheduling is an established technique for providing user-level differentiation in the management of capacity consumption in high-performance and grid computing scheduler systems. In this paper we extend a state-of-the-art approach to decentralized grid fairshare and propose a generalized model for the construction of decentralized prioritization-based management systems. The approach is based on the (re)formulation of control problems as prioritization problems, and on a proposed framework for computationally efficient decentralized priority calculation. The model is presented along with a discussion of the application of decentralized management systems in distributed computing environments, which outlines selected use cases and illustrates key trade-off behaviors of the proposed model.
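The paper's formulation is not reproduced here, but a generic fairshare priority in this spirit scores each user by the gap between a target share and decay-weighted recent consumption. A sketch, with the names and decay scheme assumed rather than taken from the paper:

```python
def fairshare_priority(target_share, usage_shares, decay=0.7):
    """Hypothetical fairshare priority: the user's target share minus a
    decay-weighted average of their recent consumption shares (most
    recent epoch first). Under-served users float up; users with heavy
    recent consumption sink."""
    weights = [decay ** age for age in range(len(usage_shares))]
    consumed = sum(w * u for w, u in zip(weights, usage_shares))
    total_weight = sum(weights) or 1.0
    return target_share - consumed / total_weight
```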
{"title":"Decentralized Prioritization-Based Management Systems for Distributed Computing","authors":"Per-Olov Östberg, E. Elmroth","doi":"10.1109/eScience.2013.44","DOIUrl":"https://doi.org/10.1109/eScience.2013.44","url":null,"abstract":"Fairshare scheduling is an established technique to provide user-level differentiation in management of capacity consumption in high-performance and grid computing scheduler systems. In this paper we extend on a state-of-the-art approach to decentralized grid fairs hare and propose a generalized model for construction of decentralized prioritization-based management systems. The approach is based on (re)formulation of control problems as prioritization problems, and a proposed framework for computationally efficient decentralized priority calculation. The model is presented along with a discussion of application of decentralized management systems in distributed computing environments that outlines selected use cases and illustrates key trade-off behaviors of the proposed model.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127735168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Geographical Approach for Metadata Quality Improvement in Biological Observation Databases
Pub Date : 2013-10-22 | DOI: 10.1109/eScience.2013.14
D. C. Cugler, C. B. Medeiros, S. Shekhar, L. F. Toledo
This paper addresses the problem of improving the quality of metadata in biological observation databases, in particular those associated with observations of living beings, which are often used as a starting point for biodiversity analyses. Poor-quality metadata lead to incorrect scientific conclusions and can mislead experts. Thus, it is important to design and develop methods to detect and correct metadata quality problems. This is a challenging problem because of the variety of issues concerning such metadata, e.g., misnaming of species, and uncertainty and imprecision about where observations were recorded. Related work is limited because it does not adequately model such issues. We propose a geographic approach based on expert-led classification of place and/or range mismatch anomalies detected by our algorithms. Our approach enables detection of anomalies both in species' reported geographic distributions and in species identification. Our main contribution is a geographic algorithm that deals with uncertain/imprecise locations. Our work is tested in a case study with the Fonoteca Neotropical Jacques Vielliard, one of the 10 largest animal sound collections in the world.
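One way to picture the geographic test for a range mismatch under location uncertainty: an observation is flagged only when even its whole uncertainty disk falls outside the species' reported range. A minimal sketch using shapely, assuming planar coordinates in kilometres; the paper's actual algorithm and its expert-led anomaly classification are richer:

```python
from shapely.geometry import Point

def range_mismatch(x_km, y_km, uncertainty_km, species_range):
    """Flag a candidate range-mismatch anomaly only when the entire
    uncertainty disk around the observation lies outside the species'
    reported range polygon (distance is 0 for points inside it)."""
    return Point(x_km, y_km).distance(species_range) > uncertainty_km
```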
{"title":"A Geographical Approach for Metadata Quality Improvement in Biological Observation Databases","authors":"D. C. Cugler, C. B. Medeiros, S. Shekhar, L. F. Toledo","doi":"10.1109/eScience.2013.14","DOIUrl":"https://doi.org/10.1109/eScience.2013.14","url":null,"abstract":"This paper addresses the problem of improving the quality of metadata in biological observation databases, in particular those associated with observations of living beings, and which are often used as a starting point for biodiversity analyses. Poor quality metadata lead to incorrect scientific conclusions, and can mislead experts. Thus, it is important to design and develop methods to detect and correct metadata quality problems. This is a challenging problem because of the variety of issues concerning such metadata, e.g., misnaming of species, location uncertainty and imprecision concerning where observations were recorded. Related work is limited because it does not adequately model such issues. We propose a geographic approach based on expert-led classification of place and/or range mismatch anomalies detected by our algorithms. Our approach enables detection of anomalies in both species' reported geographic distributions and in species' identification. Our main contribution is our geographic algorithm that deals with uncertain/imprecise locations. Our work is tested using a case study with the Fonoteca Neotropical Jacques Vielliard, one of the 10 largest animal sound collections in the world.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"199 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128218935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biocharts: Unifying Biological Hypotheses with Models and Experiments
Pub Date : 2013-10-22 | DOI: 10.1109/eScience.2013.41
H. Kugler
Understanding how biological systems develop and function remains one of the main open scientific challenges of our times. An improved quantitative understanding of biological systems, assisted by computational models, is also important for future bioengineering and biomedical applications. We present a computational approach aimed at unifying hypotheses with models and experiments, allowing one to formally represent what a biological system does (specification) and how it does it (mechanism), and to systematically compare these against data characterizing system behavior (experiments). We describe our Biocharts framework, which supports this approach, and illustrate its application in several biological domains, including bacterial colony growth, developmental biology, and stem cell population dynamics.
{"title":"Biocharts: Unifying Biological Hypotheses with Models and Experiments","authors":"H. Kugler","doi":"10.1109/eScience.2013.41","DOIUrl":"https://doi.org/10.1109/eScience.2013.41","url":null,"abstract":"Understanding how biological systems develop and function remains one of the main open scientific challenges of our times. An improved quantitative understanding of biological systems, assisted by computational models is also important for future bioengineering and biomedical applications. We present a computational approach aimed towards unifying hypotheses with models and experiments, allowing to formally represent what a biological system does (specification) how it does it (mechanism) and systematically compare to data characterizing system behavior(experiments). We describe our Biocharts framework geared towards supporting this approach and illustrate its application in several biological domains including bacterial colony growth, developmental biology, and stem cell population dynamics.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132244640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Protein Structure Modeling in a Grid Computing Environment
Pub Date : 2013-10-22 | DOI: 10.1109/eScience.2013.15
Daniel Li, B. Tsui, Charles Xue, J. Haga, Kohei Ichikawa, S. Date
Advances in sequencing technology have resulted in an exponential increase in the availability of protein sequence information. To fully utilize this information, it is important to translate primary sequences into high-resolution tertiary protein structures. MODELLER is a leading homology modeling method that produces high-quality protein structures. In this study, the function of MODELLER was expanded by configuring and deploying it on a parallel grid computing platform using a custom four-step workflow. The workflow consisted of template selection through a protein BLAST algorithm, target-template protein sequence alignment, distribution of model-generation jobs among the compute clusters, and final protein model optimization. To test the validity of this workflow, we used the Dual Specificity Phosphatase (DSP) protein family, whose members share high homology with one another. Comparison of the DSP member SSH-2 with its model counterpart revealed a minimal 1.3% difference in output energy scores. Furthermore, the Dali pairwise comparison program demonstrated a 98% match among amino acid features and a Z-score of 26.6, indicating very significant similarity between the model and the actual protein structure. After confirming the accuracy of our workflow, we generated 23 previously unknown DSP family protein structure models. Over 40,000 models were generated, 30 times faster than with conventional computing. Virtual receptor-ligand screening results for the modeled protein DSP21 were compared with two known structures that had either higher or lower structural homology to DSP21. There was a significant difference (p < 0.001) between the average ligand-ranking discrepancy of the more homologous protein pair and that of the less homologous pair, suggesting that the protein models generated were sufficiently accurate for virtual screening. These results demonstrate the accuracy and usability of a grid-enabled MODELLER program and the increased efficiency of processing protein structure models. This workflow will help increase the speed of future drug development pipelines.
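The third workflow step, distributing model-generation jobs, is the part a grid parallelizes trivially, since each target-template alignment is modeled independently. A single-machine sketch of that fan-out, where run_modeller_job is a hypothetical wrapper around one MODELLER run (not part of MODELLER's own API):

```python
from concurrent.futures import ProcessPoolExecutor

def generate_models(alignments, run_modeller_job, n_workers=32):
    """Fan independent model-generation jobs out across worker
    processes; a grid scheduler does the same across compute clusters.
    `run_modeller_job` (hypothetical) builds one model per
    target-template alignment and returns its score."""
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(run_modeller_job, alignments))
```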
{"title":"Protein Structure Modeling in a Grid Computing Environment","authors":"Daniel Li, B. Tsui, Charles Xue, J. Haga, Koheix Ichikawa, S. Date","doi":"10.1109/eScience.2013.15","DOIUrl":"https://doi.org/10.1109/eScience.2013.15","url":null,"abstract":"Advances in sequencing technology have resulted in an exponential increase in the availability of protein sequence information. In order to fully utilize information, it is important to translate the primary sequences into high-resolution tertiary protein structures. MODELLER is a leading homology modeling method that produces high quality protein structures. In this study, the function of MODELLER was expanded by configuring and deploying it on a parallel grid computing platform using a custom four-step workflow. The workflow consisted of template selection through a protein BLAST algorithm, target-template protein sequence alignment, distribution of model generation jobs among the compute clusters, and final protein model optimization. To test the validity of this workflow, we used the Dual Specificity Phosphatase (DSP) protein family, which shares high homology among each other. Comparison of the DSP member SSH-2 with its model counterpart revealed a minimal 1.3% difference in output energy scores. Furthermore, the Dali Pair wise Comparison Program demonstrated a 98% match among amino acid features and a Z-score of 26.6 indicating very significant similarities between the model and actual protein structure. After confirming the accuracy of our workflow, we generated 23 previously unknown DSP family protein structure models. Over 40,000 models were generated 30 times faster than conventional computing. Virtual receptor-ligand screening results of modeled protein DSP21 were compared with two known structures that had either higher or lower structural homology to DSP21. There was a significant difference (p!0.001) between the average ligand ranking discrepancy of a more homologous protein pair and a less homologous protein pair, suggesting that the protein models generated were sufficiently accurate for virtual screening. These results demonstrate the accuracy and usability of a grid-enabled MODELLER program and the increased efficiency of processing protein structure models. This workflow will help increase the speed of future drug development pipelines.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"159 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121220086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallelizing Astronomical Source Extraction on the GPU
Pub Date : 2013-10-22 | DOI: 10.1109/eScience.2013.10
B. Zhao, Qiong Luo, Chao Wu
In astronomical observatory projects, raw images are processed so that information about the celestial objects in the images is extracted into catalogs. As such, this source extraction is the basis for the various analysis tasks that are subsequently performed on the catalog products. With the rapid progress of new, large astronomical projects, observational images will be produced every few seconds. This high speed of image production requires fast source extraction. Unfortunately, current source extraction tools cannot meet the speed requirement. To address this problem, we propose to use the GPU (Graphics Processing Unit) to accelerate source extraction. Specifically, we start from SExtractor, an astronomical source extraction tool widely used in astronomy projects, and study its parallelization on the GPU. We identify the object detection and deblending components as the most complex and time-consuming, and design a parallel connected component labelling algorithm for detection and a parallel object tree pruning method for deblending on the GPU. We further parallelize the other components, including cleaning, background subtraction, and measurement, effectively on the GPU, such that the entire source extraction is done on the GPU. We have evaluated our GPU-SExtractor against the original SExtractor on a desktop with an Intel i7 CPU and an NVIDIA GTX670 GPU, on a set of real-world and synthetic astronomical images of different sizes. Our results show that GPU-SExtractor outperforms the original SExtractor by a factor of 6, taking merely 1.9 seconds to process a typical 4K×4K image containing 167 thousand objects.
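The detection stage's connected component labelling maps naturally to one thread per pixel. Below is a CPU sketch of the min-label propagation idea that such GPU kernels implement, illustrative of the technique rather than the authors' kernel:

```python
import numpy as np

def label_components(mask):
    """Connected-component labelling by iterative min-label propagation:
    every object pixel repeatedly adopts the smallest label among itself
    and its 4-neighbours until the labelling stabilizes. A GPU version
    runs the per-pixel update with one thread per pixel."""
    h, w = mask.shape
    # seed each object pixel with its own index; background is -1
    labels = np.where(mask, np.arange(h * w).reshape(h, w), -1)
    changed = True
    while changed:
        changed = False
        padded = np.pad(labels, 1, constant_values=-1)
        # views of the up, down, left and right neighbours of every pixel
        for dy, dx in ((0, 1), (2, 1), (1, 0), (1, 2)):
            nb = padded[dy:dy + h, dx:dx + w]
            better = mask & (nb >= 0) & (nb < labels)
            if better.any():
                labels[better] = nb[better]
                changed = True
    return labels
```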
{"title":"Parallelizing Astronomical Source Extraction on the GPU","authors":"B. Zhao, Qiong Luo, Chao Wu","doi":"10.1109/eScience.2013.10","DOIUrl":"https://doi.org/10.1109/eScience.2013.10","url":null,"abstract":"In astronomical observatory projects, raw images are processed so that information about the celestial objects in the images is extracted into catalogs. As such, this source extraction is the basis for the various analysis tasks that are subsequently performed on the catalog products. With the rapid progress of new, large astronomical projects, observational images will be produced every few seconds. This high speed of image production requires fast source extraction. Unfortunately, current source extraction tools cannot meet the speed requirement. To address this problem, we propose to use the GPU (Graphics Processing Unit) to accelerate source extraction. Specifically, we start from SExtractor, an astronomical source extraction tool widely used in astronomy projects, and study its parallelization on the GPU. We identify the object detection and deblending components as the most complex and time-consuming, and design a parallel connected component labelling algorithm for detection and a parallel object tree pruning method for deblending respectively on the GPU. We further parallelize other components, including cleaning, background subtraction, and measurement, effectively on the GPU, such that the entire source extraction is done on the GPU. We have evaluated our GPU-SExtractor in comparison with the original SExtractor on a desktop with an Intel i7 CPU and an NVIDIA GTX670 GPU on a set of real-world and synthetic astronomical images of different sizes. Our results show that the GPU-SExtractor outperforms the original SExtractor by a factor of 6, taking a merely 1.9 second to process a typical 4KX4K image containing 167 thousands objects.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117062419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dependency Provenance in Agent Based Modeling
Pub Date : 2013-10-22 | DOI: 10.1109/eScience.2013.39
Peng Chen, Beth Plale, Tom Evans
Researchers who use agent-based models (ABM) to model social patterns often focus on the model's aggregate phenomena. However, aggregation of individuals complicates the understanding of agent interactions and the uniqueness of individuals. We develop a method for tracing and capturing the provenance of individuals and their interactions in NetLogo ABMs, and from this create a "dependency provenance slice", which combines a data slice and a program slice to yield insights into the cause-effect relations among system behaviors. To cope with the large volume of fine-grained provenance traces, we propose use-inspired filters to reduce the amount of provenance, and a provenance slicing technique called "non-preprocessing provenance slicing" that queries provenance traces directly, without recovering all provenance entities and dependencies beforehand. We evaluate performance and utility using the well-known ecological NetLogo model "wolf-sheep-predation".
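A dependency provenance slice can be pictured as a backward closure over recorded derivation edges. A minimal sketch, with the graph held as a plain dict; this is an assumed structure for illustration, not the paper's trace format:

```python
from collections import deque

def dependency_slice(dependencies, targets):
    """Backward slice over a provenance graph: collect everything the
    target entities (e.g. an agent's final state) transitively depend
    on. `dependencies` maps each node to the nodes it was derived from."""
    seen, work = set(), deque(targets)
    while work:
        node = work.popleft()
        if node in seen:
            continue
        seen.add(node)
        work.extend(dependencies.get(node, ()))
    return seen
```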
{"title":"Dependency Provenance in Agent Based Modeling","authors":"Peng Chen, Beth Plale, Tom Evans","doi":"10.1109/eScience.2013.39","DOIUrl":"https://doi.org/10.1109/eScience.2013.39","url":null,"abstract":"Researchers who use agent-based models (ABM) to model social patterns often focus on the model's aggregate phenomena. However, aggregation of individuals complicates the understanding of agent interactions and the uniqueness of individuals. We develop a method for tracing and capturing the provenance of individuals and their interactions in the Net Logo ABM, and from this create a \"dependency provenance slice\", which combines a data slice and a program slice to yield insights into the cause-effect relations among system behaviors. To cope with the large volume of fine-grained provenance traces, we propose use-inspired filters to reduce the amount of provenance, and a provenance slicing technique called \"non-preprocessing provenance slicing\" that directly queries over provenance traces without recovering all provenance entities and dependencies beforehand. We evaluate performance and utility using a well known ecological Net Logo model called \"wolf-sheep-predation\".","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124190557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Balanced Task Clustering in Scientific Workflows
Pub Date : 2013-10-22 | DOI: 10.1109/eScience.2013.40
Weiwei Chen, Rafael Ferreira da Silva, E. Deelman, R. Sakellariou
Scientific workflows can be composed of many tasks of fine computational granularity. The runtime of these tasks may be shorter than the duration of system overheads, for example when using multiple resources of a cloud infrastructure. Task clustering is a runtime optimization technique that merges multiple short tasks into a single job, such that scheduling overhead is reduced and overall runtime performance is improved. However, existing task clustering strategies only provide a coarse-grained approach that relies on an over-simplified workflow model. In our work, we examine the causes of Runtime Imbalance and Dependency Imbalance in task clustering. Next, we propose quantitative metrics to evaluate the severity of each of these two imbalance problems. Furthermore, we propose a series of task balancing methods to address them. Finally, we analyze the relationship between these metrics and the performance of the task balancing methods. A trace-based simulation shows that our methods can significantly improve the runtime performance of two widely used workflows compared to the existing implementation of task clustering.
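One simple way to balance runtime within clustered jobs, sketched below, is greedy longest-processing-time grouping: each remaining task joins the currently lightest cluster. This only illustrates the flavor of runtime balancing; the paper's methods also account for data dependencies between tasks:

```python
import heapq

def balanced_clusters(task_runtimes, n_clusters):
    """Greedy longest-processing-time grouping: sort tasks by runtime,
    then repeatedly assign the longest remaining task to the cluster
    with the smallest total runtime, so the merged jobs finish at
    roughly the same time."""
    heap = [(0.0, i, []) for i in range(n_clusters)]  # (load, id, tasks)
    heapq.heapify(heap)
    for runtime in sorted(task_runtimes, reverse=True):
        load, i, members = heapq.heappop(heap)
        members.append(runtime)
        heapq.heappush(heap, (load + runtime, i, members))
    return [members for _, _, members in heap]
```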
{"title":"Balanced Task Clustering in Scientific Workflows","authors":"Weiwei Chen, Rafael Ferreira da Silva, E. Deelman, R. Sakellariou","doi":"10.1109/eScience.2013.40","DOIUrl":"https://doi.org/10.1109/eScience.2013.40","url":null,"abstract":"Scientific workflows can be composed of many fine computational granularity tasks. The runtime of these tasks may be shorter than the duration of system overheads, for example, when using multiple resources of a cloud infrastructure. Task clustering is a runtime optimization technique that merges multiple short tasks into a single job such that the scheduling overhead is reduced and the overall runtime performance is improved. However, existing task clustering strategies only provide a coarse-grained approach that relies on an over-simplified workflow model. In our work, we examine the reasons that cause Runtime Imbalance and Dependency Imbalance in task clustering. Next, we propose quantitative metrics to evaluate the severity of the two imbalance problems respectively. Furthermore, we propose a series of task balancing methods to address these imbalance problems. Finally, we analyze their relationship with the performance of these task balancing methods. A trace-based simulation shows our methods can significantly improve the runtime performance of two widely used workflows compared to the actual implementation of task clustering.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123301916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}