Mengying Yang, Xinyu Liu, W. Kroeger, A. Sim, Kesheng Wu
This short paper reports our ongoing work to study and identify anomalous file transfers at the Linac Coherent Light Source (LCLS), a large scientific facility. We identify anomalies using statistical models extracted from recent observations of file transfer events. This data-driven approach could be applied in other use cases to identify unusual events. More specifically, we propose two identification strategies based on the different properties of the observed file transfers. Because these methods capture key aspects of two different segments of the data transfer pipeline, they are able to make accurate identifications for their respective workflow components. The current anomaly detection algorithms use only file sizes as the primary feature. We anticipate that integrating more information will improve the prediction accuracy. Additional work is planned to validate the identification algorithms on more data and in different use cases.
{"title":"Identifying Anomalous File Transfer Events in LCLS Workflow","authors":"Mengying Yang, Xinyu Liu, W. Kroeger, A. Sim, Kesheng Wu","doi":"10.1145/3217197.3217203","DOIUrl":"https://doi.org/10.1145/3217197.3217203","journal":{"name":"Proceedings of the 1st International Workshop on Autonomous Infrastructure for Science"},"publicationDate":"2018-06-11"}
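The abstract above states only that anomalies are identified from statistical models of recent file transfers, with file size as the primary feature; it does not give the models themselves. As a minimal sketch of that general idea, and not the authors' method, the following flags transfers whose size deviates strongly from a recent window using a robust z-score (median and median absolute deviation). The function names, the threshold of 3.5, and the sample sizes are all hypothetical.

```python
from statistics import median


def robust_z_scores(sizes):
    """Return a robust z-score for each file size, based on median and MAD."""
    med = median(sizes)
    mad = median(abs(s - med) for s in sizes)
    if mad == 0:
        # Every typical size is identical: anything off the median is extreme.
        return [0.0 if s == med else float("inf") for s in sizes]
    # 1.4826 rescales the MAD to match the standard deviation of normal data.
    return [(s - med) / (1.4826 * mad) for s in sizes]


def flag_anomalous_transfers(sizes, threshold=3.5):
    """Return indices of transfers whose size is far from the recent window."""
    scores = robust_z_scores(sizes)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Hypothetical recent file sizes in MB; the sixth transfer is truncated.
recent_sizes_mb = [1020, 1015, 1032, 998, 1007, 12, 1025, 1011]
anomalies = flag_anomalous_transfers(recent_sizes_mb)
print(anomalies)
```

The median and MAD are used here instead of the mean and standard deviation so that a single extreme transfer does not distort the baseline it is compared against.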
Xi Yang, T. Lehman, R. Kettimuthu, L. Winkler, Eun-Sung Jung
Today's scientific computing applications and workflows operate on heterogeneous and vastly distributed infrastructures. The traditional human-in-the-loop service engineering approach faces insurmountable challenges in dealing with these very complex and diverse networked systems, including conventional and software-defined networks, compute, storage, clouds, and instruments. Orchestration is the key to integrating and coordinating networked multi-services and automating end-to-end workflows. In this work, we present a model-driven intelligent orchestration approach to this end-to-end automation, built upon a semantic modeling solution that supports the full stack of service integration, orchestration, abstraction, and intent and policy representation. We also present the design of a real-world orchestrator called StackV that implements this approach and is able to accommodate highly complex application scenarios such as Software Defined ScienceDMZ (SD-SDMZ) and Hybrid Cloud Inter-Networking (HCIN).
{"title":"A Model Driven Intelligent Orchestration Approach to Service Automation in Large Distributed Infrastructures","authors":"Xi Yang, T. Lehman, R. Kettimuthu, L. Winkler, Eun-Sung Jung","doi":"10.1145/3217197.3217207","DOIUrl":"https://doi.org/10.1145/3217197.3217207","journal":{"name":"Proceedings of the 1st International Workshop on Autonomous Infrastructure for Science"},"publicationDate":"2018-06-11"}
Emma Stahl, A. Yabo, Olivier Richard, B. Bzeznik, B. Robu, É. Rutten
HPC systems face increasing variability in their behavior, for example in performance and power consumption, and because they are less predictable they require more runtime management. This can be done in an Autonomic Management feedback loop that monitors the system, analyzes the collected data, and uses the results to activate appropriate system-level or application-level feedback mechanisms (e.g., informing schedulers, down-clocking CPUs). One such problem is found in the context of CiGri, a simple, lightweight, scalable, and fault-tolerant grid system that exploits the unused resources of a set of computing clusters. Computing power left over by the scheduling of the main HPC applications is used to execute smaller jobs, which are injected to the extent the global system allows. This paper presents first results addressing the problem of automated resource management in an HPC infrastructure, using techniques from Control Theory to design a controller that maximizes cluster utilization while avoiding overload. We put in place a Proportional-Integral (PI) feedback control mechanism that sets the maximum number of jobs to be sent to the cluster in response to system information about the current number of jobs processed.
{"title":"Towards a control-theory approach for minimizing unused grid resources","authors":"Emma Stahl, A. Yabo, Olivier Richard, B. Bzeznik, B. Robu, É. Rutten","doi":"10.1145/3217197.3217201","DOIUrl":"https://doi.org/10.1145/3217197.3217201","journal":{"name":"Proceedings of the 1st International Workshop on Autonomous Infrastructure for Science"},"publicationDate":"2018-06-11"}
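The PI feedback loop described in this abstract can be illustrated with a small discrete-time sketch. This is not the CiGri implementation: the gains, the target of 50 jobs, the output limits, and the toy cluster model (the cluster simply holds as many jobs as the previous cap allowed) are all assumptions chosen to keep the example self-contained.

```python
class PIController:
    """Discrete-time Proportional-Integral controller with a clamped output."""

    def __init__(self, kp, ki, setpoint, min_out=0, max_out=100):
        self.kp = kp              # proportional gain (hypothetical value)
        self.ki = ki              # integral gain (hypothetical value)
        self.setpoint = setpoint  # desired number of jobs in the cluster
        self.min_out = min_out
        self.max_out = max_out
        self.integral = 0.0       # accumulated error over all past steps

    def update(self, measured):
        """Return the maximum number of jobs to send, given the measured load."""
        error = self.setpoint - measured
        self.integral += error
        raw = self.kp * error + self.ki * self.integral
        # Clamp to the allowed range and truncate to a whole number of jobs.
        return int(max(self.min_out, min(self.max_out, raw)))


def simulate(steps=20):
    """Run the controller against a toy cluster model and record each cap."""
    controller = PIController(kp=0.2, ki=0.5, setpoint=50)
    running_jobs = 0
    caps = []
    for _ in range(steps):
        cap = controller.update(running_jobs)
        caps.append(cap)
        # Toy plant: the cluster ends up holding exactly the allowed jobs.
        running_jobs = cap
    return caps


caps = simulate()
print(caps)
```

The integral term is what removes the steady-state gap: once the measured load equals the target, the error is zero and the accumulated integral alone keeps the cap at the target.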
Ryan Chard, Rafael Vescovi, Ming Du, Hanyu Li, K. Chard, S. Tuecke, N. Kasthuri, Ian T Foster
Exponential increases in data volumes and velocities are overwhelming finite human capabilities. Continued progress in science and engineering demands that we automate a broad spectrum of currently manual research data manipulation tasks, from data transfer and sharing to acquisition, publication, and analysis. These needs are particularly evident in large-scale experimental science, in which researchers are typically granted short periods of instrument time and must maximize experiment efficiency as well as output data quality and accuracy. To address the need for automation, which is pervasive across science and engineering, we present our experiences using trigger-action programming to automate a real-world scientific workflow. We evaluate our methods by applying them to a neuroanatomy application in which a synchrotron is used to image cm-scale mouse brains with sub-micrometer resolution. In this use case, data are acquired in real time at the synchrotron and automatically passed through a complex automation flow that involves reconstruction using HPC resources, human-in-the-loop coordination, and finally data publication and visualization. We describe the lessons learned from these experiences and outline the design for a new research automation platform.
{"title":"High-Throughput Neuroanatomy and Trigger-Action Programming: A Case Study in Research Automation","authors":"Ryan Chard, Rafael Vescovi, Ming Du, Hanyu Li, K. Chard, S. Tuecke, N. Kasthuri, Ian T Foster","doi":"10.1145/3217197.3217206","DOIUrl":"https://doi.org/10.1145/3217197.3217206","journal":{"name":"Proceedings of the 1st International Workshop on Autonomous Infrastructure for Science"},"publicationDate":"2018-06-11"}
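Trigger-action programming, as used in this abstract, pairs a condition on an incoming event with an action to run when the condition holds. The sketch below shows that pattern in its simplest form. It is not the platform described in the paper: the `Rule` type, the event fields, the rule names, and the file name are hypothetical, and the actions only record what they would do.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """A trigger-action rule: run `action` on events for which `trigger` holds."""
    name: str
    trigger: Callable[[dict], bool]
    action: Callable[[dict], None]


def dispatch(event, rules):
    """Apply every matching rule to the event and return the names that fired."""
    fired = []
    for rule in rules:
        if rule.trigger(event):
            rule.action(event)
            fired.append(rule.name)
    return fired


log = []  # stands in for real side effects such as job submission

rules = [
    Rule(
        name="reconstruct",
        trigger=lambda e: e["type"] == "file_created" and e["path"].endswith(".h5"),
        action=lambda e: log.append(f"submit reconstruction for {e['path']}"),
    ),
    Rule(
        name="publish",
        trigger=lambda e: e["type"] == "reconstruction_done",
        action=lambda e: log.append(f"publish {e['path']}"),
    ),
]

fired = dispatch({"type": "file_created", "path": "scan_001.h5"}, rules)
print(fired, log)
```

Chaining steps such as reconstruction and publication then amounts to having one rule's action emit the event that another rule's trigger is waiting for.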
{"title":"Proceedings of the 1st International Workshop on Autonomous Infrastructure for Science","authors":"","doi":"10.1145/3217197","DOIUrl":"https://doi.org/10.1145/3217197","url":null,"abstract":"","PeriodicalId":118966,"journal":{"name":"Proceedings of the 1st International Workshop on Autonomous Infrastructure for Science","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133057011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pranjal Sahu, Dantong Yu, K. Yager, Mallesham Dasari, Hong Qin
The structures of many material systems evolve as they undergo physical processing. For instance, organic and inorganic crystalline materials frequently coarsen over time as they are thermally treated, with domains (grains) rotating and growing in size. When a material system undergoing such a structural transformation is probed using x-ray scattering beams, the peaks in the scattering images sharpen and intensify, and the scattering rings become increasingly 'textured'. Accurately identifying the transition frame in advance brings multiple benefits to NSLS-II in-operando experiments on material systems, such as minimal beamline damage to samples, reduced energy costs, and optimal sampling of material properties. In this paper, we formulate the prediction and identification of the structural transition event as a classification problem and apply a novel LSTM model to identify sequences containing a transition event. The preliminary experimental results are encouraging and confirm the viability of detecting and predicting the transition in advance. Our ultimate goal is to deploy such a prediction system in a real-world environment at a selected NSLS-II beamline to improve the efficiency of the experimental facility.
{"title":"In-Operando Tracking and Prediction of Transition in Material System using LSTM","authors":"Pranjal Sahu, Dantong Yu, K. Yager, Mallesham Dasari, Hong Qin","doi":"10.1145/3217197.3217204","DOIUrl":"https://doi.org/10.1145/3217197.3217204","journal":{"name":"Proceedings of the 1st International Workshop on Autonomous Infrastructure for Science"},"publicationDate":"2018-06-11"}
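The abstract frames transition detection as classifying sequences of frames by whether they contain the transition event. One plausible reading of that framing, shown here only as an illustration, is to slice a per-frame feature series into fixed-length windows and label each window accordingly; a sequence model such as an LSTM would then be trained on these pairs. The window length, the scalar per-frame feature, and the sample values are assumptions, and the LSTM itself is not shown.

```python
def make_windows(frame_features, transition_frame, window=3):
    """Slice a frame series into windows labelled 1 if they contain the transition.

    frame_features: one feature value per frame (hypothetical scalar feature).
    transition_frame: index of the frame where the transition occurs.
    """
    samples = []
    for start in range(len(frame_features) - window + 1):
        end = start + window  # one past the last frame in this window
        label = 1 if start <= transition_frame < end else 0
        samples.append((frame_features[start:end], label))
    return samples


# Hypothetical peak-sharpness values; the jump at index 4 marks the transition.
features = [0.1, 0.1, 0.2, 0.2, 0.9, 1.0, 1.1]
samples = make_windows(features, transition_frame=4)
labels = [label for _, label in samples]
print(labels)
```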
Qiang Liu, N. Rao, S. Sen, B. Settlemyer, Hsing-bung Chen, J. Boley, R. Kettimuthu, D. Katramatos
Recent developments in software-defined infrastructures promise that scientific workflows utilizing supercomputers, instruments, and storage systems will be dynamically composed and orchestrated using software at unprecedented speed and scale in the near future. Testing of the underlying networking software, particularly during initial exploratory stages, remains a challenge due to potential disruptions and the resource allocation and coordination needed over the multi-domain physical infrastructure. To overcome these challenges, we develop the Virtual Science Network Environment (VSNE), which emulates the multi-site host, storage, and network infrastructure using Virtual Machines (VMs) in which both production and nascent software can be tested. Within each VM, which represents a site, the hosts and local-area networks are emulated using Mininet, and the Software-Defined Network (SDN) controllers and service daemon codes run natively to support dynamic provisioning of network connections. Additionally, Lustre filesystem support at the sites and an emulation of the long-haul network using Mininet are provided using separate VMs. As case studies, we describe Lustre file transfers using XDD, a Red5 streaming service demonstration, and an emulated experiment with remote monitoring and steering modules, all supported over dynamically configured connections using SDN controllers.
{"title":"Virtual Environment for Testing Software-Defined Networking Solutions for Scientific Workflows","authors":"Qiang Liu, N. Rao, S. Sen, B. Settlemyer, Hsing-bung Chen, J. Boley, R. Kettimuthu, D. Katramatos","doi":"10.1145/3217197.3217202","DOIUrl":"https://doi.org/10.1145/3217197.3217202","journal":{"name":"Proceedings of the 1st International Workshop on Autonomous Infrastructure for Science"},"publicationDate":"2018-06-11"}
R. Kettimuthu, Zhengchun Liu, Ian T Foster, P. Beckman, A. Sim, Kesheng Wu, W. Liao, Qiao Kang, Ankit Agrawal, A. Choudhary
Scientific computing systems are becoming increasingly complex and indeed are close to reaching a critical limit in manageability when using current human-in-the-loop techniques. To address this problem, autonomic, goal-driven management actions based on machine learning must be applied end to end across the scientific computing landscape. Even though researchers proposed architectures and design choices for autonomic computing systems more than a decade ago, practical realization of such systems has been limited, especially in scientific computing environments. Growing interest and recent developments in machine learning have spurred proposals to apply machine learning for goal-based optimization of computing systems in an autonomous fashion. We review recent work that uses machine learning algorithms to improve computer system performance and identify gaps and open issues. We propose a hierarchical architecture that builds on the earlier proposals for autonomic computing systems to realize an autonomous science infrastructure.
{"title":"Towards Autonomic Science Infrastructure: Architecture, Limitations, and Open Issues","authors":"R. Kettimuthu, Zhengchun Liu, Ian T Foster, P. Beckman, A. Sim, Kesheng Wu, W. Liao, Qiao Kang, Ankit Agrawal, A. Choudhary","doi":"10.1145/3217197.3217205","DOIUrl":"https://doi.org/10.1145/3217197.3217205","journal":{"name":"Proceedings of the 1st International Workshop on Autonomous Infrastructure for Science"},"publicationDate":"2018-06-11"}