EDISON Data Science Framework (EDSF) Extension to Address Transversal Skills Required by Emerging Industry 4.0 Transformation
Y. Demchenko, T. Wiktorski, J. Cuadrado-Gallego, Steve Brewer
2019 15th International Conference on eScience (eScience)
Pub Date: 2019-09-01  DOI: 10.1109/eScience.2019.00076
The emerging data-driven economy (also referred to as Industry 4.0 or simply 4IR), encompassing industry, research and business, requires new types of specialists able to support all stages of the data lifecycle, from data production and input to data processing, actionable results delivery, visualisation and reporting; these can collectively be defined as the Data Science family of professions. Data Science as a research and academic discipline provides a basis for Data Analytics and ML/AI applications. The education and training of the data-related professions must reflect the full multi-disciplinary knowledge and competences required of Data Science and data handling practitioners in modern, data-driven research and the digital economy. In an era of ever faster technology change, matched by strong demand for skills, Data Science education and training programmes should be customizable and deliverable in multiple forms, tailored to different categories of professional roles and profiles. Building on the authors' other publications on customizable and interoperable Data Science curricula for different types of learners and target application domains, this paper focuses on defining the set of transversal competences and skills required of current and future Data Science professions. These include workplace and professional skills covering the critical thinking, problem solving, and creativity needed to work in highly automated and dynamic environments. The proposed approach is based on the EDISON Data Science Framework (EDSF), initially developed within the EU-funded EDISON project and currently being further developed in the EU-funded MATES and FAIRsFAIR projects.
Support for HTCondor High-Throughput Computing Workflows in the REANA Reusable Analysis Platform
Rokas Maciulaitis, T. Simko, P. Brenner, Scott S. Hampton, M. Hildreth, K. H. Anampa, Irena Johnson, Cody Kankel, Jan Okraska, D. Rodríguez
2019 15th International Conference on eScience (eScience)
Pub Date: 2019-09-01  DOI: 10.1109/eScience.2019.00091
REANA is a reusable and reproducible data analysis platform that allows researchers to structure their analysis pipelines and run them on remote containerised compute clouds. REANA supports several workflow systems (CWL, Serial, Yadage) and uses Kubernetes as its job execution backend. We have designed an abstract job execution component that extends the REANA platform's job execution capabilities to support multiple compute backends. We have tested the abstract job execution component with HTCondor and verified the scalability of the designed solution. The results show that the REANA platform would be able to support hybrid scientific workflows in which different parts of an analysis pipeline are executed on different computing backends.
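The abstract job execution component described above can be pictured as a small backend interface that workflow steps program against. The sketch below is purely illustrative (the class and method names are invented, not REANA's actual API), with in-memory stand-ins for Kubernetes and HTCondor submission:

```python
from abc import ABC, abstractmethod

class JobBackend(ABC):
    """Hypothetical abstract compute backend; names are illustrative only."""

    @abstractmethod
    def submit(self, image: str, command: str) -> str:
        """Submit a containerised job; return a backend-specific job id."""

    @abstractmethod
    def status(self, job_id: str) -> str:
        """Return 'running', 'finished', or 'failed'."""

class KubernetesBackend(JobBackend):
    """Toy stand-in for submission to a Kubernetes cluster."""
    def __init__(self):
        self.jobs = {}
    def submit(self, image, command):
        job_id = f"k8s-{len(self.jobs)}"
        self.jobs[job_id] = (image, command)
        return job_id
    def status(self, job_id):
        return "finished" if job_id in self.jobs else "failed"

class HTCondorBackend(JobBackend):
    """Toy stand-in for submission to an HTCondor pool."""
    def __init__(self):
        self.jobs = {}
    def submit(self, image, command):
        job_id = f"condor-{len(self.jobs)}"
        self.jobs[job_id] = (image, command)
        return job_id
    def status(self, job_id):
        return "finished" if job_id in self.jobs else "failed"

def run_step(backend: JobBackend, image: str, command: str) -> str:
    """A workflow step is backend-agnostic: it sees only the interface,
    so different pipeline stages can target different backends."""
    job_id = backend.submit(image, command)
    return backend.status(job_id)
```

The point of the design is that `run_step` never mentions a concrete backend, which is what makes hybrid workflows possible.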
SATVAM: Toward an IoT Cyber-Infrastructure for Low-Cost Urban Air Quality Monitoring
Yogesh L. Simmhan, M. Hegde, Rajesh Zele, S. Tripathi, S. Nair, S. Monga, R. Sahu, Kuldeep Dixit, R. Sutaria, Brijesh Mishra, Anamika Sharma, A. Svr
2019 15th International Conference on eScience (eScience)
Pub Date: 2019-09-01  DOI: 10.1109/eScience.2019.00014
Air pollution is a public health emergency in large cities. The availability of commodity sensors and the advent of the Internet of Things (IoT) enable the deployment of city-wide networks of thousands of low-cost, real-time air quality monitors to help manage this challenge. Such networks need to be supported by an IoT cyber-infrastructure for reliable and scalable data acquisition from the edge to the Cloud. The low accuracy of these sensors also motivates data-driven calibration models that can accurately predict the science variables from the raw sensor signals. Here, we report our experiences designing and deploying such an IoT software platform and calibration models, and validate them through a pilot field deployment in two mega-cities, Delhi and Mumbai. Our edge data service is able to even out the differential bandwidths from the sensing devices to the Cloud repository and to recover from transient failures. Our analytical models reduce the sensor errors from a best case of 63% using the factory baseline to as low as 21%, substantially advancing the state of the art in this domain.
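The calibration idea, learning a mapping from raw sensor signals to reference-grade measurements, can be illustrated in its simplest form by an ordinary least-squares fit. The data below are synthetic and the single-variable linear model is far simpler than the paper's actual calibration models:

```python
def fit_linear_calibration(raw, reference):
    """Fit y = a*x + b by ordinary least squares, in closed form.
    'raw' holds low-cost sensor readings; 'reference' holds co-located
    reference-grade measurements. All values here are made up."""
    n = len(raw)
    mean_x = sum(raw) / n
    mean_y = sum(reference) / n
    sxx = sum((x - mean_x) ** 2 for x in raw)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(raw, reference))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def calibrate(raw, a, b):
    """Apply the fitted calibration to new raw readings."""
    return [a * x + b for x in raw]

# Synthetic example: the sensor reads half the true value minus an offset,
# so the true relation is reference = 2*raw + 5.
raw = [10, 20, 30, 40]
ref = [25, 45, 65, 85]
a, b = fit_linear_calibration(raw, ref)
```

Real deployments would use multivariate models (temperature and humidity covariates, for instance), but the fit/apply split is the same.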
Modeling and Matching Digital Data Marketplace Policies
Sara Shakeri, Valentina Maccatrozzo, L. Veen, R. Bakhshi, L. Gommans, C. D. Laat, P. Grosso
2019 15th International Conference on eScience (eScience)
Pub Date: 2019-09-01  DOI: 10.1109/eScience.2019.00078
Digital Data Marketplaces (DDMs) have recently gained wide attention as platforms for sharing data among different organizations, because sharing information and participating in research collaborations play an important role in addressing many scientific challenges. To increase trust among participating organizations, contracts and agreements must be established that determine regulations and policies about who has access to what. Describing these agreements in a general model applicable across different DDMs is of utmost importance. In this paper, we present a semantic model for describing access policies by means of semantic web technologies. In particular, we use and extend the Open Digital Rights Language (ODRL) to describe the pre-established agreements in a DDM.
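For flavour, a pre-established agreement of the kind being modeled can be written as an ODRL JSON-LD policy. The sketch below follows standard ODRL vocabulary (an Agreement with a permission carrying target, action, assigner and assignee), but the example URIs and the naive matching function are illustrative only, not the paper's extended model:

```python
# A minimal ODRL-style agreement expressed as JSON-LD, held as a Python
# dict, plus a naive check of whether a party may act on an asset.
agreement = {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@type": "Agreement",
    "uid": "http://example.com/policy:01",
    "permission": [{
        "target": "http://example.com/asset:dataset-A",
        "action": "read",
        "assigner": "http://example.com/party:provider",
        "assignee": "http://example.com/party:consumer",
    }],
}

def is_permitted(policy, assignee, action, target):
    """Return True if some permission in the policy grants the triple
    (assignee, action, target). Constraints and prohibitions, which a
    real ODRL evaluator must handle, are ignored in this sketch."""
    return any(
        p["assignee"] == assignee
        and p["action"] == action
        and p["target"] == target
        for p in policy.get("permission", [])
    )
```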
Towards a Computer-Interpretable Actionable Formal Model to Encode Data Governance Rules
Rui Zhao, M. Atkinson
2019 15th International Conference on eScience (eScience)
Pub Date: 2019-09-01  DOI: 10.1109/eScience.2019.00082
Driven by the needs of science and business, data sharing and re-use have become intensive activities across many areas. In many cases, governance imposes rules concerning data use, but no existing computational technique helps data users comply with such rules. We argue that intelligent systems can improve the situation by recording provenance during processing, encoding the rules, and performing reasoning. We present our initial work designing formal models for data rules, flow rules, and the accompanying reasoning system, as a first step towards helping data providers and data users sustain productive relationships.
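A minimal picture of reasoning over recorded provenance: tag data items with governance labels, record flow edges during processing, and report any flow that reaches a forbidden sink. The encoding below is a toy stand-in for illustration, not the paper's formal model:

```python
def violations(prov_edges, flow_rules, tags):
    """prov_edges: (source, sink) pairs recorded during processing.
    tags: data item -> set of governance tags attached to it.
    flow_rules: tag -> set of sinks that data with that tag must not reach.
    Returns the edges that break a rule, with the offending tag."""
    bad = []
    for src, sink in prov_edges:
        for tag in tags.get(src, set()):
            if sink in flow_rules.get(tag, set()):
                bad.append((src, sink, tag))
    return bad

# Invented example: personal data must not flow to public storage.
edges = [("patient-records", "public-bucket"),
         ("climate-data", "public-bucket")]
tags = {"patient-records": {"personal"}}
rules = {"personal": {"public-bucket"}}
```

A real system would derive tags transitively through the provenance graph (outputs inherit obligations from inputs); this sketch checks only direct edges.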
Dynamic Sizing of Continuously Divisible Jobs for Heterogeneous Resources
Nicholas L. Hazekamp, Benjamín Tovar, D. Thain
2019 15th International Conference on eScience (eScience)
Pub Date: 2019-09-01  DOI: 10.1109/eScience.2019.00026
Many scientific applications operate on large datasets that can be partitioned and operated on concurrently. Existing approaches to concurrent execution generally rely on statically partitioned data. Static partitioning can lock performance into a sub-optimal configuration, leading to higher execution times and an inability to respond to dynamic resources. We present the Continuously Divisible Job abstraction, which allows statically defined applications to have their component tasks dynamically sized in response to system behavior. The abstraction defines a simple interface that dictates how work can be recursively divided, executed, and merged. Implementing this abstraction allows scientific applications to leverage dynamic job coordinators for execution. We also propose the Virtual File abstraction, which allows read-only subsets of large files to be treated as separate files. In exploring the Continuously Divisible Job abstraction, we implemented two applications against its interface: a bioinformatics application and a high-energy physics event analysis. These were tested using an abstract job interface and several job coordinators. Comparing them against a previous statically partitioned implementation, we show comparable or better performance without static decisions or complex dynamic application handling.
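The divide/execute/merge interface can be sketched as follows. The method names and the squaring workload are invented for illustration; a real coordinator would dispatch partitions to remote resources rather than iterate over them locally:

```python
class DivisibleJob:
    """Sketch of a divide/execute/merge interface in the spirit of the
    Continuously Divisible Job abstraction (not the paper's actual API)."""

    def __init__(self, items):
        self.items = items

    def divide(self, max_size):
        """Split the job into sub-jobs of at most max_size items each."""
        return [DivisibleJob(self.items[i:i + max_size])
                for i in range(0, len(self.items), max_size)]

    def execute(self):
        """Do the work on this partition (squaring as a stand-in)."""
        return [x * x for x in self.items]

    @staticmethod
    def merge(results):
        """Combine sub-job results back into one result."""
        out = []
        for r in results:
            out.extend(r)
        return out

def run(job, partition_size):
    """A coordinator chooses partition_size at run time, in response to
    observed system behavior, instead of fixing it when the workload is
    defined; the final result is the same for any partitioning."""
    return DivisibleJob.merge([p.execute() for p in job.divide(partition_size)])
```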
A Survey of Scalable Deep Learning Frameworks
Saba Amiri, Sara Salimzadeh, Adam Belloum
2019 15th International Conference on eScience (eScience)
Pub Date: 2019-09-01  DOI: 10.1109/eScience.2019.00102
Machine learning models have recently seen a large increase in usage across different disciplines. Their ability to learn complex concepts from data and perform sophisticated tasks, combined with their ability to leverage the vast computational infrastructures available today, has made them a very attractive choice for many challenges in academia and industry. In this context, deep learning, as a sub-class of machine learning, is becoming an especially important tool in modern computing applications and has been used successfully for a wide range of use cases, from medical applications to playing games. Because of the nature of these systems, and because a considerable portion of their use cases involves large volumes of data, training them is very time- and resource-consuming and requires vast amounts of computing cycles. It is therefore natural to scale deep learning applications so that they can run across distributed resources, achieving fast and manageable training times while maintaining a high level of accuracy. In recent years, a number of frameworks, rooted in both academia and industry, have been proposed to scale up ML algorithms. With most of them open source and supported by an increasingly large community of AI specialists and data scientists, their capabilities, performance, and compatibility with modern hardware have been honed and extended. As a result, it is not easy for a domain scientist to pick the tool or framework best suited to their needs. This survey provides an overview of the relevant, widely used scalable machine learning and deep learning frameworks currently available, and provides grounds on which researchers can compare and choose the best set of tools for their ML pipelines.
Timing is Everything: Identifying Diverse Interaction Dynamics in Scenario and Non-Scenario Meetings
Chreston A. Miller, Christa Miller
2019 15th International Conference on eScience (eScience)
Pub Date: 2019-09-01  DOI: 10.1109/eScience.2019.00029
In this paper we explore the use of temporal patterns to characterize the interaction dynamics of different kinds of meetings. Meetings occur on a daily basis and exhibit different behavioral dynamics between participants, such as floor shifts and intense dialog. These dynamics tell a story of the meeting and provide insight into how participants interact. We focus our investigation on defining diversity metrics for comparing the interaction dynamics of scenario and non-scenario meetings; such metrics can provide insight into the similarities and differences between the two. We observe that certain interaction dynamics can be identified through temporal patterns of speech intervals, i.e., when a participant is talking. We apply the principles of Parallel Episodes to identify moments of speech overlap, e.g., interaction "bursts", and introduce Situated Data Mining, an approach for identifying repeated behavior patterns based on situated context. Applying these algorithms provides an overview of certain meeting dynamics and defines metrics for meeting comparison and diversity of interaction. We tested on a subset of the AMI corpus and developed three diversity metrics that describe similarities and differences between meetings. These metrics also give the researcher an overview of interaction dynamics and present points of interest for analysis.
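Detecting speech-overlap "bursts" from per-speaker talk intervals reduces, in its simplest form, to a sweep over interval endpoints. This is a simplified stand-in for the Parallel Episodes idea, not the authors' implementation:

```python
def overlap_bursts(intervals):
    """Find spans where two or more participants talk at once, given
    (speaker, start, end) speech intervals in seconds."""
    events = []
    for spk, start, end in intervals:
        events.append((start, 1))    # a speaker starts talking
        events.append((end, -1))     # a speaker stops talking
    events.sort()
    active, burst_start, bursts = 0, None, []
    for t, delta in events:
        active += delta
        if active >= 2 and burst_start is None:
            burst_start = t          # overlap begins
        elif active < 2 and burst_start is not None:
            bursts.append((burst_start, t))   # overlap ends
            burst_start = None
    return bursts

# Invented example: B overlaps A, then C briefly overlaps B.
talk = [("A", 0.0, 5.0), ("B", 3.0, 8.0), ("C", 7.5, 9.0)]
```

Counting and measuring such bursts per meeting gives simple scalar features of the kind one could feed into diversity metrics.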
Understanding ML Driven HPC: Applications and Infrastructure
Geoffrey Fox, S. Jha
2019 15th International Conference on eScience (eScience)
Pub Date: 2019-09-01  DOI: 10.1109/eScience.2019.00054
We recently outlined the vision of "Learning Everywhere", which captures the possibility and impact of coupling learning methods with traditional HPC methods. A primary driver of such coupling is the promise that Machine Learning (ML) will deliver major performance improvements for traditional HPC simulations. Motivated by this potential, the "ML around HPC" class of integration is of particular significance. In a related follow-up paper, we provided an initial taxonomy for integrating learning around HPC methods. In this paper, part of the Learning Everywhere series, we discuss how learning methods and HPC simulations are being integrated to enhance the effective performance of computations. We describe several modes (substitution, assimilation, and control) in which learning methods integrate with HPC simulations, and give representative applications in each mode. We also discuss some open research questions that we hope will motivate and clear the ground for MLaroundHPC benchmarks.
Toward an Elastic Data Transfer Infrastructure
Joaquín Chung, Zhengchun Liu, R. Kettimuthu, Ian T. Foster
2019 15th International Conference on eScience (eScience)
Pub Date: 2019-09-01  DOI: 10.1109/eScience.2019.00036
Data transfer over wide area networks is an integral part of many science workflows, which must, for example, move data from scientific facilities to remote resources for analysis, sharing, and storage. Yet despite continued enhancements in data transfer infrastructure (DTI), our previous analyses of approximately 40 billion GridFTP command logs, collected over four years from the Globus transfer service, show that data transfer nodes (DTNs) are idle (i.e., performing no transfers) 94.3% of the time. On the other hand, we have also observed periods in which CPU resource scarcity negatively impacts DTN throughput. Motivated by the opportunity to optimize DTI performance, we present an elastic DTI architecture in which the pool of nodes allocated to DTN activities expands and shrinks over time, based on demand. Our results show that this elastic DTI can save up to ~95% of resources compared with a typical static DTN deployment, while the median slowdown incurred remains close to one in most of the evaluated scenarios.
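The expand-and-shrink behaviour can be caricatured by a simple autoscaling rule that sizes the DTN pool to current transfer demand. The thresholds below are invented for illustration and are not the paper's provisioning policy:

```python
def resize_pool(active_transfers, min_nodes=1, max_nodes=16, per_node=4):
    """Toy autoscaling rule for a DTN pool: aim for roughly `per_node`
    concurrent transfers per node, clamped to [min_nodes, max_nodes].
    All parameter values are illustrative."""
    needed = max(min_nodes, -(-active_transfers // per_node))  # ceiling division
    return min(max_nodes, needed)
```

An idle service thus shrinks to a single node (capturing most of the savings from the 94.3% idle time), while bursts of demand grow the pool up to the cap so throughput is not starved of CPU.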